date:20180514

Re: Kernel panic on kernel-3.10.0-693.21.1.el7 in ndisc.h

2018-05-14 Thread Michal Kubecek

On Mon, May 14, 2018 at 02:06:11PM -0700, Stephen Hemminger wrote:
> On Mon, 14 May 2018 21:29:03 +0300 Roman Makhov  
> wrote:
> > Thank you for the answer.
> > Unfortunately CentOS goes with these dinosaurs.
> > So we will try to debug the problem in the current one and try to
> > reproduce on the latest kernel.
> 
> If you are stuck in old kernels, please bug the CentOs maintainers not
> upstream developers.

Actually, it's not even the dinosaur. Kernels from RHEL (and therefore
CentOS) or SLES differ from the original mainline version they are based
on quite a lot. It's possible that this bug didn't really exist in the
old 3.10 and is related to one of the backports added to CentOS (or
rather RHEL in this case) kernel. People outside of the distribution
have little idea what was backported and why so unless the issue can be
reproduced with mainline kernel they cannot be expected to help.

Michal Kubecek

Re: STMMAC driver with TSO enabled issue

2018-05-14 Thread Bhadram Varka


Hi Jose,

On 5/10/2018 9:15 PM, Jose Abreu wrote:



On 10-05-2018 16:08, Bhadram Varka wrote:

Hi Jose,

On 5/10/2018 7:59 PM, Jose Abreu wrote:

Hi Bhadram,

On 10-05-2018 09:55, Jose Abreu wrote:

++net-dev

Hi Bhadram,

On 09-05-2018 12:03, Bhadram Varka wrote:

Hi,

Thanks for responding.

Tried below suggested way. Still observing the issue -

It seems stmmac has a bug in the RX side when using TSO which is
causing all the RX descriptors to be consumed. The stmmac_rx()
function will need to be refactored. I will send a fix ASAP.


Are you using this patch [1] ? Because there is a problem with
the patch. By adding the previously removed call to
stmmac_init_rx_desc() TSO works okay in my setup.



No. I don't have this change in my code base. I am using
net-next tree.

Can you please post the change for which TSO works ? I can help
you with the testing.


It should work with net-next because patch was not merged yet ...
Please send me the output of "dmesg | grep -i stmmac", since boot
and your full register values (from 0x0 to 0x12E4).



[root@alarm ~]# dmesg | grep -i dwc
[6.925005] dwc-eth-dwmac 249.ethernet: Cannot get CSR clock
[6.933657] dwc-eth-dwmac 249.ethernet: no reset control found
[6.955325] dwc-eth-dwmac 249.ethernet: User ID: 0x10, Synopsys 
ID: 0x41

[6.962379] dwc-eth-dwmac 249.ethernet:  DWMAC4/5
[6.967434] dwc-eth-dwmac 249.ethernet: DMA HW capability 
register supported
[6.974827] dwc-eth-dwmac 249.ethernet: RX Checksum Offload 
Engine supported
[6.982915] dwc-eth-dwmac 249.ethernet: TX Checksum insertion 
supported

[6.991235] dwc-eth-dwmac 249.ethernet: Wake-Up On Lan supported
[6.998974] dwc-eth-dwmac 249.ethernet: TSO supported
[7.006422] dwc-eth-dwmac 249.ethernet: TSO feature enabled
[7.012581] dwc-eth-dwmac 249.ethernet: Enable RX Mitigation via 
HW Watchdog Timer
[7.236391] dwc-eth-dwmac 249.ethernet eth0: device MAC address 
4a:d1:e3:58:cb:7a
[7.333414] dwc-eth-dwmac 249.ethernet eth0: IEEE 1588-2008 
Advanced Timestamp supported

[7.342441] dwc-eth-dwmac 249.ethernet eth0: registered PTP clock
[   10.157066] dwc-eth-dwmac 249.ethernet eth0: Link is Up - 
1Gbps/Full - flow control off

[root@alarm ~]# dmesg | grep -i stmma
[7.020567] libphy: stmmac: probed
[7.316295] Broadcom BCM89610 stmmac-0:00: attached PHY driver 
[Broadcom BCM89610] (mii_bus:phy_addr=stmmac-0:00, irq=64)


I will get the register details -

FYI - TSO works fine with single channel. I see the issue only if multi 
channel enabled (supports 4 Tx/Rx channels).


--
Thanks,
Bhadram.

Re: [PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases

2018-05-14 Thread Tobin C. Harding

On Tue, May 15, 2018 at 12:25:03AM -0600, David Ahern wrote:
> On 5/15/18 12:12 AM, Tobin C. Harding wrote:
> > Queue questions on how to run your selftests ;)
> 
> copy tools/testing/selftests/net/fib_tests.sh to test environment.
> 
> Run it with no options to run all tests or to run specific tests:
> 
> $ fib_tests.sh -t ipv4_rt
> $ fib_tests.sh -t ipv6_rt

I was joking but thanks!

Tobin

Re: [PATCH net-next 2/3] gso: limit udp gso to egress-only virtual devices

2018-05-14 Thread kbuild test robot

Hi Willem,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Willem-de-Bruijn/udp-gso-fixes/20180515-120246
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa 

All errors (new ones prefixed by >>):

   drivers/net//team/team.c: In function '__team_compute_features':
>> drivers/net//team/team.c:1030:10: error: 'NETIF_GSO_UDP_L4' undeclared 
>> (first use in this function); did you mean 'NETIF_F_GSO_UDP_L4'?
 NETIF_GSO_UDP_L4;
 ^~~~
 NETIF_F_GSO_UDP_L4
   drivers/net//team/team.c:1030:10: note: each undeclared identifier is 
reported only once for each function it appears in

vim +1030 drivers/net//team/team.c

   996  
   997  #define TEAM_VLAN_FEATURES (NETIF_F_HW_CSUM | NETIF_F_SG | \
   998  NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
   999  NETIF_F_HIGHDMA | NETIF_F_LRO)
  1000  
  1001  #define TEAM_ENC_FEATURES   (NETIF_F_HW_CSUM | NETIF_F_SG | \
  1002   NETIF_F_RXCSUM | NETIF_F_ALL_TSO)
  1003  
  1004  static void __team_compute_features(struct team *team)
  1005  {
  1006  struct team_port *port;
  1007  u32 vlan_features = TEAM_VLAN_FEATURES & NETIF_F_ALL_FOR_ALL;
  1008  netdev_features_t enc_features  = TEAM_ENC_FEATURES;
  1009  unsigned short max_hard_header_len = ETH_HLEN;
  1010  unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE |
  1011  IFF_XMIT_DST_RELEASE_PERM;
  1012  
  1013  list_for_each_entry(port, &team->port_list, list) {
  1014  vlan_features = netdev_increment_features(vlan_features,
  1015  port->dev->vlan_features,
  1016  TEAM_VLAN_FEATURES);
  1017  enc_features =
  1018  netdev_increment_features(enc_features,
  1019
port->dev->hw_enc_features,
  1020TEAM_ENC_FEATURES);
  1021  
  1022  
  1023  dst_release_flag &= port->dev->priv_flags;
  1024  if (port->dev->hard_header_len > max_hard_header_len)
  1025  max_hard_header_len = 
port->dev->hard_header_len;
  1026  }
  1027  
  1028  team->dev->vlan_features = vlan_features;
  1029  team->dev->hw_enc_features = enc_features | 
NETIF_F_GSO_ENCAP_ALL |
> 1030   NETIF_GSO_UDP_L4;
  1031  team->dev->hard_header_len = max_hard_header_len;
  1032  
  1033  team->dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
  1034  if (dst_release_flag == (IFF_XMIT_DST_RELEASE | 
IFF_XMIT_DST_RELEASE_PERM))
  1035  team->dev->priv_flags |= IFF_XMIT_DST_RELEASE;
  1036  }
  1037  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

Re: [PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases

2018-05-14 Thread David Ahern

On 5/15/18 12:12 AM, Tobin C. Harding wrote:
> Queue questions on how to run your selftests ;)

copy tools/testing/selftests/net/fib_tests.sh to test environment.

Run it with no options to run all tests or to run specific tests:

$ fib_tests.sh -t ipv4_rt
$ fib_tests.sh -t ipv6_rt

Re: possible deadlock in sk_diag_fill

2018-05-14 Thread Andrei Vagin

On Tue, May 15, 2018 at 07:19:39AM +0200, Dmitry Vyukov wrote:
> On Mon, May 14, 2018 at 8:00 PM, Andrei Vagin  wrote:
> >> >> Hello,
> >> >>
> >> >> syzbot found the following crash on:
> >> >>
> >> >> HEAD commit:c1c07416cdd4 Merge tag 'kbuild-fixes-v4.17' of 
> >> >> git://git.k..
> >> >> git tree:   upstream
> >> >> console output: https://syzkaller.appspot.com/x/log.txt?x=12164c9780
> >> >> kernel config:  
> >> >> https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27
> >> >> dashboard link: 
> >> >> https://syzkaller.appspot.com/bug?extid=c1872be62e587eae9669
> >> >> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> >> >> userspace arch: i386
> >> >>
> >> >> Unfortunately, I don't have any reproducer for this crash yet.
> >> >>
> >> >> IMPORTANT: if you fix the bug, please add the following tag to the 
> >> >> commit:
> >> >> Reported-by: syzbot+c1872be62e587eae9...@syzkaller.appspotmail.com
> >> >>
> >> >>
> >> >> ==
> >> >> WARNING: possible circular locking dependency detected
> >> >> 4.17.0-rc3+ #59 Not tainted
> >> >> --
> >> >> syz-executor1/25282 is trying to acquire lock:
> >> >> 4fddf743 (&(&u->lock)->rlock/1){+.+.}, at: sk_diag_dump_icons
> >> >> net/unix/diag.c:82 [inline]
> >> >> 4fddf743 (&(&u->lock)->rlock/1){+.+.}, at:
> >> >> sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
> >> >>
> >> >> but task is already holding lock:
> >> >> b6895645 (rlock-AF_UNIX){+.+.}, at: spin_lock
> >> >> include/linux/spinlock.h:310 [inline]
> >> >> b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_dump_icons
> >> >> net/unix/diag.c:64 [inline]
> >> >> b6895645 (rlock-AF_UNIX){+.+.}, at: 
> >> >> sk_diag_fill.isra.5+0x94e/0x10d0
> >> >> net/unix/diag.c:144
> >> >>
> >> >> which lock already depends on the new lock.
> >> >
> >> > In the code, we have a comment which explains why it is safe to take 
> >> > this lock
> >> >
> >> > /*
> >> >  * The state lock is outer for the same sk's
> >> >  * queue lock. With the other's queue locked it's
> >> >  * OK to lock the state.
> >> >  */
> >> > unix_state_lock_nested(req);
> >> >
> >> > It is a question how to explain this to lockdep.
> >>
> >> Do I understand it correctly that (&u->lock)->rlock associated with
> >> AF_UNIX is locked under rlock-AF_UNIX, and then rlock-AF_UNIX is
> >> locked under (&u->lock)->rlock associated with AF_NETLINK? If so, I
> >> think we need to split (&u->lock)->rlock by family too, so that we
> >> have u->lock-AF_UNIX and u->lock-AF_NETLINK.
> >
> > I think here is another problem. lockdep woried about
> > sk->sk_receive_queue vs unix_sk(s)->lock.
> >
> > sk_diag_dump_icons() takes sk->sk_receive_queue and then
> > unix_sk(s)->lock.
> >
> > unix_dgram_sendmsg takes unix_sk(sk)->lock and then sk->sk_receive_queue.
> >
> > sk_diag_dump_icons() takes locks for two different sockets, but
> > unix_dgram_sendmsg() takes locks for one socket.
> >
> > sk_diag_dump_icons
> > if (sk->sk_state == TCP_LISTEN) {
> > spin_lock(&sk->sk_receive_queue.lock);
> > skb_queue_walk(&sk->sk_receive_queue, skb) {
> > unix_state_lock_nested(req);
> > spin_lock_nested(&unix_sk(s)->lock,
> >
> >
> > unix_dgram_sendmsg
> > unix_state_lock(other)
> > spin_lock(&unix_sk(s)->lock)
> > skb_queue_tail(&other->sk_receive_queue, skb);
> > spin_lock_irqsave(&list->lock, flags);
> 
> 
> Do you mean the following?
> There is socket 1 with state lock (S1) and queue lock (Q2), and socket
> 2 with state lock (S2) and queue lock (Q2). unix_dgram_sendmsg lock
> S1->Q1. And sk_diag_dump_icons locks Q1->S2.
> If yes, then this looks pretty much as deadlock. Consider that 2
> unix_dgram_sendmsg in 2 different threads lock S1 and S2 respectively.
> Now 2  sk_diag_dump_icons in 2 different threads lock Q1 and Q2
> respectively. Now sk_diag_dump_icons want to lock S's, and
> unix_dgram_sendmsg want to lock Q's. Nobody can proceed.

Q1 and S1 belongs to a listen socket, so they can't be taken from
unix_dgram_sendmsg().

Re: [PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases

2018-05-14 Thread Tobin C. Harding

On Mon, May 14, 2018 at 11:54:08PM -0600, David Ahern wrote:
> On 5/14/18 11:21 PM, Tobin C. Harding wrote:
> > Hi David,
> > 
> > On Mon, May 14, 2018 at 07:51:05PM -0700, David Ahern wrote:
> >> This patch set fixes a few append and replace uses cases for IPv6 and
> >> adds test cases that codifies the expectations of how append and replace
> >> are expected to work.
> > 
> > Nood question: what commit does this apply on top of please.  I
> > attempted to apply the set on top of net-next
> > 
> > commit (961423f9fcbc Merge branch 'sctp-Introduce-sctp_flush_ctx')
> > 
> > patch 4 and 5 have merge conflicts (I stopped at 5).
> 
> Base commit is:
> 
> commit 289e1f4e9e4a09c73a1c0152bb93855ea351ccda
> Author: Anders Roxell 
> Date:   Sun May 13 21:48:30 2018 +0200
> 
> net: ipv4: ipconfig: fix unused variable
> 
> 
> 
> I don't see 961423f9fcbc in any tree.

Got it, thanks.

FTR

$ date
Tuesday 15 May  16:05:51 AEST 2018

$ git status
On branch net-next
Your branch is up-to-date with 'net-next/master'.
nothing to commit, working directory clean

$ git log -49 --pretty=oneline --abbrev-commit --reverse | grep 'ipv4: 
ipconfig: fix unused variable'
289e1f4e9e4a net: ipv4: ipconfig: fix unused variable

$ git log -48 --pretty=oneline --abbrev-commit --reverse | grep 'ipv4: 
ipconfig: fix unused variable'



Queue questions on how to run your selftests ;)

thanks,
Tobin.

Re: [PATCH bpf-next v2 0/5] samples: bpf: fix build after move to full libbpf

2018-05-14 Thread Jesper Dangaard Brouer

On Mon, 14 May 2018 22:57:53 -0700
Alexei Starovoitov  wrote:

> On Mon, May 14, 2018 at 10:35:01PM -0700, Jakub Kicinski wrote:
> > Hi!
> > 
> > Following patches address build issues after recent move to libbpf.
> > For out-of-tree builds we would see the following error:
> > 
> > gcc: error: samples/bpf/../../tools/lib/bpf/libbpf.a: No such file or 
> > directory
> > 
> > libbpf build system is now always invoked explicitly rather than
> > relying on building single objects most of the time.  We need to
> > resolve the friction between Kbuild and tools/ build system.
> > 
> > Mini-library called libbpf.h in samples is renamed to bpf_insn.h,
> > using linux/filter.h seems not completely trivial since some samples
> > get upset when order on include search path in changed.  We do have
> > to rename libbpf.h, however, because otherwise it's hard to reliably
> > get to libbpf's header in out-of-tree builds.
> > 
> > v2:
> >  - fix the build error harder (patch 3);
> >  - add patch 5 (make clang less noisy).  
> 
> Applied, Thanks

I just tried it out, but the build fails.

$ make samples/bpf/
  CHK include/config/kernel.release
  CHK include/generated/uapi/linux/version.h
  CHK include/generated/utsrelease.h
  CHK include/generated/bounds.h
  CHK include/generated/timeconst.h
  CHK include/generated/asm-offsets.h
  CALLscripts/checksyscalls.sh
  DESCEND  objtool
  CHK scripts/mod/devicetable-offsets.h
make -C /home/jbrouer/git/kernel/bpf-next/samples/bpf/../../tools/lib/bpf/ 
RM='rm -rf' LDFLAGS= 
srctree=/home/jbrouer/git/kernel/bpf-next/samples/bpf/../../ O=

Auto-detecting system features:
...libelf: [ OFF ]
...   bpf: [ OFF ]

No libelf found
make[2]: *** [Makefile:212: elfdep] Error 1
make[1]: *** [samples/bpf/Makefile:207: 
/home/jbrouer/git/kernel/bpf-next/samples/bpf/../../tools/lib/bpf/libbpf.a] 
Error 2
make: *** [Makefile:1743: samples/bpf/] Error 2
 

But it might not be related to this change, as the problem seems to the
with the "Auto-detecting system features".  It is the build of libbpf
that is the problem.

$ cd tools/lib/bpf/
$ make

Auto-detecting system features:
...libelf: [ OFF ]
...   bpf: [ OFF ]

No libelf found
make: *** [Makefile:212: elfdep] Error 1


And do have 'libelf' installed on this system... the same build/make
commands works on net-next.

On net-next I see:

$ cd tools/lib/bpf/
$ make

Auto-detecting system features:
...libelf: [ on  ]
...   bpf: [ on  ]

Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from 
latest version at 'include/uapi/linux/bpf.h'
Warning: Kernel ABI header at 'tools/include/uapi/linux/if_link.h' differs from 
latest version at 'include/uapi/linux/if_link.h'
  CC   libbpf.o
  CC   bpf.o
  CC   nlattr.o
  CC   btf.o
  LD   libbpf-in.o
  LINK libbpf.a
  LINK libbpf.so


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [bpf-next PATCH 0/5] bpf, doc: convert Documentation/bpf to RST-formatting

2018-05-14 Thread Alexei Starovoitov

On Mon, May 14, 2018 at 10:57:25PM -0700, Y Song wrote:
> On Mon, May 14, 2018 at 6:42 AM, Jesper Dangaard Brouer
>  wrote:
> > The kernel is moving files under Documentation to use the RST
> > (reStructuredText) format and Sphinx [1].  This patchset converts the
> > files under Documentation/bpf/ into RST format.  The Sphinx
> > integration is left as followup work.
> >
> > [1] https://www.kernel.org/doc/html/latest/doc-guide/sphinx.html
> >
> > This patchset have been uploaded as branch bpf_doc10 on github[2], so
> > reviewers can see how GitHub renders this.
> >
> > [2] https://github.com/netoptimizer/linux/tree/bpf_doc10/Documentation/bpf
> >
> > ---
> >
> > Jesper Dangaard Brouer (5):
> >   bpf, doc: add basic README.rst file
> >   bpf, doc: rename txt files to rst files
> >   bpf, doc: convert bpf_design_QA.rst to use RST formatting
> >   bpf, doc: convert bpf_devel_QA.rst to use RST formatting
> >   bpf, doc: howto use/run the BPF selftests
> 
> This initial conversion from .txt to .rst files look good to me. Ack
> for the whole series.
> Acked-by: Yonghong Song 

Applied, Thanks

Re: [PATCH bpf-next v2 3/5] samples: bpf: fix build after move to compiling full libbpf.a

2018-05-14 Thread Björn Töpel

2018-05-15 7:35 GMT+02:00 Jakub Kicinski :
> There are many ways users may compile samples, some of them got
> broken by commit 5f9380572b4b ("samples: bpf: compile and link
> against full libbpf").  Improve path resolution and make libbpf
> building a dependency of source files to force its build.
>
> Samples should now again build with any of:
>  cd samples/bpf; make
>  make samples/bpf/
>  make -C samples/bpf
>  cd samples/bpf; make O=builddir
>  make samples/bpf/ O=builddir
>  make -C samples/bpf O=builddir
>  export KBUILD_OUTPUT=builddir
>  make samples/bpf/
>  make -C samples/bpf
>
> Fixes: 5f9380572b4b ("samples: bpf: compile and link against full libbpf")
> Reported-by: Björn Töpel 
> Signed-off-by: Jakub Kicinski 
> ---
>  samples/bpf/Makefile | 17 -
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 9e255ca4059a..0dae77c88d2e 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -1,4 +1,8 @@
>  # SPDX-License-Identifier: GPL-2.0
> +
> +BPF_SAMPLES_PATH ?= $(abspath $(srctree)/$(src))
> +TOOLS_PATH := $(BPF_SAMPLES_PATH)/../../tools
> +
>  # List of programs to build
>  hostprogs-y := test_lru_dist
>  hostprogs-y += sock_example
> @@ -49,7 +53,8 @@ hostprogs-y += xdpsock
>  hostprogs-y += xdp_fwd
>
>  # Libbpf dependencies
> -LIBBPF := ../../tools/lib/bpf/libbpf.a
> +LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
> +
>  CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
>  TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o
>
> @@ -233,15 +238,16 @@ CLANG_ARCH_ARGS = -target $(ARCH)
>  endif
>
>  # Trick to allow make to be run from this directory
> -all: $(LIBBPF)
> -   $(MAKE) -C ../../ $(CURDIR)/
> +all:
> +   $(MAKE) -C ../../ $(CURDIR)/ BPF_SAMPLES_PATH=$(CURDIR)
>
>  clean:
> $(MAKE) -C ../../ M=$(CURDIR) clean
> @rm -f *~
>
>  $(LIBBPF): FORCE
> -   $(MAKE) -C $(dir $@)
> +# Fix up variables inherited from Kbuild that tools/ build system won't like
> +   $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= 
> srctree=$(BPF_SAMPLES_PATH)/../../ O=
>
>  $(obj)/syscall_nrs.s:  $(src)/syscall_nrs.c
> $(call if_changed_dep,cc_s_c)
> @@ -272,7 +278,8 @@ verify_target_bpf: verify_cmds
> exit 2; \
> else true; fi
>
> -$(src)/*.c: verify_target_bpf
> +$(BPF_SAMPLES_PATH)/*.c: verify_target_bpf $(LIBBPF)
> +$(src)/*.c: verify_target_bpf $(LIBBPF)
>
>  $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
>
> --
> 2.17.0
>


Thanks a bunch for this!

Tested-by: Björn Töpel

Re: [bpf-next PATCH 0/5] bpf, doc: convert Documentation/bpf to RST-formatting

2018-05-14 Thread Y Song

On Mon, May 14, 2018 at 6:42 AM, Jesper Dangaard Brouer
 wrote:
> The kernel is moving files under Documentation to use the RST
> (reStructuredText) format and Sphinx [1].  This patchset converts the
> files under Documentation/bpf/ into RST format.  The Sphinx
> integration is left as followup work.
>
> [1] https://www.kernel.org/doc/html/latest/doc-guide/sphinx.html
>
> This patchset have been uploaded as branch bpf_doc10 on github[2], so
> reviewers can see how GitHub renders this.
>
> [2] https://github.com/netoptimizer/linux/tree/bpf_doc10/Documentation/bpf
>
> ---
>
> Jesper Dangaard Brouer (5):
>   bpf, doc: add basic README.rst file
>   bpf, doc: rename txt files to rst files
>   bpf, doc: convert bpf_design_QA.rst to use RST formatting
>   bpf, doc: convert bpf_devel_QA.rst to use RST formatting
>   bpf, doc: howto use/run the BPF selftests

This initial conversion from .txt to .rst files look good to me. Ack
for the whole series.
Acked-by: Yonghong Song 

>
>
>  Documentation/bpf/README.rst|   36 ++
>  Documentation/bpf/bpf_design_QA.rst |  221 
>  Documentation/bpf/bpf_design_QA.txt |  156 -
>  Documentation/bpf/bpf_devel_QA.rst  |  640 
> +++
>  Documentation/bpf/bpf_devel_QA.txt  |  570 ---
>  5 files changed, 897 insertions(+), 726 deletions(-)
>  create mode 100644 Documentation/bpf/README.rst
>  create mode 100644 Documentation/bpf/bpf_design_QA.rst
>  delete mode 100644 Documentation/bpf/bpf_design_QA.txt
>  create mode 100644 Documentation/bpf/bpf_devel_QA.rst
>  delete mode 100644 Documentation/bpf/bpf_devel_QA.txt
>
> --

Re: [PATCH bpf-next v2 0/5] samples: bpf: fix build after move to full libbpf

2018-05-14 Thread Alexei Starovoitov

On Mon, May 14, 2018 at 10:35:01PM -0700, Jakub Kicinski wrote:
> Hi!
> 
> Following patches address build issues after recent move to libbpf.
> For out-of-tree builds we would see the following error:
> 
> gcc: error: samples/bpf/../../tools/lib/bpf/libbpf.a: No such file or 
> directory
> 
> libbpf build system is now always invoked explicitly rather than
> relying on building single objects most of the time.  We need to
> resolve the friction between Kbuild and tools/ build system.
> 
> Mini-library called libbpf.h in samples is renamed to bpf_insn.h,
> using linux/filter.h seems not completely trivial since some samples
> get upset when order on include search path in changed.  We do have
> to rename libbpf.h, however, because otherwise it's hard to reliably
> get to libbpf's header in out-of-tree builds.
> 
> v2:
>  - fix the build error harder (patch 3);
>  - add patch 5 (make clang less noisy).

Applied, Thanks

Re: [PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases

2018-05-14 Thread David Ahern

On 5/14/18 11:21 PM, Tobin C. Harding wrote:
> Hi David,
> 
> On Mon, May 14, 2018 at 07:51:05PM -0700, David Ahern wrote:
>> This patch set fixes a few append and replace uses cases for IPv6 and
>> adds test cases that codifies the expectations of how append and replace
>> are expected to work.
> 
> Nood question: what commit does this apply on top of please.  I
> attempted to apply the set on top of net-next
> 
>   commit (961423f9fcbc Merge branch 'sctp-Introduce-sctp_flush_ctx')
> 
> patch 4 and 5 have merge conflicts (I stopped at 5).

Base commit is:

commit 289e1f4e9e4a09c73a1c0152bb93855ea351ccda
Author: Anders Roxell 
Date:   Sun May 13 21:48:30 2018 +0200

net: ipv4: ipconfig: fix unused variable



I don't see 961423f9fcbc in any tree.

[PATCH bpf-next v2 4/5] samples: bpf: move libbpf from object dependencies to libs

2018-05-14 Thread Jakub Kicinski

Make complains that it doesn't know how to make libbpf.a:

scripts/Makefile.host:106: target 'samples/bpf/../../tools/lib/bpf/libbpf.a' 
doesn't match the target pattern

Now that we have it as a dependency of the sources simply add libbpf.a
to libraries not objects.

Signed-off-by: Jakub Kicinski 
Acked-by: Jesper Dangaard Brouer 
---
 samples/bpf/Makefile | 145 +++
 1 file changed, 51 insertions(+), 94 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 0dae77c88d2e..0036a77c2d97 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -58,55 +58,53 @@ LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
 CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
 TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o
 
-test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
-sock_example-objs := sock_example.o $(LIBBPF)
-fds_example-objs := bpf_load.o $(LIBBPF) fds_example.o
-sockex1-objs := bpf_load.o $(LIBBPF) sockex1_user.o
-sockex2-objs := bpf_load.o $(LIBBPF) sockex2_user.o
-sockex3-objs := bpf_load.o $(LIBBPF) sockex3_user.o
-tracex1-objs := bpf_load.o $(LIBBPF) tracex1_user.o
-tracex2-objs := bpf_load.o $(LIBBPF) tracex2_user.o
-tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
-tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
-tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
-tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
-tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
-load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
-test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
-trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o $(TRACE_HELPERS)
-lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
-offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o $(TRACE_HELPERS)
-spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o $(TRACE_HELPERS)
-map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o
-test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o
-test_cgrp2_array_pin-objs := test_cgrp2_array_pin.o $(LIBBPF)
-test_cgrp2_attach-objs := test_cgrp2_attach.o $(LIBBPF)
-test_cgrp2_attach2-objs := test_cgrp2_attach2.o $(LIBBPF) $(CGROUP_HELPERS)
-test_cgrp2_sock-objs := test_cgrp2_sock.o $(LIBBPF)
-test_cgrp2_sock2-objs := bpf_load.o $(LIBBPF) test_cgrp2_sock2.o
-xdp1-objs := xdp1_user.o $(LIBBPF)
+fds_example-objs := bpf_load.o fds_example.o
+sockex1-objs := bpf_load.o sockex1_user.o
+sockex2-objs := bpf_load.o sockex2_user.o
+sockex3-objs := bpf_load.o sockex3_user.o
+tracex1-objs := bpf_load.o tracex1_user.o
+tracex2-objs := bpf_load.o tracex2_user.o
+tracex3-objs := bpf_load.o tracex3_user.o
+tracex4-objs := bpf_load.o tracex4_user.o
+tracex5-objs := bpf_load.o tracex5_user.o
+tracex6-objs := bpf_load.o tracex6_user.o
+tracex7-objs := bpf_load.o tracex7_user.o
+load_sock_ops-objs := bpf_load.o load_sock_ops.o
+test_probe_write_user-objs := bpf_load.o test_probe_write_user_user.o
+trace_output-objs := bpf_load.o trace_output_user.o $(TRACE_HELPERS)
+lathist-objs := bpf_load.o lathist_user.o
+offwaketime-objs := bpf_load.o offwaketime_user.o $(TRACE_HELPERS)
+spintest-objs := bpf_load.o spintest_user.o $(TRACE_HELPERS)
+map_perf_test-objs := bpf_load.o map_perf_test_user.o
+test_overhead-objs := bpf_load.o test_overhead_user.o
+test_cgrp2_array_pin-objs := test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := test_cgrp2_attach.o
+test_cgrp2_attach2-objs := test_cgrp2_attach2.o $(CGROUP_HELPERS)
+test_cgrp2_sock-objs := test_cgrp2_sock.o
+test_cgrp2_sock2-objs := bpf_load.o test_cgrp2_sock2.o
+xdp1-objs := xdp1_user.o
 # reuse xdp1 source intentionally
-xdp2-objs := xdp1_user.o $(LIBBPF)
-xdp_router_ipv4-objs := bpf_load.o $(LIBBPF) xdp_router_ipv4_user.o
-test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) $(CGROUP_HELPERS) \
+xdp2-objs := xdp1_user.o
+xdp_router_ipv4-objs := bpf_load.o xdp_router_ipv4_user.o
+test_current_task_under_cgroup-objs := bpf_load.o $(CGROUP_HELPERS) \
   test_current_task_under_cgroup_user.o
-trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o $(TRACE_HELPERS)
-sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o $(TRACE_HELPERS)
-tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o
-lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o
-xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
-test_map_in_map-objs := bpf_load.o $(LIBBPF) test_map_in_map_user.o
-per_socket_stats_example-objs := cookie_uid_helper_example.o $(LIBBPF)
-xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
-xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
-xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o
-xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
-xdp_rxq_info-objs := xdp_rxq_info_user.o $(LIBBPF)
-syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
-cpustat-objs := bpf_load.o $(LIBBP

[PATCH bpf-next v2 2/5] samples: bpf: rename libbpf.h to bpf_insn.h

2018-05-14 Thread Jakub Kicinski

The libbpf.h file in samples is clashing with libbpf's header.
Since it only includes a subset of filter.h instruction helpers
rename it to bpf_insn.h.  Drop the unnecessary include of bpf/bpf.h.

Signed-off-by: Jakub Kicinski 
Acked-by: Jesper Dangaard Brouer 
---
 samples/bpf/{libbpf.h => bpf_insn.h}| 8 +++-
 samples/bpf/cookie_uid_helper_example.c | 2 +-
 samples/bpf/fds_example.c   | 4 +++-
 samples/bpf/sock_example.c  | 3 ++-
 samples/bpf/test_cgrp2_attach.c | 3 ++-
 samples/bpf/test_cgrp2_attach2.c| 3 ++-
 samples/bpf/test_cgrp2_sock.c   | 3 ++-
 samples/bpf/test_cgrp2_sock2.c  | 3 ++-
 8 files changed, 17 insertions(+), 12 deletions(-)
 rename samples/bpf/{libbpf.h => bpf_insn.h} (98%)

diff --git a/samples/bpf/libbpf.h b/samples/bpf/bpf_insn.h
similarity index 98%
rename from samples/bpf/libbpf.h
rename to samples/bpf/bpf_insn.h
index 18bfee5aab6b..20dc5cefec84 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/bpf_insn.h
@@ -1,9 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-/* eBPF mini library */
-#ifndef __LIBBPF_H
-#define __LIBBPF_H
-
-#include 
+/* eBPF instruction mini library */
+#ifndef __BPF_INSN_H
+#define __BPF_INSN_H
 
 struct bpf_insn;
 
diff --git a/samples/bpf/cookie_uid_helper_example.c 
b/samples/bpf/cookie_uid_helper_example.c
index 8eca27e595ae..deb0e3e0324d 100644
--- a/samples/bpf/cookie_uid_helper_example.c
+++ b/samples/bpf/cookie_uid_helper_example.c
@@ -51,7 +51,7 @@
 #include 
 #include 
 #include 
-#include "libbpf.h"
+#include "bpf_insn.h"
 
 #define PORT 
 
diff --git a/samples/bpf/fds_example.c b/samples/bpf/fds_example.c
index e29bd52ff9e8..9854854f05d1 100644
--- a/samples/bpf/fds_example.c
+++ b/samples/bpf/fds_example.c
@@ -12,8 +12,10 @@
 #include 
 #include 
 
+#include 
+
+#include "bpf_insn.h"
 #include "bpf_load.h"
-#include "libbpf.h"
 #include "sock_example.h"
 
 #define BPF_F_PIN  (1 << 0)
diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
index 33a637507c00..60ec467c78ab 100644
--- a/samples/bpf/sock_example.c
+++ b/samples/bpf/sock_example.c
@@ -26,7 +26,8 @@
 #include 
 #include 
 #include 
-#include "libbpf.h"
+#include 
+#include "bpf_insn.h"
 #include "sock_example.h"
 
 char bpf_log_buf[BPF_LOG_BUF_SIZE];
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
index 4bfcaf93fcf3..20fbd1241db3 100644
--- a/samples/bpf/test_cgrp2_attach.c
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -28,8 +28,9 @@
 #include 
 
 #include 
+#include 
 
-#include "libbpf.h"
+#include "bpf_insn.h"
 
 enum {
MAP_KEY_PACKETS,
diff --git a/samples/bpf/test_cgrp2_attach2.c b/samples/bpf/test_cgrp2_attach2.c
index 1af412ec6007..b453e6a161be 100644
--- a/samples/bpf/test_cgrp2_attach2.c
+++ b/samples/bpf/test_cgrp2_attach2.c
@@ -24,8 +24,9 @@
 #include 
 
 #include 
+#include 
 
-#include "libbpf.h"
+#include "bpf_insn.h"
 #include "cgroup_helpers.h"
 
 #define FOO"/foo"
diff --git a/samples/bpf/test_cgrp2_sock.c b/samples/bpf/test_cgrp2_sock.c
index e79594dd629b..b0811da5a00f 100644
--- a/samples/bpf/test_cgrp2_sock.c
+++ b/samples/bpf/test_cgrp2_sock.c
@@ -21,8 +21,9 @@
 #include 
 #include 
 #include 
+#include 
 
-#include "libbpf.h"
+#include "bpf_insn.h"
 
 char bpf_log_buf[BPF_LOG_BUF_SIZE];
 
diff --git a/samples/bpf/test_cgrp2_sock2.c b/samples/bpf/test_cgrp2_sock2.c
index e53f1f6f0867..3b5be2364975 100644
--- a/samples/bpf/test_cgrp2_sock2.c
+++ b/samples/bpf/test_cgrp2_sock2.c
@@ -19,8 +19,9 @@
 #include 
 #include 
 #include 
+#include 
 
-#include "libbpf.h"
+#include "bpf_insn.h"
 #include "bpf_load.h"
 
 static int usage(const char *argv0)
-- 
2.17.0

Re: [PATCH v1 4/4] samples/bpf: an example of a raw IR decoder

2018-05-14 Thread Y Song

On Mon, May 14, 2018 at 2:11 PM, Sean Young  wrote:
> This implements the grundig-16 IR protocol.
>
> Signed-off-by: Sean Young 
> ---
>  samples/bpf/Makefile  |   4 +
>  samples/bpf/bpf_load.c|   9 +-
>  samples/bpf/grundig_decoder_kern.c| 112 ++
>  samples/bpf/grundig_decoder_user.c|  54 +++
>  tools/bpf/bpftool/prog.c  |   1 +
>  tools/include/uapi/linux/bpf.h|  17 +++-
>  tools/testing/selftests/bpf/bpf_helpers.h |   6 ++
>  7 files changed, 200 insertions(+), 3 deletions(-)
>  create mode 100644 samples/bpf/grundig_decoder_kern.c
>  create mode 100644 samples/bpf/grundig_decoder_user.c
>
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 4d6a6edd4bf6..c6fa111f103a 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -44,6 +44,7 @@ hostprogs-y += xdp_monitor
>  hostprogs-y += xdp_rxq_info
>  hostprogs-y += syscall_tp
>  hostprogs-y += cpustat
> +hostprogs-y += grundig_decoder
>
>  # Libbpf dependencies
>  LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
> @@ -95,6 +96,7 @@ xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
>  xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
>  syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
>  cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
> +grundig_decoder-objs := bpf_load.o $(LIBBPF) grundig_decoder_user.o
>
>  # Tell kbuild to always build the programs
>  always := $(hostprogs-y)
> @@ -148,6 +150,7 @@ always += xdp_rxq_info_kern.o
>  always += xdp2skb_meta_kern.o
>  always += syscall_tp_kern.o
>  always += cpustat_kern.o
> +always += grundig_decoder_kern.o
>
>  HOSTCFLAGS += -I$(objtree)/usr/include
>  HOSTCFLAGS += -I$(srctree)/tools/lib/
> @@ -193,6 +196,7 @@ HOSTLOADLIBES_xdp_monitor += -lelf
>  HOSTLOADLIBES_xdp_rxq_info += -lelf
>  HOSTLOADLIBES_syscall_tp += -lelf
>  HOSTLOADLIBES_cpustat += -lelf
> +HOSTLOADLIBES_grundig_decoder += -lelf
>
>  # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
> cmdline:
>  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
> CLANG=~/git/llvm/build/bin/clang
> diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
> index bebe4188b4b3..0fd389e95bb9 100644
> --- a/samples/bpf/bpf_load.c
> +++ b/samples/bpf/bpf_load.c
> @@ -69,6 +69,7 @@ static int load_and_attach(const char *event, struct 
> bpf_insn *prog, int size)
> bool is_sockops = strncmp(event, "sockops", 7) == 0;
> bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
> bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
> +   bool is_ir_decoder = strncmp(event, "ir_decoder", 10) == 0;
> size_t insns_cnt = size / sizeof(struct bpf_insn);
> enum bpf_prog_type prog_type;
> char buf[256];
> @@ -102,6 +103,8 @@ static int load_and_attach(const char *event, struct 
> bpf_insn *prog, int size)
> prog_type = BPF_PROG_TYPE_SK_SKB;
> } else if (is_sk_msg) {
> prog_type = BPF_PROG_TYPE_SK_MSG;
> +   } else if (is_ir_decoder) {
> +   prog_type = BPF_PROG_TYPE_RAWIR_DECODER;
> } else {
> printf("Unknown event '%s'\n", event);
> return -1;
> @@ -116,7 +119,8 @@ static int load_and_attach(const char *event, struct 
> bpf_insn *prog, int size)
>
> prog_fd[prog_cnt++] = fd;
>
> -   if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
> +   if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk ||
> +   is_ir_decoder)
> return 0;
>
> if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
> @@ -607,7 +611,8 @@ static int do_load_bpf_file(const char *path, 
> fixup_map_cb fixup_map)
> memcmp(shname, "cgroup/", 7) == 0 ||
> memcmp(shname, "sockops", 7) == 0 ||
> memcmp(shname, "sk_skb", 6) == 0 ||
> -   memcmp(shname, "sk_msg", 6) == 0) {
> +   memcmp(shname, "sk_msg", 6) == 0 ||
> +   memcmp(shname, "ir_decoder", 10) == 0) {
> ret = load_and_attach(shname, data->d_buf,
>   data->d_size);
> if (ret != 0)
> diff --git a/samples/bpf/grundig_decoder_kern.c 
> b/samples/bpf/grundig_decoder_kern.c
> new file mode 100644
> index ..c80f2c9cc69a
> --- /dev/null
> +++ b/samples/bpf/grundig_decoder_kern.c
> @@ -0,0 +1,112 @@
> +
> +#include 
> +#include 
> +#include "bpf_helpers.h"
> +#include 
> +
> +enum grundig_state {
> +   STATE_INACTIVE,
> +   STATE_HEADER_SPACE,
> +   STATE_LEADING_PULSE,
> +   STATE_BITS_SPACE,
> +   STATE_BITS_PULSE,
> +};
> +
> +struct decoder_state {
> +   u32 bits;
> +   enum grundig_state state;
> +   u32 count;
> +   u32 last_space;
> +};
> +
> +struct bpf_map_def SEC("maps") decoder_state_map

[PATCH bpf-next v2 3/5] samples: bpf: fix build after move to compiling full libbpf.a

2018-05-14 Thread Jakub Kicinski

There are many ways users may compile samples, some of them got
broken by commit 5f9380572b4b ("samples: bpf: compile and link
against full libbpf").  Improve path resolution and make libbpf
building a dependency of source files to force its build.

Samples should now again build with any of:
 cd samples/bpf; make
 make samples/bpf/
 make -C samples/bpf
 cd samples/bpf; make O=builddir
 make samples/bpf/ O=builddir
 make -C samples/bpf O=builddir
 export KBUILD_OUTPUT=builddir
 make samples/bpf/
 make -C samples/bpf

Fixes: 5f9380572b4b ("samples: bpf: compile and link against full libbpf")
Reported-by: Björn Töpel 
Signed-off-by: Jakub Kicinski 
---
 samples/bpf/Makefile | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 9e255ca4059a..0dae77c88d2e 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -1,4 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
+
+BPF_SAMPLES_PATH ?= $(abspath $(srctree)/$(src))
+TOOLS_PATH := $(BPF_SAMPLES_PATH)/../../tools
+
 # List of programs to build
 hostprogs-y := test_lru_dist
 hostprogs-y += sock_example
@@ -49,7 +53,8 @@ hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 
 # Libbpf dependencies
-LIBBPF := ../../tools/lib/bpf/libbpf.a
+LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
+
 CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
 TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o
 
@@ -233,15 +238,16 @@ CLANG_ARCH_ARGS = -target $(ARCH)
 endif
 
 # Trick to allow make to be run from this directory
-all: $(LIBBPF)
-   $(MAKE) -C ../../ $(CURDIR)/
+all:
+   $(MAKE) -C ../../ $(CURDIR)/ BPF_SAMPLES_PATH=$(CURDIR)
 
 clean:
$(MAKE) -C ../../ M=$(CURDIR) clean
@rm -f *~
 
 $(LIBBPF): FORCE
-   $(MAKE) -C $(dir $@)
+# Fix up variables inherited from Kbuild that tools/ build system won't like
+   $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= 
srctree=$(BPF_SAMPLES_PATH)/../../ O=
 
 $(obj)/syscall_nrs.s:  $(src)/syscall_nrs.c
$(call if_changed_dep,cc_s_c)
@@ -272,7 +278,8 @@ verify_target_bpf: verify_cmds
exit 2; \
else true; fi
 
-$(src)/*.c: verify_target_bpf
+$(BPF_SAMPLES_PATH)/*.c: verify_target_bpf $(LIBBPF)
+$(src)/*.c: verify_target_bpf $(LIBBPF)
 
 $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
 
-- 
2.17.0

[PATCH bpf-next v2 1/5] samples: bpf: include bpf/bpf.h instead of local libbpf.h

2018-05-14 Thread Jakub Kicinski

There are two files in the tree called libbpf.h which is becoming
problematic.  Most samples don't actually need the local libbpf.h
they simply include it to get to bpf/bpf.h.  Include bpf/bpf.h
directly instead.

Signed-off-by: Jakub Kicinski 
Acked-by: Jesper Dangaard Brouer 
---
 samples/bpf/bpf_load.c| 2 +-
 samples/bpf/bpf_load.h| 2 +-
 samples/bpf/cpustat_user.c| 2 +-
 samples/bpf/lathist_user.c| 2 +-
 samples/bpf/load_sock_ops.c   | 2 +-
 samples/bpf/lwt_len_hist_user.c   | 2 +-
 samples/bpf/map_perf_test_user.c  | 2 +-
 samples/bpf/sock_example.h| 1 -
 samples/bpf/sockex1_user.c| 2 +-
 samples/bpf/sockex2_user.c| 2 +-
 samples/bpf/sockex3_user.c| 2 +-
 samples/bpf/syscall_tp_user.c | 2 +-
 samples/bpf/tc_l2_redirect_user.c | 2 +-
 samples/bpf/test_cgrp2_array_pin.c| 2 +-
 samples/bpf/test_current_task_under_cgroup_user.c | 2 +-
 samples/bpf/test_lru_dist.c   | 2 +-
 samples/bpf/test_map_in_map_user.c| 2 +-
 samples/bpf/test_overhead_user.c  | 2 +-
 samples/bpf/test_probe_write_user_user.c  | 2 +-
 samples/bpf/trace_output_user.c   | 2 +-
 samples/bpf/tracex1_user.c| 2 +-
 samples/bpf/tracex2_user.c| 2 +-
 samples/bpf/tracex3_user.c| 2 +-
 samples/bpf/tracex4_user.c| 2 +-
 samples/bpf/tracex5_user.c| 2 +-
 samples/bpf/tracex6_user.c| 2 +-
 samples/bpf/tracex7_user.c| 2 +-
 samples/bpf/xdp_fwd_user.c| 2 +-
 samples/bpf/xdp_monitor_user.c| 2 +-
 samples/bpf/xdp_redirect_cpu_user.c   | 2 +-
 samples/bpf/xdp_redirect_map_user.c   | 2 +-
 samples/bpf/xdp_redirect_user.c   | 2 +-
 samples/bpf/xdp_router_ipv4_user.c| 2 +-
 samples/bpf/xdp_tx_iptunnel_user.c| 2 +-
 samples/bpf/xdpsock_user.c| 2 +-
 35 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index a6b290de5632..89161c9ed466 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -24,7 +24,7 @@
 #include 
 #include 
 #include 
-#include "libbpf.h"
+#include 
 #include "bpf_load.h"
 #include "perf-sys.h"
 
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index f9da59bca0cc..814894a12974 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -2,7 +2,7 @@
 #ifndef __BPF_LOAD_H
 #define __BPF_LOAD_H
 
-#include "libbpf.h"
+#include 
 
 #define MAX_MAPS 32
 #define MAX_PROGS 32
diff --git a/samples/bpf/cpustat_user.c b/samples/bpf/cpustat_user.c
index 2b4cd1ae57c5..869a99406dbf 100644
--- a/samples/bpf/cpustat_user.c
+++ b/samples/bpf/cpustat_user.c
@@ -17,7 +17,7 @@
 #include 
 #include 
 
-#include "libbpf.h"
+#include 
 #include "bpf_load.h"
 
 #define MAX_CPU8
diff --git a/samples/bpf/lathist_user.c b/samples/bpf/lathist_user.c
index 6477bad5b4e2..c8e88cc84e61 100644
--- a/samples/bpf/lathist_user.c
+++ b/samples/bpf/lathist_user.c
@@ -10,7 +10,7 @@
 #include 
 #include 
 #include 
-#include "libbpf.h"
+#include 
 #include "bpf_load.h"
 
 #define MAX_ENTRIES20
diff --git a/samples/bpf/load_sock_ops.c b/samples/bpf/load_sock_ops.c
index e5da6cf71a3e..8ecb41ea0c03 100644
--- a/samples/bpf/load_sock_ops.c
+++ b/samples/bpf/load_sock_ops.c
@@ -8,7 +8,7 @@
 #include 
 #include 
 #include 
-#include "libbpf.h"
+#include 
 #include "bpf_load.h"
 #include 
 #include 
diff --git a/samples/bpf/lwt_len_hist_user.c b/samples/bpf/lwt_len_hist_user.c
index 7fcb94c09112..587b68b1f8dd 100644
--- a/samples/bpf/lwt_len_hist_user.c
+++ b/samples/bpf/lwt_len_hist_user.c
@@ -9,7 +9,7 @@
 #include 
 #include 
 
-#include "libbpf.h"
+#include 
 #include "bpf_util.h"
 
 #define MAX_INDEX 64
diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c
index 519d9af4b04a..38b7b1a96cc2 100644
--- a/samples/bpf/map_perf_test_user.c
+++ b/samples/bpf/map_perf_test_user.c
@@ -21,7 +21,7 @@
 #include 
 #include 
 
-#include "libbpf.h"
+#include 
 #include "bpf_load.h"
 
 #define TEST_BIT(t) (1U << (t))
diff --git a/samples/bpf/sock_example.h b/samples/bpf/sock_example.h
index 772d5dad8465..a27d7579bc73 100644
--- a/samples/bpf/sock_example.h
+++ b/samples/bpf/sock_example.h
@@ -9,7 +9,6 @@
 #include 
 #include 
 #include 
-#include "libbpf.h"
 
 static inline int open_raw_sock(const char *name)
 {
diff --git a/samples/bpf/sockex1_user.c b/samples/bpf/sockex1_user.c
index 2be935c2627d..93ec01c56104 100644
--- a/samples/bpf/sockex1_user.c
+++ b/samples/bpf/sockex1_

[PATCH bpf-next v2 5/5] samples: bpf: make the build less noisy

2018-05-14 Thread Jakub Kicinski

Building samples with clang ignores the $(Q) setting, always
printing full command to the output.  Make it less verbose.

Signed-off-by: Jakub Kicinski 
---
 samples/bpf/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 0036a77c2d97..62d1aa1a4cf3 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -244,7 +244,8 @@ $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
 # But, there is no easy way to fix it, so just exclude it since it is
 # useless for BPF samples.
 $(obj)/%.o: $(src)/%.c
-   $(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
+   @echo "  CLANG-bpf " $@
+   $(Q)$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) 
\
-I$(srctree)/tools/testing/selftests/bpf/ \
-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \
-- 
2.17.0

[PATCH bpf-next v2 0/5] samples: bpf: fix build after move to full libbpf

2018-05-14 Thread Jakub Kicinski

Hi!

Following patches address build issues after recent move to libbpf.
For out-of-tree builds we would see the following error:

gcc: error: samples/bpf/../../tools/lib/bpf/libbpf.a: No such file or directory

libbpf build system is now always invoked explicitly rather than
relying on building single objects most of the time.  We need to
resolve the friction between Kbuild and tools/ build system.

Mini-library called libbpf.h in samples is renamed to bpf_insn.h,
using linux/filter.h seems not completely trivial since some samples
get upset when order on include search path in changed.  We do have
to rename libbpf.h, however, because otherwise it's hard to reliably
get to libbpf's header in out-of-tree builds.

v2:
 - fix the build error harder (patch 3);
 - add patch 5 (make clang less noisy).

Jakub Kicinski (5):
  samples: bpf: include bpf/bpf.h instead of local libbpf.h
  samples: bpf: rename libbpf.h to bpf_insn.h
  samples: bpf: fix build after move to compiling full libbpf.a
  samples: bpf: move libbpf from object dependencies to libs
  samples: bpf: make the build less noisy

 samples/bpf/Makefile  | 165 +++---
 samples/bpf/{libbpf.h => bpf_insn.h}  |   8 +-
 samples/bpf/bpf_load.c|   2 +-
 samples/bpf/bpf_load.h|   2 +-
 samples/bpf/cookie_uid_helper_example.c   |   2 +-
 samples/bpf/cpustat_user.c|   2 +-
 samples/bpf/fds_example.c |   4 +-
 samples/bpf/lathist_user.c|   2 +-
 samples/bpf/load_sock_ops.c   |   2 +-
 samples/bpf/lwt_len_hist_user.c   |   2 +-
 samples/bpf/map_perf_test_user.c  |   2 +-
 samples/bpf/sock_example.c|   3 +-
 samples/bpf/sock_example.h|   1 -
 samples/bpf/sockex1_user.c|   2 +-
 samples/bpf/sockex2_user.c|   2 +-
 samples/bpf/sockex3_user.c|   2 +-
 samples/bpf/syscall_tp_user.c |   2 +-
 samples/bpf/tc_l2_redirect_user.c |   2 +-
 samples/bpf/test_cgrp2_array_pin.c|   2 +-
 samples/bpf/test_cgrp2_attach.c   |   3 +-
 samples/bpf/test_cgrp2_attach2.c  |   3 +-
 samples/bpf/test_cgrp2_sock.c |   3 +-
 samples/bpf/test_cgrp2_sock2.c|   3 +-
 .../bpf/test_current_task_under_cgroup_user.c |   2 +-
 samples/bpf/test_lru_dist.c   |   2 +-
 samples/bpf/test_map_in_map_user.c|   2 +-
 samples/bpf/test_overhead_user.c  |   2 +-
 samples/bpf/test_probe_write_user_user.c  |   2 +-
 samples/bpf/trace_output_user.c   |   2 +-
 samples/bpf/tracex1_user.c|   2 +-
 samples/bpf/tracex2_user.c|   2 +-
 samples/bpf/tracex3_user.c|   2 +-
 samples/bpf/tracex4_user.c|   2 +-
 samples/bpf/tracex5_user.c|   2 +-
 samples/bpf/tracex6_user.c|   2 +-
 samples/bpf/tracex7_user.c|   2 +-
 samples/bpf/xdp_fwd_user.c|   2 +-
 samples/bpf/xdp_monitor_user.c|   2 +-
 samples/bpf/xdp_redirect_cpu_user.c   |   2 +-
 samples/bpf/xdp_redirect_map_user.c   |   2 +-
 samples/bpf/xdp_redirect_user.c   |   2 +-
 samples/bpf/xdp_router_ipv4_user.c|   2 +-
 samples/bpf/xdp_tx_iptunnel_user.c|   2 +-
 samples/bpf/xdpsock_user.c|   2 +-
 44 files changed, 116 insertions(+), 147 deletions(-)
 rename samples/bpf/{libbpf.h => bpf_insn.h} (98%)

-- 
2.17.0

Re: [PATCH net-next] erspan: set bso bit based on mirrored packet's len

2018-05-14 Thread Tobin C. Harding

On Mon, May 14, 2018 at 04:54:36PM -0700, William Tu wrote:
> Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not
> handled.  BSO has 4 possible values:
>   00 --> Good frame with no error, or unknown integrity
>   11 --> Payload is a Bad Frame with CRC or Alignment Error
>   01 --> Payload is a Short Frame
>   10 --> Payload is an Oversized Frame
> 
> Based the short/oversized definitions in RFC1757, the patch sets
> the bso bit based on the mirrored packet's size.
> 
> Reported-by: Xiaoyan Jin 
> Signed-off-by: William Tu 
> ---
>  include/net/erspan.h | 25 +
>  1 file changed, 25 insertions(+)
> 
> diff --git a/include/net/erspan.h b/include/net/erspan.h
> index d044aa60cc76..5eb95f78ad45 100644
> --- a/include/net/erspan.h
> +++ b/include/net/erspan.h
> @@ -219,6 +219,30 @@ static inline __be32 erspan_get_timestamp(void)
>   return htonl((u32)h_usecs);
>  }
>  
> +/* ERSPAN BSO (Bad/Short/Oversized)
> + *   00b --> Good frame with no error, or unknown integrity
> + *   01b --> Payload is a Short Frame
> + *   10b --> Payload is an Oversized Frame
> + *   11b --> Payload is a Bad Frame with CRC or Alignment Error
> + */
> +enum erspan_bso {
> + BSO_NOERROR,
> + BSO_SHORT,
> + BSO_OVERSIZED,
> + BSO_BAD,
> +};

If we are relying on the values perhaps this would be clearer

BSO_NOERROR = 0x00,
BSO_SHORT   = 0x01,
BSO_OVERSIZED   = 0x02,
BSO_BAD = 0x03,

> +
> +static inline u8 erspan_detect_bso(struct sk_buff *skb)
> +{
> + if (skb->len < ETH_ZLEN)
> + return BSO_SHORT;
> +
> + if (skb->len > ETH_FRAME_LEN)
> + return BSO_OVERSIZED;
> +
> + return BSO_NOERROR;
> +}

Without having much contextual knowledge around this patch; should we be
doing some check on CRC or alignment (at some stage)?  Having BSO_BAD
seems to imply so? 


Hope this helps,
Tobin.

Re: [PATCH net-next] sched: cls: enable verbose logging

2018-05-14 Thread Cong Wang

On Mon, May 14, 2018 at 1:47 PM, Marcelo Ricardo Leitner
 wrote:
> On Mon, May 14, 2018 at 01:30:53PM -0700, Cong Wang wrote:
>> On Sun, May 13, 2018 at 1:44 PM, Marcelo Ricardo Leitner
>>  wrote:
>> > Currently, when the rule is not to be exclusively executed by the
>> > hardware, extack is not passed along and offloading failures don't
>> > get logged. The idea was that hardware failures are okay because the
>> > rule will get executed in software then and this way it doesn't confuse
>> > unware users.
>> >
>> > But this is not helpful in case one needs to understand why a certain
>> > rule failed to get offloaded. Considering it may have been a temporary
>> > failure, like resources exceeded or so, reproducing it later and knowing
>> > that it is triggering the same reason may be challenging.
>>
>> I fail to understand why you need a flag here, IOW, why not just pass
>> extack unconditionally?
>
> Because (as discussed in the RFC[1], should have linked it here) it
> could confuse users that are not aware of offloading and, in other
> cases, it can be just noise (like it would be right now for ebpf,
> which is mostly used in sw-path).
>
> 1.https://www.mail-archive.com/netdev@vger.kernel.org/msg223016.html

My point is that a TC filter flag should be used for a filter attribute,
logging is apparently not a part of filter. At least, put it into HW offloading,
not in TC filter.

I know DaveM hates module parameters, but a module parameter here
is more suitable than a TC filter flag.

Re: [PATCH v1 3/4] media: rc bpf: move ir_raw_event to uapi

2018-05-14 Thread Y Song

On Mon, May 14, 2018 at 2:11 PM, Sean Young  wrote:
> The context provided to a BPF_PROG_RAWIR_DECODER is a struct ir_raw_event;
> ensure user space has a a definition.
>
> Signed-off-by: Sean Young 
> ---
>  include/media/rc-core.h| 19 +--
>  include/uapi/linux/bpf_rcdev.h | 24 

Patch #2 already referenced this file. So if Patches #1 and #2
applied, there will be
a compilation error. Could you re-arrange your patchset so that after
sequentially
applying each patch, there is no compilation error?

>  2 files changed, 25 insertions(+), 18 deletions(-)
>  create mode 100644 include/uapi/linux/bpf_rcdev.h
>
> diff --git a/include/media/rc-core.h b/include/media/rc-core.h
> index 6742fd86ff65..5d31e31d8ade 100644
> --- a/include/media/rc-core.h
> +++ b/include/media/rc-core.h
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  /**
> @@ -299,24 +300,6 @@ void rc_keydown_notimeout(struct rc_dev *dev, enum 
> rc_proto protocol,
>  void rc_keyup(struct rc_dev *dev);
>  u32 rc_g_keycode_from_table(struct rc_dev *dev, u32 scancode);
>
> -/*
> - * From rc-raw.c
> - * The Raw interface is specific to InfraRed. It may be a good idea to
> - * split it later into a separate header.
> - */
> -struct ir_raw_event {
> -   union {
> -   u32 duration;
> -   u32 carrier;
> -   };
> -   u8  duty_cycle;
> -
> -   unsignedpulse:1;
> -   unsignedreset:1;
> -   unsignedtimeout:1;
> -   unsignedcarrier_report:1;
> -};
> -
>  #define DEFINE_IR_RAW_EVENT(event) struct ir_raw_event event = {}
>
>  static inline void init_ir_raw_event(struct ir_raw_event *ev)
> diff --git a/include/uapi/linux/bpf_rcdev.h b/include/uapi/linux/bpf_rcdev.h
> new file mode 100644
> index ..d8629ff2b960
> --- /dev/null
> +++ b/include/uapi/linux/bpf_rcdev.h
> @@ -0,0 +1,24 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/* Copyright (c) 2018 Sean Young 
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#ifndef _UAPI__LINUX_BPF_RCDEV_H__
> +#define _UAPI__LINUX_BPF_RCDEV_H__
> +
> +struct ir_raw_event {
> +   union {
> +   __u32   duration;
> +   __u32   carrier;
> +   };
> +   __u8duty_cycle;
> +
> +   unsignedpulse:1;
> +   unsignedreset:1;
> +   unsignedtimeout:1;
> +   unsignedcarrier_report:1;
> +};
> +
> +#endif /* _UAPI__LINUX_BPF_RCDEV_H__ */
> --
> 2.17.0
>

Re: [PATCH v1 2/4] media: bpf: allow raw IR decoder bpf programs to be used

2018-05-14 Thread Y Song

On Mon, May 14, 2018 at 2:10 PM, Sean Young  wrote:
> This implements attaching, detaching, querying and execution. The target
> fd has to be the /dev/lircN device.
>
> Signed-off-by: Sean Young 
> ---
>  drivers/media/rc/ir-bpf-decoder.c | 191 ++
>  drivers/media/rc/lirc_dev.c   |  30 +
>  drivers/media/rc/rc-core-priv.h   |  15 +++
>  drivers/media/rc/rc-ir-raw.c  |   5 +
>  include/uapi/linux/bpf.h  |   1 +
>  kernel/bpf/syscall.c  |   7 ++
>  6 files changed, 249 insertions(+)
>
> diff --git a/drivers/media/rc/ir-bpf-decoder.c 
> b/drivers/media/rc/ir-bpf-decoder.c
> index aaa5e208b1a5..651590a14772 100644
> --- a/drivers/media/rc/ir-bpf-decoder.c
> +++ b/drivers/media/rc/ir-bpf-decoder.c
> @@ -91,3 +91,194 @@ const struct bpf_verifier_ops ir_decoder_verifier_ops = {
> .get_func_proto  = ir_decoder_func_proto,
> .is_valid_access = ir_decoder_is_valid_access
>  };
> +
> +#define BPF_MAX_PROGS 64
> +
> +int rc_dev_bpf_attach(struct rc_dev *rcdev, struct bpf_prog *prog, u32 flags)

flags is not used in this function.

> +{
> +   struct ir_raw_event_ctrl *raw;
> +   struct bpf_prog_array __rcu *old_array;
> +   struct bpf_prog_array *new_array;
> +   int ret;
> +
> +   if (rcdev->driver_type != RC_DRIVER_IR_RAW)
> +   return -EINVAL;
> +
> +   ret = mutex_lock_interruptible(&rcdev->lock);
> +   if (ret)
> +   return ret;
> +
> +   raw = rcdev->raw;
> +
> +   if (raw->progs && bpf_prog_array_length(raw->progs) >= BPF_MAX_PROGS) 
> {
> +   ret = -E2BIG;
> +   goto out;
> +   }
> +
> +   old_array = raw->progs;
> +   ret = bpf_prog_array_copy(old_array, NULL, prog, &new_array);
> +   if (ret < 0)
> +   goto out;
> +
> +   rcu_assign_pointer(raw->progs, new_array);
> +   bpf_prog_array_free(old_array);
> +out:
> +   mutex_unlock(&rcdev->lock);
> +   return ret;
> +}
> +
> +int rc_dev_bpf_detach(struct rc_dev *rcdev, struct bpf_prog *prog, u32 flags)

flags is not used in this function.

> +{
> +   struct ir_raw_event_ctrl *raw;
> +   struct bpf_prog_array __rcu *old_array;
> +   struct bpf_prog_array *new_array;
> +   int ret;
> +
> +   if (rcdev->driver_type != RC_DRIVER_IR_RAW)
> +   return -EINVAL;
> +
> +   ret = mutex_lock_interruptible(&rcdev->lock);
> +   if (ret)
> +   return ret;
> +
> +   raw = rcdev->raw;
> +
> +   old_array = raw->progs;
> +   ret = bpf_prog_array_copy(old_array, prog, NULL, &new_array);
> +   if (ret < 0) {
> +   bpf_prog_array_delete_safe(old_array, prog);
> +   } else {
> +   rcu_assign_pointer(raw->progs, new_array);
> +   bpf_prog_array_free(old_array);
> +   }
> +
> +   bpf_prog_put(prog);
> +   mutex_unlock(&rcdev->lock);
> +   return 0;
> +}
> +
> +void rc_dev_bpf_run(struct rc_dev *rcdev)
> +{
> +   struct ir_raw_event_ctrl *raw = rcdev->raw;
> +
> +   if (raw->progs)
> +   BPF_PROG_RUN_ARRAY(raw->progs, &raw->prev_ev, BPF_PROG_RUN);
> +}
> +
> +void rc_dev_bpf_put(struct rc_dev *rcdev)
> +{
> +   struct bpf_prog_array *progs = rcdev->raw->progs;
> +   int i, size;
> +
> +   if (!progs)
> +   return;
> +
> +   size = bpf_prog_array_length(progs);
> +   for (i = 0; i < size; i++)
> +   bpf_prog_put(progs->progs[i]);
> +
> +   bpf_prog_array_free(rcdev->raw->progs);
> +}
> +
> +int rc_dev_prog_attach(const union bpf_attr *attr)
> +{
> +   struct bpf_prog *prog;
> +   struct rc_dev *rcdev;
> +   int ret;
> +
> +   if (attr->attach_flags & BPF_F_ALLOW_OVERRIDE)
> +   return -EINVAL;

Looks like you really did not use flags except here.
BPF_F_ALLOW_OVERRIDE is originally used for
cgroup type of attachment and the comment explicits
saying so.

In the query below, the flags value "0" is copied to userspace.

In your case, I think you can just disallow any value, i.g.,
attr->attach_flags must be 0, and then you further down
check that if the prog is already in the array, you just return an error.

> +
> +   prog = bpf_prog_get_type(attr->attach_bpf_fd,
> +BPF_PROG_TYPE_RAWIR_DECODER);
> +   if (IS_ERR(prog))
> +   return PTR_ERR(prog);
> +
> +   rcdev = rc_dev_get_from_fd(attr->target_fd);
> +   if (IS_ERR(rcdev)) {
> +   bpf_prog_put(prog);
> +   return PTR_ERR(rcdev);
> +   }
> +
> +   ret = rc_dev_bpf_attach(rcdev, prog, attr->attach_flags);
> +   if (ret)
> +   bpf_prog_put(prog);
> +
> +   put_device(&rcdev->dev);
> +
> +   return ret;
> +}
> +
> +int rc_dev_prog_detach(const union bpf_attr *attr)
> +{
> +   struct bpf_prog *prog;
> +   struct rc_dev *rcdev;
> +   int ret;
> +
> +   if (attr->attach_flags & BPF_F_ALLOW_OVERRIDE)
>

Re: [PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases

2018-05-14 Thread Tobin C. Harding

Hi David,

On Mon, May 14, 2018 at 07:51:05PM -0700, David Ahern wrote:
> This patch set fixes a few append and replace uses cases for IPv6 and
> adds test cases that codifies the expectations of how append and replace
> are expected to work.

Nood question: what commit does this apply on top of please.  I
attempted to apply the set on top of net-next

commit (961423f9fcbc Merge branch 'sctp-Introduce-sctp_flush_ctx')

patch 4 and 5 have merge conflicts (I stopped at 5).

thanks,
Tobin.

Re: possible deadlock in sk_diag_fill

2018-05-14 Thread Dmitry Vyukov

On Mon, May 14, 2018 at 8:00 PM, Andrei Vagin  wrote:
>> >> Hello,
>> >>
>> >> syzbot found the following crash on:
>> >>
>> >> HEAD commit:c1c07416cdd4 Merge tag 'kbuild-fixes-v4.17' of 
>> >> git://git.k..
>> >> git tree:   upstream
>> >> console output: https://syzkaller.appspot.com/x/log.txt?x=12164c9780
>> >> kernel config:  https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27
>> >> dashboard link: 
>> >> https://syzkaller.appspot.com/bug?extid=c1872be62e587eae9669
>> >> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
>> >> userspace arch: i386
>> >>
>> >> Unfortunately, I don't have any reproducer for this crash yet.
>> >>
>> >> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> >> Reported-by: syzbot+c1872be62e587eae9...@syzkaller.appspotmail.com
>> >>
>> >>
>> >> ==
>> >> WARNING: possible circular locking dependency detected
>> >> 4.17.0-rc3+ #59 Not tainted
>> >> --
>> >> syz-executor1/25282 is trying to acquire lock:
>> >> 4fddf743 (&(&u->lock)->rlock/1){+.+.}, at: sk_diag_dump_icons
>> >> net/unix/diag.c:82 [inline]
>> >> 4fddf743 (&(&u->lock)->rlock/1){+.+.}, at:
>> >> sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
>> >>
>> >> but task is already holding lock:
>> >> b6895645 (rlock-AF_UNIX){+.+.}, at: spin_lock
>> >> include/linux/spinlock.h:310 [inline]
>> >> b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_dump_icons
>> >> net/unix/diag.c:64 [inline]
>> >> b6895645 (rlock-AF_UNIX){+.+.}, at: 
>> >> sk_diag_fill.isra.5+0x94e/0x10d0
>> >> net/unix/diag.c:144
>> >>
>> >> which lock already depends on the new lock.
>> >
>> > In the code, we have a comment which explains why it is safe to take this 
>> > lock
>> >
>> > /*
>> >  * The state lock is outer for the same sk's
>> >  * queue lock. With the other's queue locked it's
>> >  * OK to lock the state.
>> >  */
>> > unix_state_lock_nested(req);
>> >
>> > It is a question how to explain this to lockdep.
>>
>> Do I understand it correctly that (&u->lock)->rlock associated with
>> AF_UNIX is locked under rlock-AF_UNIX, and then rlock-AF_UNIX is
>> locked under (&u->lock)->rlock associated with AF_NETLINK? If so, I
>> think we need to split (&u->lock)->rlock by family too, so that we
>> have u->lock-AF_UNIX and u->lock-AF_NETLINK.
>
> I think here is another problem. lockdep woried about
> sk->sk_receive_queue vs unix_sk(s)->lock.
>
> sk_diag_dump_icons() takes sk->sk_receive_queue and then
> unix_sk(s)->lock.
>
> unix_dgram_sendmsg takes unix_sk(sk)->lock and then sk->sk_receive_queue.
>
> sk_diag_dump_icons() takes locks for two different sockets, but
> unix_dgram_sendmsg() takes locks for one socket.
>
> sk_diag_dump_icons
> if (sk->sk_state == TCP_LISTEN) {
> spin_lock(&sk->sk_receive_queue.lock);
> skb_queue_walk(&sk->sk_receive_queue, skb) {
> unix_state_lock_nested(req);
> spin_lock_nested(&unix_sk(s)->lock,
>
>
> unix_dgram_sendmsg
> unix_state_lock(other)
> spin_lock(&unix_sk(s)->lock)
> skb_queue_tail(&other->sk_receive_queue, skb);
> spin_lock_irqsave(&list->lock, flags);


Do you mean the following?
There is socket 1 with state lock (S1) and queue lock (Q2), and socket
2 with state lock (S2) and queue lock (Q2). unix_dgram_sendmsg lock
S1->Q1. And sk_diag_dump_icons locks Q1->S2.
If yes, then this looks pretty much as deadlock. Consider that 2
unix_dgram_sendmsg in 2 different threads lock S1 and S2 respectively.
Now 2  sk_diag_dump_icons in 2 different threads lock Q1 and Q2
respectively. Now sk_diag_dump_icons want to lock S's, and
unix_dgram_sendmsg want to lock Q's. Nobody can proceed.

Re: [PATCH v1 1/4] media: rc: introduce BPF_PROG_IR_DECODER

2018-05-14 Thread Y Song

On Mon, May 14, 2018 at 2:10 PM, Sean Young  wrote:
> Add support for BPF_PROG_IR_DECODER. This type of BPF program can call
> rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report
> that the last key should be repeated.
>
> Signed-off-by: Sean Young 
> ---
>  drivers/media/rc/Kconfig  |  8 +++
>  drivers/media/rc/Makefile |  1 +
>  drivers/media/rc/ir-bpf-decoder.c | 93 +++
>  include/linux/bpf_types.h |  3 +
>  include/uapi/linux/bpf.h  | 16 +-
>  5 files changed, 120 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/media/rc/ir-bpf-decoder.c
>
> diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig
> index eb2c3b6eca7f..10ad6167d87c 100644
> --- a/drivers/media/rc/Kconfig
> +++ b/drivers/media/rc/Kconfig
> @@ -120,6 +120,14 @@ config IR_IMON_DECODER
>remote control and you would like to use it with a raw IR
>receiver, or if you wish to use an encoder to transmit this IR.
>
> +config IR_BPF_DECODER
> +   bool "Enable IR raw decoder using BPF"
> +   depends on BPF_SYSCALL
> +   depends on RC_CORE=y
> +   help
> +  Enable this option to make it possible to load custom IR
> +  decoders written in BPF.
> +
>  endif #RC_DECODERS
>
>  menuconfig RC_DEVICES
> diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile
> index 2e1c87066f6c..12e1118430d0 100644
> --- a/drivers/media/rc/Makefile
> +++ b/drivers/media/rc/Makefile
> @@ -5,6 +5,7 @@ obj-y += keymaps/
>  obj-$(CONFIG_RC_CORE) += rc-core.o
>  rc-core-y := rc-main.o rc-ir-raw.o
>  rc-core-$(CONFIG_LIRC) += lirc_dev.o
> +rc-core-$(CONFIG_IR_BPF_DECODER) += ir-bpf-decoder.o
>  obj-$(CONFIG_IR_NEC_DECODER) += ir-nec-decoder.o
>  obj-$(CONFIG_IR_RC5_DECODER) += ir-rc5-decoder.o
>  obj-$(CONFIG_IR_RC6_DECODER) += ir-rc6-decoder.o
> diff --git a/drivers/media/rc/ir-bpf-decoder.c 
> b/drivers/media/rc/ir-bpf-decoder.c
> new file mode 100644
> index ..aaa5e208b1a5
> --- /dev/null
> +++ b/drivers/media/rc/ir-bpf-decoder.c
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// ir-bpf-decoder.c - handles bpf decoders
> +//
> +// Copyright (C) 2018 Sean Young 
> +
> +#include 
> +#include 
> +#include "rc-core-priv.h"
> +
> +/*
> + * BPF interface for raw IR decoder
> + */
> +const struct bpf_prog_ops ir_decoder_prog_ops = {
> +};
> +
> +BPF_CALL_1(bpf_rc_repeat, struct ir_raw_event*, event)
> +{
> +   struct ir_raw_event_ctrl *ctrl;
> +
> +   ctrl = container_of(event, struct ir_raw_event_ctrl, prev_ev);
> +
> +   rc_repeat(ctrl->dev);
> +   return 0;
> +}
> +
> +static const struct bpf_func_proto rc_repeat_proto = {
> +   .func  = bpf_rc_repeat,
> +   .gpl_only  = true, // rc_repeat is EXPORT_SYMBOL_GPL
> +   .ret_type  = RET_VOID,

I suggest using RET_INTEGER here since we do return an integer 0.
RET_INTEGER will also make it easy to extend if you want to return
a non-zero value for error code or other reasons.

> +   .arg1_type = ARG_PTR_TO_CTX,
> +};
> +
> +BPF_CALL_4(bpf_rc_keydown, struct ir_raw_event*, event, u32, protocol,
> +  u32, scancode, u32, toggle)
> +{
> +   struct ir_raw_event_ctrl *ctrl;
> +
> +   ctrl = container_of(event, struct ir_raw_event_ctrl, prev_ev);
> +   rc_keydown(ctrl->dev, protocol, scancode, toggle != 0);
> +   return 0;
> +}
> +
> +static const struct bpf_func_proto rc_keydown_proto = {
> +   .func  = bpf_rc_keydown,
> +   .gpl_only  = true, // rc_keydown is EXPORT_SYMBOL_GPL
> +   .ret_type  = RET_VOID,

ditto. RET_INTEGER is preferable.

> +   .arg1_type = ARG_PTR_TO_CTX,
> +   .arg2_type = ARG_ANYTHING,
> +   .arg3_type = ARG_ANYTHING,
> +   .arg4_type = ARG_ANYTHING,
> +};
> +
> +static const struct bpf_func_proto *ir_decoder_func_proto(enum bpf_func_id 
> func_id, const struct bpf_prog *prog)
> +{
> +   switch (func_id) {
> +   case BPF_FUNC_rc_repeat:
> +   return &rc_repeat_proto;
> +   case BPF_FUNC_rc_keydown:
> +   return &rc_keydown_proto;
> +   case BPF_FUNC_map_lookup_elem:
> +   return &bpf_map_lookup_elem_proto;
> +   case BPF_FUNC_map_update_elem:
> +   return &bpf_map_update_elem_proto;
> +   case BPF_FUNC_map_delete_elem:
> +   return &bpf_map_delete_elem_proto;
> +   case BPF_FUNC_ktime_get_ns:
> +   return &bpf_ktime_get_ns_proto;
> +   case BPF_FUNC_tail_call:
> +   return &bpf_tail_call_proto;
> +   case BPF_FUNC_get_prandom_u32:
> +   return &bpf_get_prandom_u32_proto;
> +   default:
> +   return NULL;
> +   }
> +}
> +
> +static bool ir_decoder_is_valid_access(int off, int size,
> +  enum bpf_access_type type,
> +  const struct bpf_prog *prog,
> +  struct bpf_insn_access_aux *info

Re: [PATCH net-next 2/3] net: qualcomm: rmnet: Add support for ethtool private stats

2018-05-14 Thread kbuild test robot

Hi Subash,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Subash-Abhinov-Kasiviswanathan/net-qualcomm-rmnet-Updates-2018-05-14/20180515-115355
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c: In function 
'rmnet_map_checksum_downlink_packet':
>> drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:382:2: warning: this 
>> 'else' clause does not guard... [-Wmisleading-indentation]
 else
 ^~~~
   drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c:384:3: note: ...this 
statement, but the latter is misleadingly indented as if it were guarded by the 
'else'
  return -EPROTONOSUPPORT;
  ^~

vim +/else +382 drivers/net/ethernet/qualcomm/rmnet/rmnet_map_data.c

   349  
   350  /* Validates packet checksums. Function takes a pointer to
   351   * the beginning of a buffer which contains the IP payload +
   352   * padding + checksum trailer.
   353   * Only IPv4 and IPv6 are supported along with TCP & UDP.
   354   * Fragmented or tunneled packets are not supported.
   355   */
   356  int rmnet_map_checksum_downlink_packet(struct sk_buff *skb, u16 len)
   357  {
   358  struct rmnet_priv *priv = netdev_priv(skb->dev);
   359  struct rmnet_map_dl_csum_trailer *csum_trailer;
   360  
   361  if (unlikely(!(skb->dev->features & NETIF_F_RXCSUM))) {
   362  priv->stats.csum_sw++;
   363  return -EOPNOTSUPP;
   364  }
   365  
   366  csum_trailer = (struct rmnet_map_dl_csum_trailer *)(skb->data + 
len);
   367  
   368  if (!csum_trailer->valid) {
   369  priv->stats.csum_valid_unset++;
   370  return -EINVAL;
   371  }
   372  
   373  if (skb->protocol == htons(ETH_P_IP))
   374  return rmnet_map_ipv4_dl_csum_trailer(skb, 
csum_trailer, priv);
   375  else if (skb->protocol == htons(ETH_P_IPV6))
   376  #if IS_ENABLED(CONFIG_IPV6)
   377  return rmnet_map_ipv6_dl_csum_trailer(skb, 
csum_trailer, priv);
   378  #else
   379  priv->stats.csum_err_invalid_ip_version++;
   380  return -EPROTONOSUPPORT;
   381  #endif
 > 382  else
   383  priv->stats.csum_err_invalid_ip_version++;
   384  return -EPROTONOSUPPORT;
   385  
   386  return 0;
   387  }
   388  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

[PATCH net] tcp: purge write queue in tcp_connect_init()

2018-05-14 Thread Eric Dumazet

syzkaller found a reliable way to crash the host, hitting a BUG()
in __tcp_retransmit_skb()

Malicous MSG_FASTOPEN is the root cause. We need to purge write queue
in tcp_connect_init() at the point we init snd_una/write_seq.

This patch also replaces the BUG() by a less intrusive WARN_ON_ONCE()

kernel BUG at net/ipv4/tcp_output.c:2837!
invalid opcode:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 5276 Comm: syz-executor0 Not tainted 4.17.0-rc3+ #51
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
RIP: 0010:__tcp_retransmit_skb+0x2992/0x2eb0 net/ipv4/tcp_output.c:2837
RSP: :8801dae06ff8 EFLAGS: 00010206
RAX: 8801b9fe61c0 RBX: ffc18a16 RCX: 864e1a49
RDX: 0100 RSI: 864e2e12 RDI: 0005
RBP: 8801dae073a0 R08: 8801b9fe61c0 R09: ed0039c40dd2
R10: ed0039c40dd2 R11: 8801ce206e93 R12: 421eeaad
R13: 8801ce206d4e R14: 8801ce206cc0 R15: 8801cd4f4a80
FS:  () GS:8801dae0(0063) knlGS:096bc900
CS:  0010 DS: 002b ES: 002b CR0: 80050033
CR2: 2000 CR3: 0001c47b6000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 
 tcp_retransmit_skb+0x2e/0x250 net/ipv4/tcp_output.c:2923
 tcp_retransmit_timer+0xc50/0x3060 net/ipv4/tcp_timer.c:488
 tcp_write_timer_handler+0x339/0x960 net/ipv4/tcp_timer.c:573
 tcp_write_timer+0x111/0x1d0 net/ipv4/tcp_timer.c:593
 call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
 expire_timers kernel/time/timer.c:1363 [inline]
 __run_timers+0x79e/0xc50 kernel/time/timer.c:1666
 run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
 __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1d1/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:525 [inline]
 smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863

Fixes: cf60af03ca4e ("net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
Signed-off-by: Eric Dumazet 
Cc: Yuchung Cheng 
Cc: Neal Cardwell 
Reported-by: syzbot 
---
 net/ipv4/tcp_output.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
383cac0ff0ec059ca7dbc1a6304cc7f8183e008d..d07e34f8e3091144976358674b92458076f92bfb
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2833,8 +2833,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
return -EBUSY;
 
if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
-   if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
-   BUG();
+   if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {
+   WARN_ON_ONCE(1);
+   return -EINVAL;
+   }
if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
return -ENOMEM;
}
@@ -3342,6 +3344,7 @@ static void tcp_connect_init(struct sock *sk)
sock_reset_flag(sk, SOCK_DONE);
tp->snd_wnd = 0;
tcp_init_wl(tp, 0);
+   tcp_write_queue_purge(sk);
tp->snd_una = tp->write_seq;
tp->snd_sml = tp->write_seq;
tp->snd_up = tp->write_seq;
-- 
2.17.0.441.gb46fe60e1d-goog

Re: [PATCH iproute2] ip: do not drop capabilities if net_admin=i is set

2018-05-14 Thread Stephen Hemminger

On Fri, 11 May 2018 13:39:56 +0100
Luca Boccassi  wrote:

> Users have reported a regression due to ip now dropping capabilities
> unconditionally.
> zerotier-one VPN and VirtualBox use ambient capabilities in their
> binary and then fork out to ip to set routes and links, and this
> does not work anymore.
> 
> As a workaround, do not drop caps if CAP_NET_ADMIN (the most common
> capability used by ip) is set with the INHERITABLE flag.
> Users that want ip vrf exec to work do not need to set INHERITABLE,
> which will then only set when the calling program had privileges to
> give itself the ambient capability.
> 
> Fixes: ba2fc55b99f8 ("Drop capabilities if not running ip exec vrf with 
> libcap")
> 
> Signed-off-by: Luca Boccassi 

Applied

mounting NFS on the same host leads to D state

2018-05-14 Thread maowenan

Hi,

Recently I have tested NFS and exportfs scenario,
that NFS server and client are in the same host.
And I found mounting NFS filesystm onto the same host
can lead to rpc.mountd and related task become D state.
My kernel version is based on 3.10, and I find 4.15 has the same
appearance.

My test step as below:
1)create dir.
mkdir -p /home/test1 /home/test2
2)share dir /home/test1
echo '/home/test1 localhost(rw,all_squash,anonuid=0,anongid=0)' > /etc/exports
3)exportfs
exportfs -vr || echo "Failed to export /home/test1"
4)mount NFS.
mount localhost:/home/test1 /home/test2 -o vers=3,soft
5)share dir /home/test2
echo '/home/test2 *(rw,all_squash,anonuid=0,anongid=0)' >> /etc/exports
6)exportfs
exportfs -vr
7) list /home/test2
ls /home/test2
then we found ls command is hung, ls and rpc.mountd became "D" state, and after
180 second ls command return.

Another scenario as below:
1)create dir.
mkdir -p /home/test3 /home/test4
2)share dir /home/test3
echo '/home/test3 
localhost(rw,sync,no_wdelay,anonuid=0,anongid=0,no_subtree_check)' > 
/etc/exports
3)exportfs
exportfs -r
4)to see NFS status
showmount -e localhost
5)mount NFS
mount -t nfs4 -o proto=tcp,nolock,soft,timeo=50 localhost:/home/test3 
/home/test4
6) stop nfs service,and  and check ls task state is D.
service nfs stop
ls /home/test4
ls command is hung and became D state.

I wonder to know is it reasonable about these test scenario because NFS server 
and
client are in the same host? Since some task went into D state, is there any 
reason about this?
and is there any patch to fix this issue?
Here is a link to talk about NFS mounting on the same host,  
https://lwn.net/Articles/595652/

Re: [RFC v2 bpf-next 8/9] bpf: Provide helper to do lookups in kernel FIB table

2018-05-14 Thread David Ahern

On 4/29/18 7:13 PM, David Ahern wrote:
> 
> The idea here is to fast pass packets that fit a supported profile and
> are to be forwarded. Everything else should continue up the stack as it
> has wider capabilities. The helper and XDP programs should make no
> assumptions on what the broader kernel and userspace might be monitoring
> or want to do with packets that can not be forwarded in the fast path.
> This is very similar to hardware forwarding when it punts packets to the
> CPU for control plane assistance.
> 

Thinking about this some more and how to return more information to the
bpf program about the FIB lookup.

bpf_fib_lookup struct is 64-bytes. It can not be expanded without
hurting performance. I could do another union on an input parameter and
return flags indicating why the returned index is 0. Something like this:

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 360a1168c353..75591522444c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2314,6 +2314,12 @@ struct bpf_raw_tracepoint_args {
 #define BPF_FIB_LOOKUP_DIRECT  BIT(0)
 #define BPF_FIB_LOOKUP_OUTPUT  BIT(1)

+#define BPF_FIB_LKUP_RET_NO_FWD  BIT(0)  /* pkt is not fwded */
+#define BPF_FIB_LKUP_RET_UNSUPP_LWT  BIT(1)  /* fwd requires unsupp
encap */
+#define BPF_FIB_LKUP_RET_NO_NHDEVBIT(2)  /* nh device does not exist */
+#define BPF_FIB_LKUP_RET_NO_NEIGHBIT(3)  /* no neigh entry for nh */
+#define BPF_FIB_LKUP_RET_FRAG_NEEDED BIT(4)  /* pkt too big to fwd */
+
 struct bpf_fib_lookup {
/* input */
__u8family;   /* network family, AF_INET, AF_INET6, AF_MPLS */
@@ -2325,7 +2331,11 @@ struct bpf_fib_lookup {

/* total length of packet from network header - used for MTU
check */
__u16   tot_len;
-   __u32   ifindex;  /* L3 device index for lookup */
+
+   union {
+   __u32   ifindex;   /* in: L3 device index for lookup */
+   __u32   ret_flags; /* out: BPF_FIB_LOOKUP_RET flags */
+   }

union {
/* inputs to lookup */


Similarly for the fib result, it could be returned with a union on say
family:
union {
__u8 family;   /* in: network family, AF_INET, AF_INET6, AF_MPLS */
__u8 rt_type;  /* out: FIB lookup route type */
};

Then if the fib result is -EINVAL/-EHOSTUNREACH/-EACCES, rt_type is set
to RTN_BLACKHOLE/RTN_UNREACHABLE/RTN_PROHIBIT allowing the XDP program
to make an informed decision on dropping the packet.

To avoid performance hits on the forwarding path, these return values
would *only* set if the ifindex returned is 0.

Re: [PATCH] lib/rhashtable: reorder some inititalization sequences

2018-05-14 Thread Herbert Xu

On Mon, May 14, 2018 at 10:52:13PM -0400, David Miller wrote:
> From: Davidlohr Bueso 
> Date: Mon, 14 May 2018 08:13:32 -0700
> 
> > rhashtable_init() allocates memory at the very end of the
> > call, once everything is setup; with the exception of the
> > nelems parameter. However, unless the user is doing something
> > bogus with params for which -EINVAL is returned, memory
> > allocation is the only operation that can trigger the call
> > to fail.
> > 
> > Thus move bucket_table_alloc() up such that we fail back to
> > the caller asap, instead of doing useless checks. This is
> > safe as the the table allocation isn't using the halfly
> > setup 'ht' structure and bucket_table_alloc() call chain only
> > ends up using the ht->nulls_base member in INIT_RHT_NULLS_HEAD.
> > 
> > Also move the locking initialization down to the end.
> > 
> > Signed-off-by: Davidlohr Bueso 
> 
> The user potentially "doing something bogus" is why the most
> expensive part of the initialization (the memory allocation)
> is done after everything else is validated.
> 
> I think it's best to keep things as-is.

I agree.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH ghak81 RFC V2 3/5] audit: use inline function to get audit context

2018-05-14 Thread Richard Guy Briggs

On 2018-05-14 23:05, Richard Guy Briggs wrote:
> On 2018-05-14 17:44, Paul Moore wrote:
> > On Sat, May 12, 2018 at 9:58 PM, Richard Guy Briggs  wrote:
> > > Recognizing that the audit context is an internal audit value, use an
> > > access function to retrieve the audit context pointer for the task
> > > rather than reaching directly into the task struct to get it.
> > >
> > > Signed-off-by: Richard Guy Briggs 
> > > ---
> > >  include/linux/audit.h| 14 ++--
> > >  include/net/xfrm.h   |  2 +-
> > >  kernel/audit.c   |  6 ++--
> > >  kernel/audit_watch.c |  2 +-
> > >  kernel/auditsc.c | 64 
> > > +---
> > >  net/bridge/netfilter/ebtables.c  |  2 +-
> > >  net/core/dev.c   |  2 +-
> > >  net/netfilter/x_tables.c |  2 +-
> > >  net/netlabel/netlabel_user.c |  2 +-
> > >  security/integrity/ima/ima_api.c |  2 +-
> > >  security/integrity/integrity_audit.c |  2 +-
> > >  security/lsm_audit.c |  2 +-
> > >  security/selinux/hooks.c |  4 +--
> > >  security/selinux/selinuxfs.c |  6 ++--
> > >  security/selinux/ss/services.c   | 12 +++
> > >  15 files changed, 64 insertions(+), 60 deletions(-)
> > 
> > Merged, but there was some fuzz due to the missing 1/5 patch and a
> > handfull of checkpatch.pl fixes.  Please take a look at the commit in
> > the audit/next branch and if anything looks awry please send a patch
> > to fix it.
> 
> Some of that fuzz was due to the two patches (ghak46/47) that went
> through the xelinux tree...  There will be a merge conflict.
> 
> Otherwise, looks ok.

Spoke too soon, missed one from the new seccomp actions_logged...

Patch pending...

> > paul moore
> 
> - RGB

- RGB

--
Richard Guy Briggs 
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

Re: [RFC bpf-next 11/11] Documentation: Describe bpf reference tracking

2018-05-14 Thread Alexei Starovoitov

On Wed, May 09, 2018 at 02:07:09PM -0700, Joe Stringer wrote:
> Signed-off-by: Joe Stringer 
> ---
>  Documentation/networking/filter.txt | 64 
> +
>  1 file changed, 64 insertions(+)
> 
> diff --git a/Documentation/networking/filter.txt 
> b/Documentation/networking/filter.txt
> index 5032e1263bc9..77be17977bc5 100644
> --- a/Documentation/networking/filter.txt
> +++ b/Documentation/networking/filter.txt
> @@ -1125,6 +1125,14 @@ pointer type.  The types of pointers describe their 
> base, as follows:
>  PTR_TO_STACKFrame pointer.
>  PTR_TO_PACKET   skb->data.
>  PTR_TO_PACKET_END   skb->data + headlen; arithmetic forbidden.
> +PTR_TO_SOCKET   Pointer to struct bpf_sock_ops, implicitly 
> refcounted.
> +PTR_TO_SOCKET_OR_NULL
> +Either a pointer to a socket, or NULL; socket lookup
> +returns this type, which becomes a PTR_TO_SOCKET when
> +checked != NULL. PTR_TO_SOCKET is reference-counted,
> +so programs must release the reference through the
> +socket release function before the end of the 
> program.
> +Arithmetic on these pointers is forbidden.
>  However, a pointer may be offset from this base (as a result of pointer
>  arithmetic), and this is tracked in two parts: the 'fixed offset' and 
> 'variable
>  offset'.  The former is used when an exactly-known value (e.g. an immediate
> @@ -1168,6 +1176,13 @@ over the Ethernet header, then reads IHL and addes 
> (IHL * 4), the resulting
>  pointer will have a variable offset known to be 4n+2 for some n, so adding 
> the 2
>  bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses 
> through
>  that pointer are safe.
> +The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, 
> common
> +to all copies of the pointer returned from a socket lookup. This has similar
> +behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
> +it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
> +represents a reference to the corresponding 'struct sock'. To ensure that the
> +reference is not leaked, it is imperative to NULL-check the reference and in
> +the non-NULL case, and pass the valid reference to the socket release 
> function.
>  
>  Direct packet access
>  
> @@ -1441,6 +1456,55 @@ Error:
>8: (7a) *(u64 *)(r0 +0) = 1
>R0 invalid mem access 'imm'
>  
> +Program that performs a socket lookup then sets the pointer to NULL without
> +checking it:
> +value:
> +  BPF_MOV64_IMM(BPF_REG_2, 0),
> +  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
> +  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +  BPF_MOV64_IMM(BPF_REG_3, 4),
> +  BPF_MOV64_IMM(BPF_REG_4, 0),
> +  BPF_MOV64_IMM(BPF_REG_5, 0),
> +  BPF_EMIT_CALL(BPF_FUNC_sk_lookup),
> +  BPF_MOV64_IMM(BPF_REG_0, 0),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (b7) r2 = 0
> +  1: (63) *(u32 *)(r10 -8) = r2
> +  2: (bf) r2 = r10
> +  3: (07) r2 += -8
> +  4: (b7) r3 = 4
> +  5: (b7) r4 = 0
> +  6: (b7) r5 = 0
> +  7: (85) call bpf_sk_lookup#65
> +  8: (b7) r0 = 0
> +  9: (95) exit
> +  Unreleased reference id=1, alloc_insn=7
> +
> +Program that performs a socket lookup but does not NULL-check the returned
> +value:
> +  BPF_MOV64_IMM(BPF_REG_2, 0),
> +  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
> +  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> +  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +  BPF_MOV64_IMM(BPF_REG_3, 4),
> +  BPF_MOV64_IMM(BPF_REG_4, 0),
> +  BPF_MOV64_IMM(BPF_REG_5, 0),
> +  BPF_EMIT_CALL(BPF_FUNC_sk_lookup),
> +  BPF_EXIT_INSN(),
> +Error:
> +  0: (b7) r2 = 0
> +  1: (63) *(u32 *)(r10 -8) = r2
> +  2: (bf) r2 = r10
> +  3: (07) r2 += -8
> +  4: (b7) r3 = 4
> +  5: (b7) r4 = 0
> +  6: (b7) r5 = 0
> +  7: (85) call bpf_sk_lookup#65
> +  8: (95) exit
> +  Unreleased reference id=1, alloc_insn=7

Nice. Thank you for updating this doc. We haven't touched it in long time.
It probably long overdue for complete overhaul.

Re: [RFC bpf-next 07/11] bpf: Add helper to retrieve socket in BPF

2018-05-14 Thread Alexei Starovoitov

On Fri, May 11, 2018 at 05:54:33PM -0700, Joe Stringer wrote:
> On 11 May 2018 at 14:41, Martin KaFai Lau  wrote:
> > On Fri, May 11, 2018 at 02:08:01PM -0700, Joe Stringer wrote:
> >> On 10 May 2018 at 22:00, Martin KaFai Lau  wrote:
> >> > On Wed, May 09, 2018 at 02:07:05PM -0700, Joe Stringer wrote:
> >> >> This patch adds a new BPF helper function, sk_lookup() which allows BPF
> >> >> programs to find out if there is a socket listening on this host, and
> >> >> returns a socket pointer which the BPF program can then access to
> >> >> determine, for instance, whether to forward or drop traffic. sk_lookup()
> >> >> takes a reference on the socket, so when a BPF program makes use of this
> >> >> function, it must subsequently pass the returned pointer into the newly
> >> >> added sk_release() to return the reference.
> >> >>
> >> >> By way of example, the following pseudocode would filter inbound
> >> >> connections at XDP if there is no corresponding service listening for
> >> >> the traffic:
> >> >>
> >> >>   struct bpf_sock_tuple tuple;
> >> >>   struct bpf_sock_ops *sk;
> >> >>
> >> >>   populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
> >> >>   sk = bpf_sk_lookup(ctx, &tuple, sizeof tuple, netns, 0);
> >> >>   if (!sk) {
> >> >> // Couldn't find a socket listening for this traffic. Drop.
> >> >> return TC_ACT_SHOT;
> >> >>   }
> >> >>   bpf_sk_release(sk, 0);
> >> >>   return TC_ACT_OK;
> >> >>
> >> >> Signed-off-by: Joe Stringer 
> >> >> ---
> >>
> >> ...
> >>
> >> >> @@ -4032,6 +4036,96 @@ static const struct bpf_func_proto 
> >> >> bpf_skb_get_xfrm_state_proto = {
> >> >>  };
> >> >>  #endif
> >> >>
> >> >> +struct sock *
> >> >> +sk_lookup(struct net *net, struct bpf_sock_tuple *tuple) {
> >> > Would it be possible to have another version that
> >> > returns a sk without taking its refcnt?
> >> > It may have performance benefit.
> >>
> >> Not really. The sockets are not RCU-protected, and established sockets
> >> may be torn down without notice. If we don't take a reference, there's
> >> no guarantee that the socket will continue to exist for the duration
> >> of running the BPF program.
> >>
> >> From what I follow, the comment below has a hidden implication which
> >> is that sockets without SOCK_RCU_FREE, eg established sockets, may be
> >> directly freed regardless of RCU.
> > Right, SOCK_RCU_FREE sk is the one I am concern about.
> > For example, TCP_LISTEN socket does not require taking a refcnt
> > now.  Doing a bpf_sk_lookup() may have a rather big
> > impact on handling TCP syn flood.  or the usual intention
> > is to redirect instead of passing it up to the stack?
> 
> I see, if you're only interested in listen sockets then probably this
> series could be extended with a new flag, eg something like
> BPF_F_SK_FIND_LISTENERS which restricts the set of possible sockets
> found to only listen sockets, then the implementation would call into
> __inet_lookup_listener() instead of inet_lookup(). The presence of
> that flag in the relevant register during CALL instruction would show
> that the verifier should not reference-track the result, then there'd
> need to be a check on the release to ensure that this unreferenced
> socket is never released. Just a thought, completely untested and I
> could still be missing some detail..
> 
> That said, I don't completely follow how you would expect to handle
> the traffic for sockets that are already established - the helper
> would no longer find those sockets, so you wouldn't know whether to
> pass the traffic up the stack for established traffic or not.

I think Martin has a valid concern here that if somebody starts using
this helper on the rx traffic the bpf program (via these two new
helpers) will be doing refcnt++ and refcnt-- even for listener
sockets which will cause synflood to suffer.
One can argue that this is bpf author mistake, but without fixes
(and api changes) to the helper the programmer doesn't really have a way
of avoiding this situation.
Also udp sockets don't need refcnt at all.
How about we split this single helper into three:
- bpf_sk_lookup_tcp_established() that will return refcnt-ed socket
and has to be bpf_sk_release()d by the program.
- bpf_sk_lookup_tcp_listener() that doesn't refcnt, since progs
run in rcu.
- bpf_sk_lookup_udp() that also doesn't refcnt.
The logic you want to put into this helper can be easily
replicated with these three helpers and the whole thing will
be much faster.
Thoughts?

Re: [PATCH net-next v3 0/3] sctp: Introduce sctp_flush_ctx

2018-05-14 Thread David Miller

From: Marcelo Ricardo Leitner 
Date: Mon, 14 May 2018 14:35:17 -0300

> This struct will hold all the context used during the outq flush, so we
> don't have to pass lots of pointers all around.
> 
> Checked on x86_64, the compiler inlines all these functions and there is no
> derreference added because of the struct.
> 
> This patchset depends on 'sctp: refactor sctp_outq_flush'
> 
> Changes since v1:
> - updated to build on top of v2 of 'sctp: refactor sctp_outq_flush'
> 
> Changes since v2:
> - fixed a rebase issue which reverted a change in patch 2.
> - rebased on v3 of 'sctp: refactor sctp_outq_flush'

Also applied, thanks Marcelo.

Re: [PATCH ghak81 RFC V2 3/5] audit: use inline function to get audit context

2018-05-14 Thread Richard Guy Briggs

On 2018-05-14 17:44, Paul Moore wrote:
> On Sat, May 12, 2018 at 9:58 PM, Richard Guy Briggs  wrote:
> > Recognizing that the audit context is an internal audit value, use an
> > access function to retrieve the audit context pointer for the task
> > rather than reaching directly into the task struct to get it.
> >
> > Signed-off-by: Richard Guy Briggs 
> > ---
> >  include/linux/audit.h| 14 ++--
> >  include/net/xfrm.h   |  2 +-
> >  kernel/audit.c   |  6 ++--
> >  kernel/audit_watch.c |  2 +-
> >  kernel/auditsc.c | 64 
> > +---
> >  net/bridge/netfilter/ebtables.c  |  2 +-
> >  net/core/dev.c   |  2 +-
> >  net/netfilter/x_tables.c |  2 +-
> >  net/netlabel/netlabel_user.c |  2 +-
> >  security/integrity/ima/ima_api.c |  2 +-
> >  security/integrity/integrity_audit.c |  2 +-
> >  security/lsm_audit.c |  2 +-
> >  security/selinux/hooks.c |  4 +--
> >  security/selinux/selinuxfs.c |  6 ++--
> >  security/selinux/ss/services.c   | 12 +++
> >  15 files changed, 64 insertions(+), 60 deletions(-)
> 
> Merged, but there was some fuzz due to the missing 1/5 patch and a
> handfull of checkpatch.pl fixes.  Please take a look at the commit in
> the audit/next branch and if anything looks awry please send a patch
> to fix it.

Some of that fuzz was due to the two patches (ghak46/47) that went
through the xelinux tree...  There will be a merge conflict.

Otherwise, looks ok.

> paul moore

- RGB

--
Richard Guy Briggs 
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

Re: [RFC bpf-next 06/11] bpf: Add reference tracking to verifier

2018-05-14 Thread Alexei Starovoitov

On Wed, May 09, 2018 at 02:07:04PM -0700, Joe Stringer wrote:
> Allow helper functions to acquire a reference and return it into a
> register. Specific pointer types such as the PTR_TO_SOCKET will
> implicitly represent such a reference. The verifier must ensure that
> these references are released exactly once in each path through the
> program.
> 
> To achieve this, this commit assigns an id to the pointer and tracks it
> in the 'bpf_func_state', then when the function or program exits,
> verifies that all of the acquired references have been freed. When the
> pointer is passed to a function that frees the reference, it is removed
> from the 'bpf_func_state` and all existing copies of the pointer in
> registers are marked invalid.
> 
> Signed-off-by: Joe Stringer 
> ---
>  include/linux/bpf_verifier.h |  18 ++-
>  kernel/bpf/verifier.c| 295 
> ---
>  2 files changed, 292 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 9dcd87f1d322..8dbee360b3ec 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -104,6 +104,11 @@ struct bpf_stack_state {
>   u8 slot_type[BPF_REG_SIZE];
>  };
>  
> +struct bpf_reference_state {
> + int id;
> + int insn_idx; /* allocation insn */

the insn_idx is for more verbose messages, right?
It doesn't seem to affect the safety of algorithm.
Please add a comment to clarify that.

> +};
> +
>  /* state of the program:
>   * type of all registers and stack info
>   */
> @@ -122,7 +127,9 @@ struct bpf_func_state {
>*/
>   u32 subprogno;
>  
> - /* should be second to last. See copy_func_state() */
> + /* The following fields should be last. See copy_func_state() */
> + int acquired_refs;
> + struct bpf_reference_state *refs;
>   int allocated_stack;
>   struct bpf_stack_state *stack;
>  };
> @@ -218,11 +225,16 @@ void bpf_verifier_vlog(struct bpf_verifier_log *log, 
> const char *fmt,
>  __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
>  const char *fmt, ...);
>  
> -static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
> +static inline struct bpf_func_state *cur_func(struct bpf_verifier_env *env)
>  {
>   struct bpf_verifier_state *cur = env->cur_state;
>  
> - return cur->frame[cur->curframe]->regs;
> + return cur->frame[cur->curframe];
> +}
> +
> +static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
> +{
> + return cur_func(env)->regs;
>  }
>  
>  int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env);
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f426ebf2b6bf..92b9a5dc465a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1,5 +1,6 @@
>  /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
>   * Copyright (c) 2016 Facebook
> + * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io
>   *
>   * This program is free software; you can redistribute it and/or
>   * modify it under the terms of version 2 of the GNU General Public
> @@ -140,6 +141,18 @@ static const struct bpf_verifier_ops * const 
> bpf_verifier_ops[] = {
>   *
>   * After the call R0 is set to return type of the function and registers 
> R1-R5
>   * are set to NOT_INIT to indicate that they are no longer readable.
> + *
> + * The following reference types represent a potential reference to a kernel
> + * resource which, after first being allocated, must be checked and freed by
> + * the BPF program:
> + * - PTR_TO_SOCKET_OR_NULL, PTR_TO_SOCKET
> + *
> + * When the verifier sees a helper call return a reference type, it 
> allocates a
> + * pointer id for the reference and stores it in the current function state.
> + * Similar to the way that PTR_TO_MAP_VALUE_OR_NULL is converted into
> + * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET when the 
> type
> + * passes through a NULL-check conditional. For the branch wherein the state 
> is
> + * changed to CONST_IMM, the verifier releases the reference.
>   */
>  
>  /* verifier_state + insn_idx are pushed to stack when branch is encountered 
> */
> @@ -229,7 +242,42 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
>  
>  static bool reg_type_may_be_null(enum bpf_reg_type type)
>  {
> - return type == PTR_TO_MAP_VALUE_OR_NULL;
> + return type == PTR_TO_MAP_VALUE_OR_NULL ||
> +type == PTR_TO_SOCKET_OR_NULL;
> +}
> +
> +static bool type_is_refcounted(enum bpf_reg_type type)
> +{
> + return type == PTR_TO_SOCKET;
> +}
> +
> +static bool type_is_refcounted_or_null(enum bpf_reg_type type)
> +{
> + return type == PTR_TO_SOCKET || type == PTR_TO_SOCKET_OR_NULL;
> +}
> +
> +static bool reg_is_refcounted(const struct bpf_reg_state *reg)
> +{
> + return type_is_refcounted(reg->type);
> +}
> +
> +static bool reg_is_refcounted_or_null(const struct bpf_re

Re: [PATCH net-next v3 0/8] sctp: refactor sctp_outq_flush

2018-05-14 Thread David Miller

From: Marcelo Ricardo Leitner 
Date: Mon, 14 May 2018 14:34:35 -0300

> Currently sctp_outq_flush does many different things and arguably
> unrelated, such as doing transport selection and outq dequeueing.
> 
> This patchset refactors it into smaller and more dedicated functions.
> The end behavior should be the same.
> 
> The next patchset will rework the function parameters.
> 
> Changes since v1:
> - fix build issues on patches 3 and 4, and updated 5 and 8 because of
>   it.
> 
> Changes since v2:
> - fixed panic if building with just up to patch 3 applied

Series applied, thanks.

Re: [PATCH] lib/rhashtable: reorder some inititalization sequences

2018-05-14 Thread David Miller

From: Davidlohr Bueso 
Date: Mon, 14 May 2018 08:13:32 -0700

> rhashtable_init() allocates memory at the very end of the
> call, once everything is setup; with the exception of the
> nelems parameter. However, unless the user is doing something
> bogus with params for which -EINVAL is returned, memory
> allocation is the only operation that can trigger the call
> to fail.
> 
> Thus move bucket_table_alloc() up such that we fail back to
> the caller asap, instead of doing useless checks. This is
> safe as the the table allocation isn't using the halfly
> setup 'ht' structure and bucket_table_alloc() call chain only
> ends up using the ht->nulls_base member in INIT_RHT_NULLS_HEAD.
> 
> Also move the locking initialization down to the end.
> 
> Signed-off-by: Davidlohr Bueso 

The user potentially "doing something bogus" is why the most
expensive part of the initialization (the memory allocation)
is done after everything else is validated.

I think it's best to keep things as-is.

[PATCH RFC net-next 3/7] selftests: fib_tests: Add success-fail counts

2018-05-14 Thread David Ahern

As more tests are added, it is convenient to have a tally at the end.

Signed-off-by: David Ahern 
---
 tools/testing/selftests/net/fib_tests.sh | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/testing/selftests/net/fib_tests.sh 
b/tools/testing/selftests/net/fib_tests.sh
index 9164e60d4b66..7e2291161e15 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -18,8 +18,10 @@ log_test()
 
if [ ${rc} -eq ${expected} ]; then
printf "TEST: %-60s  [ OK ]\n" "${msg}"
+   nsuccess=$((nsuccess+1))
else
ret=1
+   nfail=$((nfail+1))
printf "TEST: %-60s  [FAIL]\n" "${msg}"
if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
echo
@@ -598,4 +600,9 @@ cleanup &> /dev/null
 
 fib_test
 
+if [ "$TESTS" != "none" ]; then
+   printf "\nTests passed: %3d\n" ${nsuccess}
+   printf "Tests failed: %3d\n"   ${nfail}
+fi
+
 exit $ret
-- 
2.11.0

[PATCH RFC net-next 5/7] selftests: fib_tests: Add option to pause after each test

2018-05-14 Thread David Ahern

Add option to pause after each test before cleanup is done. Allows
user to do manual inspection or more ad-hoc testing after each test
with the setup in tact.

Signed-off-by: David Ahern 
---
 tools/testing/selftests/net/fib_tests.sh | 13 +
 1 file changed, 13 insertions(+)

diff --git a/tools/testing/selftests/net/fib_tests.sh 
b/tools/testing/selftests/net/fib_tests.sh
index 8c99f0689efc..12b648826151 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -9,6 +9,7 @@ ret=0
 TESTS="unregister down carrier nexthop"
 VERBOSE=0
 PAUSE_ON_FAIL=no
+PAUSE=no
 IP="ip -netns testns"
 
 log_test()
@@ -31,6 +32,13 @@ log_test()
[ "$a" = "q" ] && exit 1
fi
fi
+
+   if [ "${PAUSE}" = "yes" ]; then
+   echo
+   echo "hit enter to continue, 'q' to quit"
+   read a
+   [ "$a" = "q" ] && exit 1
+   fi
 }
 
 setup()
@@ -576,6 +584,7 @@ usage: ${0##*/} OPTS
 -tTest(s) to run (default: all)
 (options: $TESTS)
 -p  Pause on fail
+-P  Pause after each test before cleanup
 -v  verbose mode (show commands and output)
 EOF
 }
@@ -588,6 +597,7 @@ do
case $o in
t) TESTS=$OPTARG;;
p) PAUSE_ON_FAIL=yes;;
+   P) PAUSE=yes;;
v) VERBOSE=$(($VERBOSE + 1));;
h) usage; exit 0;;
*) usage; exit 1;;
@@ -596,6 +606,9 @@ done
 
 PEER_CMD="ip netns exec ${PEER_NS}"
 
+# make sure we don't pause twice
+[ "${PAUSE}" = "yes" ] && PAUSE_ON_FAIL=no
+
 if [ "$(id -u)" -ne 0 ];then
echo "SKIP: Need root privileges"
exit 0
-- 
2.11.0

[PATCH RFC net-next 2/7] net/ipv6: Simplify route replace and appending into multipath route

2018-05-14 Thread David Ahern

Remove rt6_qualify_for_ecmp which is just guess work. It fails in 2 cases:
1. can not replace a route with a reject route. Existing code appends
   a new route instead of replacing the existing one.

2. can not have a multipath route where a leg uses a dev only nexthop

Existing use cases affected by this change:
1. adding a route with existing prefix and metric using NLM_F_CREATE
   without NLM_F_APPEND or NLM_F_EXCL (ie., what iproute2 calls
   'prepend'). Existing code auto-determines that the new nexthop can
   be appended to an existing route to create a multipath route. This
   change breaks that by requiring the APPEND flag for the new route
   to be added to an existing one. Instead the prepend just adds another
   route entry.

2. route replace. Existing code replaces first matching multipath route
   if new route is multipath capable and fallback to first matching
   non-ECMP route (reject or dev only route) in case one isn't available.
   New behavior replaces first matching route. (Thanks to Ido for spotting
   this one)

Note: Newer iproute2 is needed to display multipath routes with a dev-only
  nexthop. This is due to a bug in iproute2 and parsing nexthops.

Signed-off-by: David Ahern 
---
 include/net/ip6_route.h |   6 --
 net/ipv6/ip6_fib.c  | 157 ++--
 net/ipv6/route.c|   3 +-
 3 files changed, 73 insertions(+), 93 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 4cf1ef935ed9..9e4d0f0aeb6d 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -66,12 +66,6 @@ static inline bool rt6_need_strict(const struct in6_addr 
*daddr)
(IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL | 
IPV6_ADDR_LOOPBACK);
 }
 
-static inline bool rt6_qualify_for_ecmp(const struct fib6_info *f6i)
-{
-   return (f6i->fib6_flags & (RTF_GATEWAY|RTF_ADDRCONF|RTF_DYNAMIC)) ==
-  RTF_GATEWAY;
-}
-
 void ip6_route_input(struct sk_buff *skb);
 struct dst_entry *ip6_route_input_lookup(struct net *net,
 struct net_device *dev,
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index f0a4262a4789..f57ec3bc12d1 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -927,19 +927,19 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct 
fib6_info *rt,
 {
struct fib6_info *leaf = rcu_dereference_protected(fn->leaf,
lockdep_is_held(&rt->fib6_table->tb6_lock));
-   struct fib6_info *iter = NULL;
+   struct fib6_info *iter = NULL, *match = NULL;
struct fib6_info __rcu **ins;
-   struct fib6_info __rcu **fallback_ins = NULL;
int replace = (info->nlh &&
   (info->nlh->nlmsg_flags & NLM_F_REPLACE));
+   int append = (info->nlh &&
+  (info->nlh->nlmsg_flags & NLM_F_APPEND));
int add = (!info->nlh ||
   (info->nlh->nlmsg_flags & NLM_F_CREATE));
int found = 0;
-   bool rt_can_ecmp = rt6_qualify_for_ecmp(rt);
u16 nlflags = NLM_F_EXCL;
int err;
 
-   if (info->nlh && (info->nlh->nlmsg_flags & NLM_F_APPEND))
+   if (append)
nlflags |= NLM_F_APPEND;
 
ins = &fn->leaf;
@@ -961,13 +961,8 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct 
fib6_info *rt,
 
nlflags &= ~NLM_F_EXCL;
if (replace) {
-   if (rt_can_ecmp == rt6_qualify_for_ecmp(iter)) {
-   found++;
-   break;
-   }
-   if (rt_can_ecmp)
-   fallback_ins = fallback_ins ?: ins;
-   goto next_iter;
+   found++;
+   break;
}
 
if (rt6_duplicate_nexthop(iter, rt)) {
@@ -982,86 +977,67 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct 
fib6_info *rt,
fib6_metric_set(iter, RTAX_MTU, rt->fib6_pmtu);
return -EEXIST;
}
-   /* If we have the same destination and the same metric,
-* but not the same gateway, then the route we try to
-* add is sibling to this route, increment our counter
-* of siblings, and later we will add our route to the
-* list.
-* Only static routes (which don't have flag
-* RTF_EXPIRES) are used for ECMPv6.
-*
-* To avoid long list, we only had siblings if the
-* route have a gateway.
-*/
-   if (rt_can_ecmp &&
-   rt6_qualify_

[PATCH RFC net-next 6/7] selftests: fib_tests: Add ipv6 route add append replace tests

2018-05-14 Thread David Ahern

Add IPv6 route tests covering add, append and replace permutations.
Assumes the ability to add a basic single path route works; this is
required for example when adding an address to an interface.

$ fib_tests.sh -t ipv6_rt

IPv6 route add / append tests
TEST: Attempt to add duplicate route - gw   [ OK ]
TEST: Attempt to add duplicate route - dev only [ OK ]
TEST: Attempt to add duplicate route - reject route [ OK ]
TEST: Add new nexthop for existing prefix   [ OK ]
TEST: Append leg to existing route - gw [ OK ]
TEST: Append leg to existing route - dev only   [ OK ]
TEST: Append leg to existing route - reject route   [ OK ]
TEST: Append leg to existing reject route - gw  [ OK ]
TEST: Append leg to existing reject route - dev only[ OK ]
TEST: add multipath route   [ OK ]
TEST: Attempt to add duplicate multipath route  [ OK ]
TEST: route add with different metrics  [ OK ]
TEST: route delete with metric  [ OK ]

IPv6 route replace tests
TEST: single path with single path  [ OK ]
TEST: single path with multipath[ OK ]
TEST: single path with reject route [ OK ]
TEST: single path with single path via multipath attribute  [ OK ]
TEST: invalid nexthop   [ OK ]
TEST: single path - replace of non-existent route   [ OK ]
TEST: multipath with multipath  [ OK ]
TEST: multipath with single path[ OK ]
TEST: multipath with single path via multipath attribute[ OK ]
TEST: multipath with reject route   [ OK ]
TEST: multipath - invalid first nexthop [ OK ]
TEST: multipath - invalid second nexthop[ OK ]
TEST: multipath - replace of non-existent route [ OK ]

Signed-off-by: David Ahern 
---
 tools/testing/selftests/net/fib_tests.sh | 331 ++-
 1 file changed, 330 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/fib_tests.sh 
b/tools/testing/selftests/net/fib_tests.sh
index 12b648826151..20cb1d2c4e72 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -6,7 +6,7 @@
 
 ret=0
 
-TESTS="unregister down carrier nexthop"
+TESTS="unregister down carrier nexthop ipv6_rt"
 VERBOSE=0
 PAUSE_ON_FAIL=no
 PAUSE=no
@@ -574,6 +574,334 @@ fib_nexthop_test()
 }
 
 

+# Tests on route add and replace
+
+run_cmd()
+{
+   local cmd="$1"
+   local out
+   local stderr="2>/dev/null"
+
+   if [ "$VERBOSE" = "1" ]; then
+   printf "COMMAND: $cmd\n"
+   stderr=
+   fi
+
+   out=$(eval $cmd $stderr)
+   rc=$?
+   if [ "$VERBOSE" = "1" -a -n "$out" ]; then
+   echo "$out"
+   fi
+
+   [ "$VERBOSE" = "1" ] && echo
+
+   return $rc
+}
+
+# add route for a prefix, flushing any existing routes first
+# expected to be the first step of a test
+add_route6()
+{
+   local pfx="$1"
+   local nh="$2"
+   local out
+
+   if [ "$VERBOSE" = "1" ]; then
+   echo
+   echo "##"
+   echo
+   fi
+
+   run_cmd "$IP -6 ro flush ${pfx}"
+   [ $? -ne 0 ] && exit 1
+
+   out=$($IP -6 ro ls match ${pfx})
+   if [ -n "$out" ]; then
+   echo "Failed to flush routes for prefix used for tests."
+   exit 1
+   fi
+
+   run_cmd "$IP -6 ro add ${pfx} ${nh}"
+   if [ $? -ne 0 ]; then
+   echo "Failed to add initial route for test."
+   exit 1
+   fi
+}
+
+# add initial route - used in replace route tests
+add_initial_route6()
+{
+   add_route6 "2001:db8:104::/64" "$1"
+}
+
+check_route6()
+{
+   local pfx="2001:db8:104::/64"
+   local expected="$1"
+   local out
+   local rc=0
+
+   out=$($IP -6 ro ls match ${pfx} | sed -e 's/ pref medium//')
+   if [ -z "${out}" ]; then
+   if [ "$VERBOSE" = "1" ]; then
+   printf "\nNo route entry found\n"
+   printf "Expected:\n"
+   printf "${expected}\n"
+   fi
+   return 1
+   fi
+
+   # tricky way to convert output to 1-line without ip's
+   # messy '\'; this drops all extra white space
+   out=$(ech

[PATCH RFC net-next 0/7] net/ipv6: Fix route append and replace use cases

2018-05-14 Thread David Ahern

This patch set fixes a few append and replace uses cases for IPv6 and
adds test cases that codifies the expectations of how append and replace
are expected to work. In paricular it allows a multipath route to have
a dev-only nexthop, something Thomas tried to accomplish with commit
edd7ceb78296 ("ipv6: Allow non-gateway ECMP for IPv6") which had to be
reverted because of breakage, and to replace an existing FIB entry
with a reject route.

There are a number of inconsistent and surprising aspects to the Linux
API for adding, deleting, replacing and changing FIB entries. For example,
with IPv4 NLM_F_APPEND means insert the route after any existing entries
with the same key (prefix + priority + TOS for IPv4) and NLM_F_CREATE
without the append flag inserts the new route before any existing entries.

IPv6 on the other hand attempts to guess whether a new route should be
appended to an existing one, possibly creating a multipath route, or to
add a new entry after any existing ones. This applies to both the 'append'
(NLM_F_CREATE + NLM_F_APPEND) and 'prepend' (NLM_F_CREATE only) cases
meaning for IPv6 the NLM_F_APPEND is basically ignored. This guessing
whether the route should be added to a multipath route (gateway routes)
or inserted after existing entries (non-gateway based routes) means a
multipath route can not have a dev only nexthop (potentially required in
some cases - tunnels or VRF route leaking for example) and route 'replace'
is a bit adhoc treating gateway based routes and dev-only / reject routes
differently.

This has led to frustration with developers working on routing suites
such as FRR where workarounds such as delete and add.

After this patch set there are 2 differences between IPv4 and IPv6:
1. 'ip ro prepend' = NLM_F_CREATE only
IPv4 adds the new route before any existing ones
IPv6 adds new route after any existing ones

2. 'ip ro append' = NLM_F_CREATE|NLM_F_APPEND
   IPv4 adds the new route after any existing ones
   IPv6 adds the nexthop to existing routes converting to multipath

For the former, there are cases where we want same prefix routes added
after existing ones (e.g., multicast, prefix routes for macvlan when used
for virtual router redundancy). Requiring the APPEND flag to add a new
route to an existing one helps here but is a slight change in behavior
since prepend with gateway routes now create a separate entry.

For the latter IPv6 behavior is preferred - appending a route for the same
prefix and metric to make a multipath route, so really IPv4 not allowing an
existing route to be updated is the limiter. This will be fixed when
nexthops become separate objects - a future patch set.

Thank you to Thomas and Ido for testing earlier versions of this set, and
to Ido for providing an update to the mlxsw driver.

David Ahern (7):
  mlxsw: spectrum_router: Add support for route append
  net/ipv6: Simplify appending route into multipath route
  selftests: fib_tests: Add success-fail counts
  selftests: fib_tests: Add command line options
  selftests: fib_tests: Add option to pause after each test
  selftests: fib_tests: Add ipv6 route add append replace tests
  selftests: fib_tests: Add ipv4 route add append replace tests

 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |   2 +
 include/net/ip6_route.h|   6 -
 net/ipv6/ip6_fib.c | 157 +++--
 net/ipv6/route.c   |   3 +-
 tools/testing/selftests/net/fib_tests.sh   | 673 -
 5 files changed, 737 insertions(+), 104 deletions(-)

-- 
2.11.0

[PATCH RFC net-next 1/7] mlxsw: spectrum_router: Add support for route append

2018-05-14 Thread David Ahern

Handle append for gateway based routes. Dev-only multipath routes will
be handled by a follow on patch.

Signed-off-by: Ido Schimmel 
Signed-off-by: David Ahern 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 8028d221aece..77b2adb29341 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -5725,6 +5725,7 @@ static void mlxsw_sp_router_fib6_event_work(struct 
work_struct *work)
 
switch (fib_work->event) {
case FIB_EVENT_ENTRY_REPLACE: /* fall through */
+   case FIB_EVENT_ENTRY_APPEND: /* fall through */
case FIB_EVENT_ENTRY_ADD:
replace = fib_work->event == FIB_EVENT_ENTRY_REPLACE;
err = mlxsw_sp_router_fib6_add(mlxsw_sp,
@@ -5831,6 +5832,7 @@ static void mlxsw_sp_router_fib6_event(struct 
mlxsw_sp_fib_event_work *fib_work,
 
switch (fib_work->event) {
case FIB_EVENT_ENTRY_REPLACE: /* fall through */
+   case FIB_EVENT_ENTRY_APPEND: /* fall through */
case FIB_EVENT_ENTRY_ADD: /* fall through */
case FIB_EVENT_ENTRY_DEL:
fen6_info = container_of(info, struct fib6_entry_notifier_info,
-- 
2.11.0

[PATCH RFC net-next 4/7] selftests: fib_tests: Add command line options

2018-05-14 Thread David Ahern

Add command line options for controlling pause on fail, controlling
specific tests to run and verbose mode rather than relying on environment
variables.

Signed-off-by: David Ahern 
---
 tools/testing/selftests/net/fib_tests.sh | 53 
 1 file changed, 40 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/net/fib_tests.sh 
b/tools/testing/selftests/net/fib_tests.sh
index 7e2291161e15..8c99f0689efc 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -6,8 +6,9 @@
 
 ret=0
 
-VERBOSE=${VERBOSE:=0}
-PAUSE_ON_FAIL=${PAUSE_ON_FAIL:=no}
+TESTS="unregister down carrier nexthop"
+VERBOSE=0
+PAUSE_ON_FAIL=no
 IP="ip -netns testns"
 
 log_test()
@@ -565,20 +566,36 @@ fib_nexthop_test()
 }
 
 

-#
+# usage
 
-fib_test()
+usage()
 {
-   if [ -n "$TEST" ]; then
-   eval $TEST
-   else
-   fib_unreg_test
-   fib_down_test
-   fib_carrier_test
-   fib_nexthop_test
-   fi
+   cat <   Test(s) to run (default: all)
+(options: $TESTS)
+-p  Pause on fail
+-v  verbose mode (show commands and output)
+EOF
 }
 
+
+# main
+
+while getopts :t:pPhv o
+do
+   case $o in
+   t) TESTS=$OPTARG;;
+   p) PAUSE_ON_FAIL=yes;;
+   v) VERBOSE=$(($VERBOSE + 1));;
+   h) usage; exit 0;;
+   *) usage; exit 1;;
+   esac
+done
+
+PEER_CMD="ip netns exec ${PEER_NS}"
+
 if [ "$(id -u)" -ne 0 ];then
echo "SKIP: Need root privileges"
exit 0
@@ -598,7 +615,17 @@ fi
 # start clean
 cleanup &> /dev/null
 
-fib_test
+for t in $TESTS
+do
+   case $t in
+   fib_unreg_test|unregister)  fib_unreg_test;;
+   fib_down_test|down) fib_down_test;;
+   fib_carrier_test|carrier)   fib_carrier_test;;
+   fib_nexthop_test|nexthop)   fib_nexthop_test;;
+
+   help) echo "Test names: $TESTS"; exit 0;;
+   esac
+done
 
 if [ "$TESTS" != "none" ]; then
printf "\nTests passed: %3d\n" ${nsuccess}
-- 
2.11.0

[PATCH RFC net-next 7/7] selftests: fib_tests: Add ipv4 route add append replace tests

2018-05-14 Thread David Ahern

Add IPv4 route tests covering add, append and replace permutations.
Assumes the ability to add a basic single path route works; this is
required for example when adding an address to an interface.

$ fib_tests.sh -t ipv4_rt
IPv4 route add / append tests
TEST: Attempt to add duplicate route - gw   [ OK ]
TEST: Attempt to add duplicate route - dev only [ OK ]
TEST: Attempt to add duplicate route - reject route [ OK ]
TEST: Add new nexthop for existing prefix   [ OK ]
TEST: Append leg to existing route - gw [ OK ]
TEST: Append leg to existing route - dev only   [ OK ]
TEST: Append leg to existing route - reject route   [ OK ]
TEST: Append leg to existing reject route - gw  [ OK ]
TEST: Append leg to existing reject route - dev only[ OK ]
TEST: add multipath route   [ OK ]
TEST: Attempt to add duplicate multipath route  [ OK ]
TEST: route add with different metrics  [ OK ]
TEST: route delete with metric  [ OK ]

IPv4 route replace tests
TEST: single path with single path  [ OK ]
TEST: single path with multipath[ OK ]
TEST: single path with reject route [ OK ]
TEST: single path with single path via multipath attribute  [ OK ]
TEST: invalid nexthop   [ OK ]
TEST: single path - replace of non-existent route   [ OK ]
TEST: multipath with multipath  [ OK ]
TEST: multipath with single path[ OK ]
TEST: multipath with single path via multipath attribute[ OK ]
TEST: multipath with reject route   [ OK ]
TEST: multipath - invalid first nexthop [ OK ]
TEST: multipath - invalid second nexthop[ OK ]
TEST: multipath - replace of non-existent route [ OK ]

Signed-off-by: David Ahern 
---
 tools/testing/selftests/net/fib_tests.sh | 277 ++-
 1 file changed, 276 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/fib_tests.sh 
b/tools/testing/selftests/net/fib_tests.sh
index 20cb1d2c4e72..19248453a2ba 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -6,7 +6,7 @@
 
 ret=0
 
-TESTS="unregister down carrier nexthop ipv6_rt"
+TESTS="unregister down carrier nexthop ipv6_rt ipv4_rt"
 VERBOSE=0
 PAUSE_ON_FAIL=no
 PAUSE=no
@@ -702,6 +702,12 @@ route_setup()
$IP -6 addr add 2001:db8:103::2/64 dev veth4
$IP -6 addr add 2001:db8:104::1/64 dev dummy1
 
+   $IP addr add 172.16.101.1/24 dev veth1
+   $IP addr add 172.16.101.2/24 dev veth2
+   $IP addr add 172.16.103.1/24 dev veth3
+   $IP addr add 172.16.103.2/24 dev veth4
+   $IP addr add 172.16.104.1/24 dev dummy1
+
set +ex
 }
 
@@ -901,6 +907,274 @@ ipv6_route_test()
route_cleanup
 }
 
+# add route for a prefix, flushing any existing routes first
+# expected to be the first step of a test
+add_route()
+{
+   local pfx="$1"
+   local nh="$2"
+   local out
+
+   if [ "$VERBOSE" = "1" ]; then
+   echo
+   echo "##"
+   echo
+   fi
+
+   run_cmd "$IP ro flush ${pfx}"
+   [ $? -ne 0 ] && exit 1
+
+   out=$($IP ro ls match ${pfx})
+   if [ -n "$out" ]; then
+   echo "Failed to flush routes for prefix used for tests."
+   exit 1
+   fi
+
+   run_cmd "$IP ro add ${pfx} ${nh}"
+   if [ $? -ne 0 ]; then
+   echo "Failed to add initial route for test."
+   exit 1
+   fi
+}
+
+# add initial route - used in replace route tests
+add_initial_route()
+{
+   add_route "172.16.104.0/24" "$1"
+}
+
+check_route()
+{
+   local pfx="172.16.104.0/24"
+   local expected="$1"
+   local out
+   local rc=0
+
+   out=$($IP ro ls match ${pfx})
+   if [ -z "${out}" ]; then
+   if [ "$VERBOSE" = "1" ]; then
+   printf "\nNo route entry found\n"
+   printf "Expected:\n"
+   printf "${expected}\n"
+   fi
+   return 1
+   fi
+
+   # tricky way to convert output to 1-line without ip's
+   # messy '\'; this drops all extra white space
+   out=$(echo ${out})
+   if [ "${out}" != "${expected}" ]; then
+   rc=1
+   if [ "${VERBOSE}" = "1" ]; then
+

Re: [PATCH net-next] cxgb4: add tc flower match support for tunnel VNI

2018-05-14 Thread David Miller

From: Ganesh Goudar 
Date: Mon, 14 May 2018 17:51:21 +0530

> From: Kumar Sanghvi 
> 
> Adds support for matching flows based on tunnel VNI value.
> Introduces fw APIs for allocating/removing MPS entries related
> to encapsulation. And uses the same while adding/deleting filters
> for offloading flows based on tunnel VNI match.
> 
> Signed-off-by: Kumar Sanghvi 
> Signed-off-by: Ganesh Goudar 

Applied, thank you.

Re: [PATCH net 2/2] vmxnet3: use DMA memory barriers where required

2018-05-14 Thread David Miller

From: 
Date: Mon, 14 May 2018 08:14:49 -0400

> The gen bits must be read first from (resp. written last to) DMA memory.
> The proper way to enforce this on Linux is to call dma_rmb() (resp.
> dma_wmb()).
> 
> Signed-off-by: Regis Duchesne 
> Acked-by: Ronak Doshi 

Applied and queued up for -stable.

Re: [PATCH net 1/2] vmxnet3: set the DMA mask before the first DMA map operation

2018-05-14 Thread David Miller

From: 
Date: Mon, 14 May 2018 08:28:26 -0400

> The DMA mask must be set before, not after, the first DMA map operation, or
> the first DMA map operation could in theory fail on some systems.
> 
> Fixes: b0eb57cb97e78 ("VMXNET3: Add support for virtual IOMMU")
> Signed-off-by: Regis Duchesne 
> Acked-by: Ronak Doshi 

Applied and queued up for -stable.

Re: [PATCH net] cxgb4: Correct ntuple mask validation for hash filters

2018-05-14 Thread David Miller

From: Ganesh Goudar 
Date: Mon, 14 May 2018 16:27:34 +0530

> From: Kumar Sanghvi 
> 
> Earlier code of doing bitwise AND with field width bits was wrong.
> Instead, simplify code to calculate ntuple_mask based on supplied
> fields and then compare with mask configured in hw - which is the
> correct and simpler way to validate ntuple mask.
> 
> Fixes: 3eb8b62d5a26 ("cxgb4: add support to create hash-filters via tc-flower 
> offload")
> Signed-off-by: Kumar Sanghvi 
> Signed-off-by: Ganesh Goudar 

Applied and queued up for -stable.

Re: [PATCH net-next] net: stmmac: Add Jose Abreu as co-maintainer

2018-05-14 Thread David Miller

From: Jose Abreu 
Date: Mon, 14 May 2018 10:29:56 +0100

> I'm offering to be a co-maintainer for stmmac driver.
> 
> As per discussion with Alexandre, I will arranje to get STM32 boards to
> test patches in GMAC version 3.x and 4.1. I also have HW to test GMAC
> version 5.
> 
> Looking forward to contribute to net-dev!
> 
> Signed-off-by: Jose Abreu 

Applied with commit message typo fixed.

Re: [RFC bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type

2018-05-14 Thread Alexei Starovoitov

On Wed, May 09, 2018 at 02:07:02PM -0700, Joe Stringer wrote:
> Teach the verifier a little bit about a new type of pointer, a
> PTR_TO_SOCKET. This pointer type is accessed from BPF through the
> 'struct bpf_sock' structure.
> 
> Signed-off-by: Joe Stringer 
> ---
>  include/linux/bpf.h  | 19 +-
>  include/linux/bpf_verifier.h |  2 ++
>  kernel/bpf/verifier.c| 86 
> ++--
>  net/core/filter.c| 30 +---
>  4 files changed, 114 insertions(+), 23 deletions(-)

Ack for patches 1-3. In this one few nits:

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index a38e474bf7ee..a03b4b0edcb6 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -136,7 +136,7 @@ enum bpf_arg_type {
>   /* the following constraints used to prototype bpf_memcmp() and other
>* functions that access data on eBPF program stack
>*/
> - ARG_PTR_TO_MEM, /* pointer to valid memory (stack, packet, map 
> value) */
> + ARG_PTR_TO_MEM, /* pointer to valid memory (stack, packet, map 
> value, socket) */

I don't see where in this patch this change happens...

>   ARG_PTR_TO_MEM_OR_NULL, /* pointer to valid memory or NULL */
>   ARG_PTR_TO_UNINIT_MEM,  /* pointer to memory does not need to be 
> initialized,
>* helper function must fill all bytes or clear
> @@ -148,6 +148,7 @@ enum bpf_arg_type {
>  
>   ARG_PTR_TO_CTX, /* pointer to context */
>   ARG_ANYTHING,   /* any (initialized) argument is ok */
> + ARG_PTR_TO_SOCKET,  /* pointer to bpf_sock */
>  };
>  
>  /* type of values returned from helper functions */
> @@ -155,6 +156,7 @@ enum bpf_return_type {
>   RET_INTEGER,/* function returns integer */
>   RET_VOID,   /* function doesn't return anything */
>   RET_PTR_TO_MAP_VALUE_OR_NULL,   /* returns a pointer to map elem value 
> or NULL */
> + RET_PTR_TO_SOCKET_OR_NULL,  /* returns a pointer to a socket or 
> NULL */
>  };
>  
>  /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF 
> programs
> @@ -205,6 +207,8 @@ enum bpf_reg_type {
>   PTR_TO_PACKET_META,  /* skb->data - meta_len */
>   PTR_TO_PACKET,   /* reg points to skb->data */
>   PTR_TO_PACKET_END,   /* skb->data + headlen */
> + PTR_TO_SOCKET,   /* reg points to struct bpf_sock */
> + PTR_TO_SOCKET_OR_NULL,   /* reg points to struct bpf_sock or NULL */
>  };
>  
>  /* The information passed from prog-specific *_is_valid_access
> @@ -326,6 +330,11 @@ const struct bpf_func_proto 
> *bpf_get_trace_printk_proto(void);
>  
>  typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
>   unsigned long off, unsigned long len);
> +typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type,
> + const struct bpf_insn *src,
> + struct bpf_insn *dst,
> + struct bpf_prog *prog,
> + u32 *target_size);
>  
>  u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 
> meta_size,
>void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
> @@ -729,4 +738,12 @@ extern const struct bpf_func_proto 
> bpf_sock_map_update_proto;
>  void bpf_user_rnd_init_once(void);
>  u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>  
> +bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
> +   struct bpf_insn_access_aux *info);
> +u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
> + const struct bpf_insn *si,
> + struct bpf_insn *insn_buf,
> + struct bpf_prog *prog,
> + u32 *target_size);
> +
>  #endif /* _LINUX_BPF_H */
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index a613b52ce939..9dcd87f1d322 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -57,6 +57,8 @@ struct bpf_reg_state {
>* offset, so they can share range knowledge.
>* For PTR_TO_MAP_VALUE_OR_NULL this is used to share which map value we
>* came from, when one is tested for != NULL.
> +  * For PTR_TO_SOCKET this is used to share which pointers retain the
> +  * same reference to the socket, to determine proper reference freeing.
>*/
>   u32 id;
>   /* Ordering of fields matters.  See states_equal() */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 1b31b805dea4..d38c7c1e9da6 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -80,8 +80,8 @@ static const struct bpf_verifier_ops * const 
> bpf_verifier_ops[] = {
>   * (like pointer plus pointer becomes SCALA

Re: [PATCH v1 2/4] media: bpf: allow raw IR decoder bpf programs to be used

2018-05-14 Thread kbuild test robot

Hi Sean,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc5]
[cannot apply to next-20180514]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Sean-Young/media-rc-introduce-BPF_PROG_IR_DECODER/20180515-093234
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

>> drivers/media/rc/lirc_dev.c:32:10: fatal error: linux/bpf-rcdev.h: No such 
>> file or directory
#include 
 ^~~
   compilation terminated.
--
>> kernel/bpf/syscall.c:30:10: fatal error: linux/bpf-rcdev.h: No such file or 
>> directory
#include 
 ^~~
   compilation terminated.

vim +32 drivers/media/rc/lirc_dev.c

31  
  > 32  #include 
33  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

Re: [PATCH bpf-next v2 0/8] Minor follow-up cleanups in BPF JITs and optimized imm emission

2018-05-14 Thread Alexei Starovoitov

On Mon, May 14, 2018 at 11:22:26PM +0200, Daniel Borkmann wrote:
> This series follows up mostly with with some minor cleanups on top
> of 'Move ld_abs/ld_ind to native BPF' as well as implements better
> 32/64 bit immediate load into register and saves tail call init on
> cBPF for the arm64 JIT. Last but not least we add a couple of test
> cases. For details please see individual patches. Thanks!
> 
> v1 -> v2:
>   - Minor fix in i64_i16_blocks() to remove 24 shift.
>   - Added last two patches.
>   - Added Acks from prior round.

Applied, thanks.

Re: [PATCH bpf-next v2 8/8] bpf: add ld64 imm test cases

2018-05-14 Thread Y Song

On Mon, May 14, 2018 at 2:22 PM, Daniel Borkmann  wrote:
> Add test cases where we combine semi-random imm values, mainly for testing
> JITs when they have different encoding options for 64 bit immediates in
> order to reduce resulting image size.
>
> Signed-off-by: Daniel Borkmann 

Acked-by: Yonghong Song

Re: [PATCH v1 3/4] media: rc bpf: move ir_raw_event to uapi

2018-05-14 Thread kbuild test robot

Hi Sean,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc5]
[cannot apply to next-20180514]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Sean-Young/media-rc-introduce-BPF_PROG_IR_DECODER/20180515-093234
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

>> ./usr/include/linux/bpf_rcdev.h:13: found __[us]{8,16,32,64} type without 
>> #include 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip

Re: [PATCH v4 1/1] drivers core: multi-threading device shutdown

2018-05-14 Thread Pavel Tatashin

Hi Andy,

Thank you for your comments. I will send an updated patch soon. My replies are 
below:

On 05/14/2018 04:04 PM, Andy Shevchenko wrote:

> Can we still preserve an order here? (Yes, even if the entire list is
> not fully ordered)
> In the context I see it would go before netdevice.h.

Sure, I will move kthread.h.

>> +static struct device *
>> +device_get_child_by_index(struct device *parent, int index)
>> +{
>> +   struct klist_iter i;
>> +   struct device *dev = NULL, *d;
>> +   int child_index = 0;
>> +
>> +   if (!parent->p || index < 0)
>> +   return NULL;
>> +
>> +   klist_iter_init(&parent->p->klist_children, &i);
>> +   while ((d = next_device(&i))) {
>> +   if (child_index == index) {
>> +   dev = d;
>> +   break;
>> +   }
>> +   child_index++;
>> +   }
>> +   klist_iter_exit(&i);
>> +
>> +   return dev;
>> +}
> 
> This can be implemented as a subfunction to device_find_child(), can't it be?

Yes, but that would make it very inefficient to search for an index in a list 
via function pointer call.

> 
>> +/**
> 
> Hmm... Why it's marked as kernel doc while it's just a plain comment?
> Same applies to the rest of similar comments.

Fixed this, thanks!

> 
>> +   for (i = 0; i < children_count; i++) {
>> +   if (device_shutdown_serial) {
>> +   device_shutdown_child_task(&tdata);
>> +   } else {
>> +   kthread_run(device_shutdown_child_task,
>> +   &tdata, "device_shutdown.%s",
>> +   dev_name(dev));
>> +   }
>> +   }
> 
> Can't we just use device_for_each_child() instead?

No, at least without doing some memory allocation. Notice in this loop we are 
not traversing through children, instead we are starting number of children 
threads, and each thread finds a child to work on. Otherwise we would have to 
pass child pointer via argument, and we would need to keep that argument in 
some memory.

Pavel

Re: [PATCH 05/14] net: sched: always take reference to action

2018-05-14 Thread kbuild test robot

Hi Vlad,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net/master]
[also build test WARNING on v4.17-rc5 next-20180514]
[cannot apply to net-next/master]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Vlad-Buslov/Modify-action-API-for-implementing-lockless-actions/20180515-025420
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   net/sched/act_api.c:71:15: sparse: incorrect type in initializer (different 
address spaces) @@expected struct tc_cookie [noderef] *__ret @@
got [noderef] *__ret @@
   net/sched/act_api.c:71:15:expected struct tc_cookie [noderef] 
*__ret
   net/sched/act_api.c:71:15:got struct tc_cookie *new_cookie
   net/sched/act_api.c:71:13: sparse: incorrect type in assignment (different 
address spaces) @@expected struct tc_cookie *old @@got struct tc_cookie 
[noderef] *[assigned] __ret
>> net/sched/act_api.c:287:6: sparse: symbol '__tcf_idr_check' was not 
>> declared. Should it be static?
   net/sched/act_api.c:144:48: sparse: dereference of noderef expression

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation

[RFC PATCH] net: sched: __tcf_idr_check() can be static

2018-05-14 Thread kbuild test robot


Fixes: 446adedb5339 ("net: sched: always take reference to action")
Signed-off-by: Fengguang Wu 
---
 act_api.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 9459cce..27e80cf 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -284,8 +284,8 @@ int tcf_generic_walker(struct tc_action_net *tn, struct 
sk_buff *skb,
 }
 EXPORT_SYMBOL(tcf_generic_walker);
 
-bool __tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a,
-int bind)
+static bool __tcf_idr_check(struct tc_action_net *tn, u32 index, struct 
tc_action **a,
+   int bind)
 {
struct tcf_idrinfo *idrinfo = tn->idrinfo;
struct tc_action *p;

Re: [kbuild-all] [PATCH net-next 2/2] sctp: add sctp_make_op_error_limited and reuse inner functions

2018-05-14 Thread Ye Xiaolong

On 05/14, Marcelo Ricardo Leitner wrote:
>On Mon, May 14, 2018 at 07:47:20PM +0800, Ye Xiaolong wrote:
>> On 05/14, Marcelo Ricardo Leitner wrote:
>> >On Mon, May 14, 2018 at 03:40:53PM +0800, Ye Xiaolong wrote:
>> >> >> config: x86_64-randconfig-x006-201817 (attached as .config)
>> >> >> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
>> >> >> reproduce:
>> >> >> # save the attached .config to linux build tree
>> >> >> make ARCH=x86_64
>> >> >>
>> >> >> All errors (new ones prefixed by >>):
>> >> >>
>> >> >>net//sctp/sm_make_chunk.c: In function 'sctp_make_op_error_limited':
>> >> >> >> net//sctp/sm_make_chunk.c:1260:9: error: implicit declaration of 
>> >> >> >> function 'sctp_mtu_payload'; did you mean 'sctp_do_peeloff'? 
>> >> >> >> [-Werror=implicit-function-declaration]
>> >> >>  size = sctp_mtu_payload(sp, size, sizeof(struct sctp_errhdr));
>> >> >> ^~~~
>> >> >> sctp_do_peeloff
>> >> >>cc1: some warnings being treated as errors
>> >> >
>> >> >Seems the test didn't pick up the MTU refactor patchset yet.
>> >>
>> >> Do you mean your patchset require MTU refactor patchset as prerequisites?
>> >
>> >Yes.
>>
>> Then it is recommended to use '--base' option of git format-patch, it would 
>> record
>> the base tree info in the first patch or cover letter, 0day bot would apply 
>> your
>> patchset to right base according to it.
>
>Nice. I wasn't aware of it. Thanks.
>
>Considering that the MTU refactor patchset was already applied on
>net-next when the bot did the test, why should I have to specify the
>base?

Could you share me the subjects or commits of MTU refactor patcheset, I'll 
double
check what was wrong.

Thanks,
Xiaolong
>
>  Marcelo

Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

2018-05-14 Thread Joonsoo Kim

Hello, Mikulas.

On Tue, Apr 24, 2018 at 02:41:47PM -0400, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Matthew Wilcox wrote:
> 
> > On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Mon, 23 Apr 2018, Matthew Wilcox wrote:
> > > 
> > > > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > > > > Some bugs (such as buffer overflows) are better detected
> > > > > with kmalloc code, so we must test the kmalloc path too.
> > > > 
> > > > Well now, this brings up another item for the collective TODO list --
> > > > implement redzone checks for vmalloc.  Unless this is something already
> > > > taken care of by kasan or similar.
> > > 
> > > The kmalloc overflow testing is also not ideal - it rounds the size up to 
> > > the next slab size and detects buffer overflows only at this boundary.
> > > 
> > > Some times ago, I made a "kmalloc guard" patch that places a magic number 
> > > immediatelly after the requested size - so that it can detect overflows 
> > > at 
> > > byte boundary 
> > > ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
> > > 
> > > That patch found a bug in crypto code:
> > > ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
> > 
> > Is it still worth doing this, now we have kasan?
> 
> The kmalloc guard has much lower overhead than kasan.

I skimm at your code and it requires rebuilding the kernel.
I think that if rebuilding is required as the same with the KASAN,
using the KASAN is better since it has far better coverage for
detection the bug.

However, I think that if the redzone can be setup tightly
without rebuild, it would be worth implementing. I have an idea to
implement it only for the SLUB. Could I try it? (I'm asking this
because I'm inspired from the above patch.) :)
Or do you wanna try it?

Thanks.

Re: [PATCH bpf-next] selftests/bpf: make sure build-id is on

2018-05-14 Thread Y Song

On Mon, May 14, 2018 at 5:11 PM, Alexei Starovoitov  wrote:
> --build-id may not be a default linker config.
> Make sure it's used when linking urandom_read test program.
> Otherwise test_stacktrace_build_id[_nmi] tests will be failling.
>
> Signed-off-by: Alexei Starovoitov 

Tested and the change looks good.
Acked-by: Yonghong Song 

> ---
>  tools/testing/selftests/bpf/Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/bpf/Makefile 
> b/tools/testing/selftests/bpf/Makefile
> index 438d4f93875b..133ebc68cbe4 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -19,7 +19,7 @@ all: $(TEST_CUSTOM_PROGS)
>  $(TEST_CUSTOM_PROGS): urandom_read
>
>  urandom_read: urandom_read.c
> -   $(CC) -o $(TEST_CUSTOM_PROGS) -static $<
> +   $(CC) -o $(TEST_CUSTOM_PROGS) -static $< -Wl,--build-id
>
>  # Order correspond to 'make run_tests' order
>  TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map 
> test_progs \
> --
> 2.9.5
>

[PATCH bpf-next] selftests/bpf: make sure build-id is on

2018-05-14 Thread Alexei Starovoitov

--build-id may not be a default linker config.
Make sure it's used when linking urandom_read test program.
Otherwise test_stacktrace_build_id[_nmi] tests will be failling.

Signed-off-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 438d4f93875b..133ebc68cbe4 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -19,7 +19,7 @@ all: $(TEST_CUSTOM_PROGS)
 $(TEST_CUSTOM_PROGS): urandom_read
 
 urandom_read: urandom_read.c
-   $(CC) -o $(TEST_CUSTOM_PROGS) -static $<
+   $(CC) -o $(TEST_CUSTOM_PROGS) -static $< -Wl,--build-id
 
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map 
test_progs \
-- 
2.9.5

[PATCH net-next] erspan: set bso bit based on mirrored packet's len

2018-05-14 Thread William Tu

Before the patch, the erspan BSO bit (Bad/Short/Oversized) is not
handled.  BSO has 4 possible values:
  00 --> Good frame with no error, or unknown integrity
  11 --> Payload is a Bad Frame with CRC or Alignment Error
  01 --> Payload is a Short Frame
  10 --> Payload is an Oversized Frame

Based the short/oversized definitions in RFC1757, the patch sets
the bso bit based on the mirrored packet's size.

Reported-by: Xiaoyan Jin 
Signed-off-by: William Tu 
---
 include/net/erspan.h | 25 +
 1 file changed, 25 insertions(+)

diff --git a/include/net/erspan.h b/include/net/erspan.h
index d044aa60cc76..5eb95f78ad45 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -219,6 +219,30 @@ static inline __be32 erspan_get_timestamp(void)
return htonl((u32)h_usecs);
 }
 
+/* ERSPAN BSO (Bad/Short/Oversized)
+ *   00b --> Good frame with no error, or unknown integrity
+ *   01b --> Payload is a Short Frame
+ *   10b --> Payload is an Oversized Frame
+ *   11b --> Payload is a Bad Frame with CRC or Alignment Error
+ */
+enum erspan_bso {
+   BSO_NOERROR,
+   BSO_SHORT,
+   BSO_OVERSIZED,
+   BSO_BAD,
+};
+
+static inline u8 erspan_detect_bso(struct sk_buff *skb)
+{
+   if (skb->len < ETH_ZLEN)
+   return BSO_SHORT;
+
+   if (skb->len > ETH_FRAME_LEN)
+   return BSO_OVERSIZED;
+
+   return BSO_NOERROR;
+}
+
 static inline void erspan_build_header_v2(struct sk_buff *skb,
  u32 id, u8 direction, u16 hwid,
  bool truncate, bool is_ipv4)
@@ -248,6 +272,7 @@ static inline void erspan_build_header_v2(struct sk_buff 
*skb,
vlan_tci = ntohs(qp->tci);
}
 
+   bso = erspan_detect_bso(skb);
skb_push(skb, sizeof(*ershdr) + ERSPAN_V2_MDSIZE);
ershdr = (struct erspan_base_hdr *)skb->data;
memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V2_MDSIZE);
-- 
2.7.4

Re: [PATCH net-next 3/3] udp: only use paged allocation with scatter-gather

2018-05-14 Thread Eric Dumazet



On 05/14/2018 04:30 PM, Willem de Bruijn wrote:

> I don't quite follow. The reported crash happens in the protocol layer,
> because of this check. With pagedlen we have not allocated
> sufficient space for the skb_put.
> 
> if (!(rt->dst.dev->features&NETIF_F_SG)) {
> unsigned int off;
> 
> off = skb->len;
> if (getfrag(from, skb_put(skb, copy),
> offset, copy, off, skb) < 0) {
> __skb_trim(skb, off);
> err = -EFAULT;
> goto error;
> }
> } else {
> int i = skb_shinfo(skb)->nr_frags;
> 
> Are you referring to a separate potential issue in the gso layer?
> If a bonding device advertises SG, but a slave does not, then
> skb_segment on the slave should build linear segs? I have not
> tested that.

Given that the device attribute could change under us, we need to not
crash, even if initially we thought NETIF_F_SG was available.

Unless you want to hold RTNL in UDP xmit :)

Ideally, GSO should be always on, as we did for TCP.

Otherwise, I can guarantee syzkaller will hit again.

Re: [PATCH 01/14] net: sched: use rcu for action cookie update

2018-05-14 Thread kbuild test robot

Hi Vlad,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net/master]
[also build test WARNING on v4.17-rc5 next-20180514]
[cannot apply to net-next/master]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Vlad-Buslov/Modify-action-API-for-implementing-lockless-actions/20180515-025420
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> net/sched/act_api.c:71:15: sparse: incorrect type in initializer (different 
>> address spaces) @@expected struct tc_cookie [noderef] *__ret @@   
>>  got [noderef] *__ret @@
   net/sched/act_api.c:71:15:expected struct tc_cookie [noderef] 
*__ret
   net/sched/act_api.c:71:15:got struct tc_cookie *new_cookie
>> net/sched/act_api.c:71:13: sparse: incorrect type in assignment (different 
>> address spaces) @@expected struct tc_cookie *old @@got struct 
>> tc_cookie [noderef] *[assigned] __ret
>> net/sched/act_api.c:132:48: sparse: dereference of noderef expression

vim +71 net/sched/act_api.c

65  
66  static void tcf_set_action_cookie(struct tc_cookie __rcu **old_cookie,
67struct tc_cookie *new_cookie)
68  {
69  struct tc_cookie *old;
70  
  > 71  old = xchg(old_cookie, new_cookie);
72  if (old)
73  call_rcu(&old->rcu, tcf_free_cookie_rcu);
74  }
75  
76  /* XXX: For standalone actions, we don't need a RCU grace period 
either, because
77   * actions are always connected to filters and filters are already 
destroyed in
78   * RCU callbacks, so after a RCU grace period actions are already 
disconnected
79   * from filters. Readers later can not find us.
80   */
81  static void free_tcf(struct tc_action *p)
82  {
83  free_percpu(p->cpu_bstats);
84  free_percpu(p->cpu_qstats);
85  
86  tcf_set_action_cookie(&p->act_cookie, NULL);
87  if (p->goto_chain)
88  tcf_action_goto_chain_fini(p);
89  
90  kfree(p);
91  }
92  
93  static void tcf_idr_remove(struct tcf_idrinfo *idrinfo, struct 
tc_action *p)
94  {
95  spin_lock_bh(&idrinfo->lock);
96  idr_remove(&idrinfo->action_idr, p->tcfa_index);
97  spin_unlock_bh(&idrinfo->lock);
98  gen_kill_estimator(&p->tcfa_rate_est);
99  free_tcf(p);
   100  }
   101  
   102  int __tcf_idr_release(struct tc_action *p, bool bind, bool strict)
   103  {
   104  int ret = 0;
   105  
   106  ASSERT_RTNL();
   107  
   108  if (p) {
   109  if (bind)
   110  p->tcfa_bindcnt--;
   111  else if (strict && p->tcfa_bindcnt > 0)
   112  return -EPERM;
   113  
   114  p->tcfa_refcnt--;
   115  if (p->tcfa_bindcnt <= 0 && p->tcfa_refcnt <= 0) {
   116  if (p->ops->cleanup)
   117  p->ops->cleanup(p);
   118  tcf_idr_remove(p->idrinfo, p);
   119  ret = ACT_P_DELETED;
   120  }
   121  }
   122  
   123  return ret;
   124  }
   125  EXPORT_SYMBOL(__tcf_idr_release);
   126  
   127  static size_t tcf_action_shared_attrs_size(const struct tc_action *act)
   128  {
   129  u32 cookie_len = 0;
   130  
   131  if (act->act_cookie)
 > 132  cookie_len = nla_total_size(act->act_cookie->len);
   133  
   134  return  nla_total_size(0) /* action number nested */
   135  + nla_total_size(IFNAMSIZ) /* TCA_ACT_KIND */
   136  + cookie_len /* TCA_ACT_COOKIE */
   137  + nla_total_size(0) /* TCA_ACT_STATS nested */
   138  /* TCA_STATS_BASIC */
   139  + nla_total_size_64bit(sizeof(struct gnet_stats_basic))
   140  /* TCA_STATS_QUEUE */
   141  + nla_total_size_64bit(sizeof(struct gnet_stats_queue))
   142  + nla_total_size(0) /* TCA_OPTIONS nested */
   143  + nla_total_size(sizeof(struct tcf_t)); /* TCA_GACT_TM 
*/
   144  }
   145  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation

Re: [PATCH net-next 3/3] udp: only use paged allocation with scatter-gather

2018-05-14 Thread Willem de Bruijn

On Mon, May 14, 2018 at 7:12 PM, Eric Dumazet  wrote:
>
>
> On 05/14/2018 04:07 PM, Willem de Bruijn wrote:
>> From: Willem de Bruijn 
>>
>> Paged allocation stores most payload in skb frags. This helps udp gso
>> by avoiding copying from the gso skb to segment skb in skb_segment.
>>
>> But without scatter-gather, data must be linear, so do not use paged
>> mode unless NETIF_F_SG.
>>
>> Fixes: 15e36f5b8e98 ("udp: paged allocation with gso")
>> Reported-by: Sean Tranchetti 
>> Signed-off-by: Willem de Bruijn 
>> ---
>>  net/ipv4/ip_output.c  | 2 +-
>>  net/ipv6/ip6_output.c | 2 +-
>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> index b5e21eb198d8..b38731d8a44f 100644
>> --- a/net/ipv4/ip_output.c
>> +++ b/net/ipv4/ip_output.c
>> @@ -884,7 +884,7 @@ static int __ip_append_data(struct sock *sk,
>>
>>   exthdrlen = !skb ? rt->dst.header_len : 0;
>>   mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
>> - paged = !!cork->gso_size;
>> + paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG);
>>
>>   if (cork->tx_flags & SKBTX_ANY_SW_TSTAMP &&
>>   sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)
>> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>> index 7f4493080df6..35a940b9f208 100644
>> --- a/net/ipv6/ip6_output.c
>> +++ b/net/ipv6/ip6_output.c
>> @@ -1262,7 +1262,7 @@ static int __ip6_append_data(struct sock *sk,
>>   dst_exthdrlen = rt->dst.header_len - rt->rt6i_nfheader_len;
>>   }
>>
>> - paged = !!cork->gso_size;
>> + paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG);
>>   mtu = cork->gso_size ? IP6_MAX_MTU : cork->fragsize;
>>   orig_mtu = mtu;
>>
>>
>
> As I said, this wont help for stacked device
>
> bonding might advertise NETIF_F_SG, but one slave might not.

I don't quite follow. The reported crash happens in the protocol layer,
because of this check. With pagedlen we have not allocated
sufficient space for the skb_put.

if (!(rt->dst.dev->features&NETIF_F_SG)) {
unsigned int off;

off = skb->len;
if (getfrag(from, skb_put(skb, copy),
offset, copy, off, skb) < 0) {
__skb_trim(skb, off);
err = -EFAULT;
goto error;
}
} else {
int i = skb_shinfo(skb)->nr_frags;

Are you referring to a separate potential issue in the gso layer?
If a bonding device advertises SG, but a slave does not, then
skb_segment on the slave should build linear segs? I have not
tested that.

Re: [PATCH v1 1/4] media: rc: introduce BPF_PROG_IR_DECODER

2018-05-14 Thread Randy Dunlap

On 05/14/2018 02:10 PM, Sean Young wrote:
> Add support for BPF_PROG_IR_DECODER. This type of BPF program can call

Kconfig file below uses IR_BPF_DECODER instead of the symbol name above.

and then patch 3 says a third choice:
The context provided to a BPF_PROG_RAWIR_DECODER is a struct ir_raw_event;

> rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report
> that the last key should be repeated.
> 
> Signed-off-by: Sean Young 
> ---
>  drivers/media/rc/Kconfig  |  8 +++
>  drivers/media/rc/Makefile |  1 +
>  drivers/media/rc/ir-bpf-decoder.c | 93 +++
>  include/linux/bpf_types.h |  3 +
>  include/uapi/linux/bpf.h  | 16 +-
>  5 files changed, 120 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/media/rc/ir-bpf-decoder.c
> 
> diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig
> index eb2c3b6eca7f..10ad6167d87c 100644
> --- a/drivers/media/rc/Kconfig
> +++ b/drivers/media/rc/Kconfig
> @@ -120,6 +120,14 @@ config IR_IMON_DECODER
>  remote control and you would like to use it with a raw IR
>  receiver, or if you wish to use an encoder to transmit this IR.
>  
> +config IR_BPF_DECODER
> + bool "Enable IR raw decoder using BPF"
> + depends on BPF_SYSCALL
> + depends on RC_CORE=y
> + help
> +Enable this option to make it possible to load custom IR
> +decoders written in BPF.
> +
>  endif #RC_DECODERS
>  
>  menuconfig RC_DEVICES
> diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile
> index 2e1c87066f6c..12e1118430d0 100644
> --- a/drivers/media/rc/Makefile
> +++ b/drivers/media/rc/Makefile
> @@ -5,6 +5,7 @@ obj-y += keymaps/
>  obj-$(CONFIG_RC_CORE) += rc-core.o
>  rc-core-y := rc-main.o rc-ir-raw.o
>  rc-core-$(CONFIG_LIRC) += lirc_dev.o
> +rc-core-$(CONFIG_IR_BPF_DECODER) += ir-bpf-decoder.o


-- 
~Randy

Re: [net 1/1] net/mlx5: Fix build break when CONFIG_SMP=n

2018-05-14 Thread Randy Dunlap

On 05/14/2018 03:38 PM, Saeed Mahameed wrote:
> Avoid using the kernel's irq_descriptor and return IRQ vector affinity
> directly from the driver.
> 
> This fixes the following build break when CONFIG_SMP=n
> 
> include/linux/mlx5/driver.h: In function ‘mlx5_get_vector_affinity_hint’:
> include/linux/mlx5/driver.h:1299:13: error:
> ‘struct irq_desc’ has no member named ‘affinity_hint’
> 
> Fixes: 6082d9c9c94a ("net/mlx5: Fix mlx5_get_vector_affinity function")
> Signed-off-by: Saeed Mahameed 
> CC: Randy Dunlap 
> CC: Guenter Roeck 
> CC: Thomas Gleixner 
> Tested-by: Israel Rukshin 

Reported-by: kbuild test robot 
Reported-by: Randy Dunlap 
Tested-by: Randy Dunlap 

Thanks.

> ---
> 
> For -stable v4.14
> 
>  include/linux/mlx5/driver.h | 12 +---
>  1 file changed, 1 insertion(+), 11 deletions(-)
> 
> diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
> index 2a156c5dfadd..d703774982ca 100644
> --- a/include/linux/mlx5/driver.h
> +++ b/include/linux/mlx5/driver.h
> @@ -1286,17 +1286,7 @@ enum {
>  static inline const struct cpumask *
>  mlx5_get_vector_affinity_hint(struct mlx5_core_dev *dev, int vector)
>  {
> - struct irq_desc *desc;
> - unsigned int irq;
> - int eqn;
> - int err;
> -
> - err = mlx5_vector2eqn(dev, vector, &eqn, &irq);
> - if (err)
> - return NULL;
> -
> - desc = irq_to_desc(irq);
> - return desc->affinity_hint;
> + return dev->priv.irq_info[vector].mask;
>  }
>  
>  #endif /* MLX5_DRIVER_H */
> 


-- 
~Randy

Re: [PATCH net-next 2/3] gso: limit udp gso to egress-only virtual devices

2018-05-14 Thread Willem de Bruijn

On Mon, May 14, 2018 at 7:07 PM, Willem de Bruijn
 wrote:
> From: Willem de Bruijn 
>
> Until the udp receive stack supports large packets (UDP GRO), GSO
> packets must not loop from the egress to the ingress path.
>
> Revert the change that added NETIF_F_GSO_UDP_L4 to various virtual
> devices through NETIF_F_GSO_ENCAP_ALL as this included devices that
> may loop packets, such as veth and macvlan.
>
> Instead add it to specific devices that forward to another device's
> egress path: bonding and team.
>
> Fixes: 83aa025f535f ("udp: add gso support to virtual devices")
> CC: Alexander Duyck 
> Signed-off-by: Willem de Bruijn 
> ---

> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> index 9dbd390ace34..c6a9f0cafea2 100644
> --- a/drivers/net/team/team.c
> +++ b/drivers/net/team/team.c
> @@ -1026,7 +1026,8 @@ static void __team_compute_features(struct team *team)
> }
>
> team->dev->vlan_features = vlan_features;
> -   team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL;
> +   team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL |
> +NETIF_GSO_UDP_L4;

This has a typo. team.ko did not build automatically for me and caught it
with a full compile just too late.

Need to send a v2, sorry.

Re: [PATCH net-next 3/3] udp: only use paged allocation with scatter-gather

2018-05-14 Thread Eric Dumazet



On 05/14/2018 04:07 PM, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> Paged allocation stores most payload in skb frags. This helps udp gso
> by avoiding copying from the gso skb to segment skb in skb_segment.
> 
> But without scatter-gather, data must be linear, so do not use paged
> mode unless NETIF_F_SG.
> 
> Fixes: 15e36f5b8e98 ("udp: paged allocation with gso")
> Reported-by: Sean Tranchetti 
> Signed-off-by: Willem de Bruijn 
> ---
>  net/ipv4/ip_output.c  | 2 +-
>  net/ipv6/ip6_output.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index b5e21eb198d8..b38731d8a44f 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -884,7 +884,7 @@ static int __ip_append_data(struct sock *sk,
>  
>   exthdrlen = !skb ? rt->dst.header_len : 0;
>   mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
> - paged = !!cork->gso_size;
> + paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG);
>  
>   if (cork->tx_flags & SKBTX_ANY_SW_TSTAMP &&
>   sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 7f4493080df6..35a940b9f208 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1262,7 +1262,7 @@ static int __ip6_append_data(struct sock *sk,
>   dst_exthdrlen = rt->dst.header_len - rt->rt6i_nfheader_len;
>   }
>  
> - paged = !!cork->gso_size;
> + paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG);
>   mtu = cork->gso_size ? IP6_MAX_MTU : cork->fragsize;
>   orig_mtu = mtu;
>  
> 

As I said, this wont help for stacked device

bonding might advertise NETIF_F_SG, but one slave might not.

Re: [PATCH net-next] udp: Fix kernel panic in UDP GSO path

2018-05-14 Thread Eric Dumazet



On 05/14/2018 03:45 PM, stran...@codeaurora.org wrote:
> On 2018-05-11 17:16, Willem de Bruijn wrote:
> 
>>> Hmm, no, we absolutely need to fix GSO instead.
>>>
>>> Think of a bonding device (or any virtual devices), your patch wont avoid 
>>> the crash.
> 
> Hi Eric. Can you clarify what you mean by "fix GSO?" Is that just having the 
> GSO path work
> regardless of whether or not SG is enabled for the device?
>

Yes. GSO is a fallback, and must work all the time, not panic.

Re: [PATCH v3 bpf-next 0/2] bpf: enable stackmap with build_id in nmi

2018-05-14 Thread Daniel Borkmann

On 05/07/2018 07:50 PM, Song Liu wrote:
> Changes v2 -> v3:
>   Improve syntax based on suggestion by Tobin C. Harding.
> 
> Changes v1 -> v2:
>   1. Rename some variables to (hopefully) reduce confusion;
>   2. Check irq_work status with IRQ_WORK_BUSY (instead of work->sem);
>   3. In Kconfig, let BPF_SYSCALL select IRQ_WORK;
>   4. Add static to DEFINE_PER_CPU();
>5. Remove pr_info() in stack_map_init().
> 
> Song Liu (2):
>   bpf: enable stackmap with build_id in nmi context
>   bpf: add selftest for stackmap with build_id in NMI context
> 
>  init/Kconfig   |   1 +
>  kernel/bpf/stackmap.c  |  59 +++--
>  tools/testing/selftests/bpf/test_progs.c   | 134 
> +
>  tools/testing/selftests/bpf/urandom_read.c |  10 ++-
>  4 files changed, 196 insertions(+), 8 deletions(-)

Applied to bpf-next, thanks Song!

[PATCH net-next 3/3] udp: only use paged allocation with scatter-gather

2018-05-14 Thread Willem de Bruijn

From: Willem de Bruijn 

Paged allocation stores most payload in skb frags. This helps udp gso
by avoiding copying from the gso skb to segment skb in skb_segment.

But without scatter-gather, data must be linear, so do not use paged
mode unless NETIF_F_SG.

Fixes: 15e36f5b8e98 ("udp: paged allocation with gso")
Reported-by: Sean Tranchetti 
Signed-off-by: Willem de Bruijn 
---
 net/ipv4/ip_output.c  | 2 +-
 net/ipv6/ip6_output.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index b5e21eb198d8..b38731d8a44f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -884,7 +884,7 @@ static int __ip_append_data(struct sock *sk,
 
exthdrlen = !skb ? rt->dst.header_len : 0;
mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
-   paged = !!cork->gso_size;
+   paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG);
 
if (cork->tx_flags & SKBTX_ANY_SW_TSTAMP &&
sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7f4493080df6..35a940b9f208 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1262,7 +1262,7 @@ static int __ip6_append_data(struct sock *sk,
dst_exthdrlen = rt->dst.header_len - rt->rt6i_nfheader_len;
}
 
-   paged = !!cork->gso_size;
+   paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG);
mtu = cork->gso_size ? IP6_MAX_MTU : cork->fragsize;
orig_mtu = mtu;
 
-- 
2.17.0.441.gb46fe60e1d-goog

Re: [PATCH bpf-next v2] samples/bpf: xdp_monitor, accept short options

2018-05-14 Thread Daniel Borkmann

On 05/14/2018 12:20 PM, Jesper Dangaard Brouer wrote:
> 
> On Mon, 14 May 2018 17:29:15 +0900 Prashant Bhole 
>  wrote:
> 
>> Updated optstring parameter for getopt_long() to accept short options.
>> Also updated usage() function.
>>
>> Signed-off-by: Prashant Bhole 
> 
> Acked-by: Jesper Dangaard Brouer 

Applied to bpf-next, thanks everyone!

Re: [PATCH net-next] udp: Fix kernel panic in UDP GSO path

2018-05-14 Thread Willem de Bruijn

>> Paged skbuffs is an optimization for gso, but the feature should
>> continue to work even if gso skbs are linear, indeed (if at the cost
>> of copying during skb_segment).
>>
>> We need to make paged contingent on scatter-gather. Rough
>> patch below. That is for ipv4 only, the same will be needed for ipv6.
>>
>> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>> index b5e21eb198d8..b38731d8a44f 100644
>> --- a/net/ipv4/ip_output.c
>> +++ b/net/ipv4/ip_output.c
>> @@ -884,7 +884,7 @@ static int __ip_append_data(struct sock *sk,
>>
>> exthdrlen = !skb ? rt->dst.header_len : 0;
>> mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
>> -   paged = !!cork->gso_size;
>> +   paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG);
>
>
> Hi Willem. That's definitely a much cleaner patch than ours since it allows
> the GSO to continue without failure.
> We tried it on both the IPv4 and IPv6 path and didn't see the crash in
> either case.

Thanks for testing. I have a small set of fixes to udp gso, including
this one. Let me send them right away.

[PATCH net-next 2/3] gso: limit udp gso to egress-only virtual devices

2018-05-14 Thread Willem de Bruijn

From: Willem de Bruijn 

Until the udp receive stack supports large packets (UDP GRO), GSO
packets must not loop from the egress to the ingress path.

Revert the change that added NETIF_F_GSO_UDP_L4 to various virtual
devices through NETIF_F_GSO_ENCAP_ALL as this included devices that
may loop packets, such as veth and macvlan.

Instead add it to specific devices that forward to another device's
egress path: bonding and team.

Fixes: 83aa025f535f ("udp: add gso support to virtual devices")
CC: Alexander Duyck 
Signed-off-by: Willem de Bruijn 
---
 drivers/net/bonding/bond_main.c | 5 +++--
 drivers/net/team/team.c | 5 +++--
 include/linux/netdev_features.h | 1 -
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 4176e1d95f47..d7b58370ae77 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1107,7 +1107,8 @@ static void bond_compute_features(struct bonding *bond)
 
 done:
bond_dev->vlan_features = vlan_features;
-   bond_dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL;
+   bond_dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL |
+   NETIF_F_GSO_UDP_L4;
bond_dev->gso_max_segs = gso_max_segs;
netif_set_gso_max_size(bond_dev, gso_max_size);
 
@@ -4263,7 +4264,7 @@ void bond_setup(struct net_device *bond_dev)
NETIF_F_HW_VLAN_CTAG_RX |
NETIF_F_HW_VLAN_CTAG_FILTER;
 
-   bond_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+   bond_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL | NETIF_F_GSO_UDP_L4;
bond_dev->features |= bond_dev->hw_features;
 }
 
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 9dbd390ace34..c6a9f0cafea2 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1026,7 +1026,8 @@ static void __team_compute_features(struct team *team)
}
 
team->dev->vlan_features = vlan_features;
-   team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL;
+   team->dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL |
+NETIF_GSO_UDP_L4;
team->dev->hard_header_len = max_hard_header_len;
 
team->dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
@@ -2117,7 +2118,7 @@ static void team_setup(struct net_device *dev)
   NETIF_F_HW_VLAN_CTAG_RX |
   NETIF_F_HW_VLAN_CTAG_FILTER;
 
-   dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+   dev->hw_features |= NETIF_F_GSO_ENCAP_ALL | NETIF_F_GSO_UDP_L4;
dev->features |= dev->hw_features;
 }
 
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index c87c3a3453c1..623bb8ced060 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -220,7 +220,6 @@ enum {
 NETIF_F_GSO_GRE_CSUM | \
 NETIF_F_GSO_IPXIP4 |   \
 NETIF_F_GSO_IPXIP6 |   \
-NETIF_F_GSO_UDP_L4 |   \
 NETIF_F_GSO_UDP_TUNNEL |   \
 NETIF_F_GSO_UDP_TUNNEL_CSUM)
 
-- 
2.17.0.441.gb46fe60e1d-goog

[PATCH net-next 0/3] udp gso fixes

2018-05-14 Thread Willem de Bruijn

From: Willem de Bruijn 

A few small fixes:
- disallow segmentation with XFRM
- do not leak gso packets into the ingress path
- fix a panic if scatter-gather is disabled

Willem de Bruijn (3):
  udp: exclude gso from xfrm paths
  gso: limit udp gso to egress-only virtual devices
  udp: only use paged allocation with scatter-gather

 drivers/net/bonding/bond_main.c | 5 +++--
 drivers/net/team/team.c | 5 +++--
 include/linux/netdev_features.h | 1 -
 net/ipv4/ip_output.c| 2 +-
 net/ipv4/udp.c  | 3 ++-
 net/ipv6/ip6_output.c   | 2 +-
 net/ipv6/udp.c  | 3 ++-
 7 files changed, 12 insertions(+), 9 deletions(-)

-- 
2.17.0.441.gb46fe60e1d-goog

[PATCH net-next 1/3] udp: exclude gso from xfrm paths

2018-05-14 Thread Willem de Bruijn

From: Willem de Bruijn 

UDP GSO conflicts with transformations in the XFRM layer.
Return an error if GSO is attempted.

Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
CC: Michal Kubecek 
Signed-off-by: Willem de Bruijn 
---
 net/ipv4/udp.c | 3 ++-
 net/ipv6/udp.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ff4d4ba67735..d71f1f3e1155 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -788,7 +788,8 @@ static int udp_send_skb(struct sk_buff *skb, struct flowi4 
*fl4,
return -EINVAL;
if (sk->sk_no_check_tx)
return -EINVAL;
-   if (skb->ip_summed != CHECKSUM_PARTIAL || is_udplite)
+   if (skb->ip_summed != CHECKSUM_PARTIAL || is_udplite ||
+   dst_xfrm(skb_dst(skb)))
return -EIO;
 
skb_shinfo(skb)->gso_size = cork->gso_size;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 2839c1bd1e58..426c9d2b418d 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1053,7 +1053,8 @@ static int udp_v6_send_skb(struct sk_buff *skb, struct 
flowi6 *fl6,
return -EINVAL;
if (udp_sk(sk)->no_check6_tx)
return -EINVAL;
-   if (skb->ip_summed != CHECKSUM_PARTIAL || is_udplite)
+   if (skb->ip_summed != CHECKSUM_PARTIAL || is_udplite ||
+   dst_xfrm(skb_dst(skb)))
return -EIO;
 
skb_shinfo(skb)->gso_size = cork->gso_size;
-- 
2.17.0.441.gb46fe60e1d-goog

Re: [PATCH net-next] udp: Fix kernel panic in UDP GSO path

2018-05-14 Thread stranche


On 2018-05-11 17:16, Willem de Bruijn wrote:


Hmm, no, we absolutely need to fix GSO instead.

Think of a bonding device (or any virtual devices), your patch wont 
avoid the crash.


Hi Eric. Can you clarify what you mean by "fix GSO?" Is that just having 
the GSO path work

regardless of whether or not SG is enabled for the device?



Thanks for reporting the issue.

Paged skbuffs is an optimization for gso, but the feature should
continue to work even if gso skbs are linear, indeed (if at the cost
of copying during skb_segment).

We need to make paged contingent on scatter-gather. Rough
patch below. That is for ipv4 only, the same will be needed for ipv6.

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index b5e21eb198d8..b38731d8a44f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -884,7 +884,7 @@ static int __ip_append_data(struct sock *sk,

exthdrlen = !skb ? rt->dst.header_len : 0;
mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
-   paged = !!cork->gso_size;
+   paged = cork->gso_size && (rt->dst.dev->features & NETIF_F_SG);


Hi Willem. That's definitely a much cleaner patch than ours since it 
allows the GSO to continue without failure.
We tried it on both the IPv4 and IPv6 path and didn't see the crash in 
either case.


-
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project

Re: [PATCH bpf-next v5 3/6] bpf: Add IPv6 Segment Routing helpers

2018-05-14 Thread Daniel Borkmann

On 05/12/2018 07:25 PM, Mathieu Xhonneux wrote:
[...]
> +BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
> +const void *, from, u32, len)
> +{
> +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
> + struct seg6_bpf_srh_state *srh_state =
> + this_cpu_ptr(&seg6_bpf_srh_states);
> + void *srh_tlvs, *srh_end, *ptr;
> + struct ipv6_sr_hdr *srh;
> + int srhoff = 0;
> +
> + if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
> + return -EINVAL;
> +
> + srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
> + srh_tlvs = (void *)((char *)srh + ((srh->first_segment + 1) << 4));
> + srh_end = (void *)((char *)srh + sizeof(*srh) + srh_state->hdrlen);

Do we need to check that this cannot go out of bounds wrt skb data?

> + ptr = skb->data + offset;
> + if (ptr >= srh_tlvs && ptr + len <= srh_end)
> + srh_state->valid = 0;
> + else if (ptr < (void *)&srh->flags ||
> +  ptr + len > (void *)&srh->segments)
> + return -EFAULT;
> +
> + if (unlikely(bpf_try_make_writable(skb, offset + len)))
> + return -EFAULT;
> +
> + memcpy(ptr, from, len);

You have a use after free here. bpf_try_make_writable() is potentially changing
underlying skb->data (e.g. see pskb_expand_head()). Therefore memcpy()'ing into
cached ptr is invalid.

> + return 0;
> +#else /* CONFIG_IPV6_SEG6_BPF */
> + return -EOPNOTSUPP;
> +#endif
> +}
> +
> +static const struct bpf_func_proto bpf_lwt_seg6_store_bytes_proto = {
> + .func   = bpf_lwt_seg6_store_bytes,
> + .gpl_only   = false,
> + .ret_type   = RET_INTEGER,
> + .arg1_type  = ARG_PTR_TO_CTX,
> + .arg2_type  = ARG_ANYTHING,
> + .arg3_type  = ARG_PTR_TO_MEM,
> + .arg4_type  = ARG_CONST_SIZE
> +};
> +
> +BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
> +u32, action, void *, param, u32, param_len)
> +{
> +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
> + struct seg6_bpf_srh_state *srh_state =
> + this_cpu_ptr(&seg6_bpf_srh_states);
> + struct ipv6_sr_hdr *srh;
> + int srhoff = 0;
> + int err;
> +
> + if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
> + return -EINVAL;
> + srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
> +
> + if (!srh_state->valid) {
> + if (unlikely((srh_state->hdrlen & 7) != 0))
> + return -EBADMSG;
> +
> + srh->hdrlen = (u8)(srh_state->hdrlen >> 3);
> + if (unlikely(!seg6_validate_srh(srh, (srh->hdrlen + 1) << 3)))
> + return -EBADMSG;
> +
> + srh_state->valid = 1;
> + }
> +
> + switch (action) {
> + case SEG6_LOCAL_ACTION_END_X:
> + if (param_len != sizeof(struct in6_addr))
> + return -EINVAL;
> + return seg6_lookup_nexthop(skb, (struct in6_addr *)param, 0);
> + case SEG6_LOCAL_ACTION_END_T:
> + if (param_len != sizeof(int))
> + return -EINVAL;
> + return seg6_lookup_nexthop(skb, NULL, *(int *)param);
> + case SEG6_LOCAL_ACTION_END_B6:
> + err = bpf_push_seg6_encap(skb, BPF_LWT_ENCAP_SEG6_INLINE,
> +   param, param_len);
> + if (!err)
> + srh_state->hdrlen =
> + ((struct ipv6_sr_hdr *)param)->hdrlen << 3;
> + return err;
> + case SEG6_LOCAL_ACTION_END_B6_ENCAP:
> + err = bpf_push_seg6_encap(skb, BPF_LWT_ENCAP_SEG6,
> +   param, param_len);
> + if (!err)
> + srh_state->hdrlen =
> + ((struct ipv6_sr_hdr *)param)->hdrlen << 3;
> + return err;
> + default:
> + return -EINVAL;
> + }
> +#else /* CONFIG_IPV6_SEG6_BPF */
> + return -EOPNOTSUPP;
> +#endif
> +}
> +
> +static const struct bpf_func_proto bpf_lwt_seg6_action_proto = {
> + .func   = bpf_lwt_seg6_action,
> + .gpl_only   = false,
> + .ret_type   = RET_INTEGER,
> + .arg1_type  = ARG_PTR_TO_CTX,
> + .arg2_type  = ARG_ANYTHING,
> + .arg3_type  = ARG_PTR_TO_MEM,
> + .arg4_type  = ARG_CONST_SIZE
> +};
> +
> +BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
> +s32, len)
> +{
> +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
> + struct seg6_bpf_srh_state *srh_state =
> + this_cpu_ptr(&seg6_bpf_srh_states);
> + void *srh_end, *srh_tlvs, *ptr;
> + struct ipv6_sr_hdr *srh;
> + struct ipv6hdr *hdr;
> + int srhoff = 0;
> + int ret;
> +
> + if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
> + return -EINVAL;
> + srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
> +
> + srh_tlvs = (void *)((unsigned char *)srh + sizeof(*srh) +
> +

[net 1/1] net/mlx5: Fix build break when CONFIG_SMP=n

2018-05-14 Thread Saeed Mahameed

Avoid using the kernel's irq_descriptor and return IRQ vector affinity
directly from the driver.

This fixes the following build break when CONFIG_SMP=n

include/linux/mlx5/driver.h: In function ‘mlx5_get_vector_affinity_hint’:
include/linux/mlx5/driver.h:1299:13: error:
‘struct irq_desc’ has no member named ‘affinity_hint’

Fixes: 6082d9c9c94a ("net/mlx5: Fix mlx5_get_vector_affinity function")
Signed-off-by: Saeed Mahameed 
CC: Randy Dunlap 
CC: Guenter Roeck 
CC: Thomas Gleixner 
Tested-by: Israel Rukshin 
---

For -stable v4.14

 include/linux/mlx5/driver.h | 12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 2a156c5dfadd..d703774982ca 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1286,17 +1286,7 @@ enum {
 static inline const struct cpumask *
 mlx5_get_vector_affinity_hint(struct mlx5_core_dev *dev, int vector)
 {
-   struct irq_desc *desc;
-   unsigned int irq;
-   int eqn;
-   int err;
-
-   err = mlx5_vector2eqn(dev, vector, &eqn, &irq);
-   if (err)
-   return NULL;
-
-   desc = irq_to_desc(irq);
-   return desc->affinity_hint;
+   return dev->priv.irq_info[vector].mask;
 }
 
 #endif /* MLX5_DRIVER_H */
-- 
2.17.0

Re: [PATCH] net/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'

2018-05-14 Thread Saeed Mahameed

On Mon, May 14, 2018 at 11:56 AM, David Miller  wrote:
> From: Christophe JAILLET 
> Date: Sat, 12 May 2018 19:09:25 +0200
>
>> 'out' is allocated with 'kvzalloc()'. 'kvfree()' must be used to free it.
>>
>> Signed-off-by: Christophe JAILLET 
>
> Saeed, I assume I will see this in one of your forthcoming pull
> requests.
>
> Thanks.

In case this is for net-next,  I will apply v3 to mlx5-next once
Christophe adds the "Fixes" tags according to Eric's request.
if it is for net (RC) then you can go ahead and apply v3 to net branch.

Thanks,
Saeed.

[PATCH net-stable 03/24] hv_netvsc: Rename tx_send_table to tx_table

2018-05-14 Thread Stephen Hemminger

From: Haiyang Zhang 

commit 39e91cfbf6f5fb26ba64cc2e8874372baf1671e7 upstream.

Simplify the variable name: tx_send_table

Signed-off-by: Haiyang Zhang 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/hyperv_net.h | 2 +-
 drivers/net/hyperv/netvsc.c | 2 +-
 drivers/net/hyperv/netvsc_drv.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 8144b938460f..94de7ca1e52e 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -734,7 +734,7 @@ struct net_device_context {
 
u32 tx_checksum_mask;
 
-   u32 tx_send_table[VRSS_SEND_TAB_SIZE];
+   u32 tx_table[VRSS_SEND_TAB_SIZE];
 
/* Ethtool settings */
bool udp4_l4_hash;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index a6bafcf55776..03b44ec805db 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1107,7 +1107,7 @@ static void netvsc_send_table(struct hv_device *hdev,
  nvmsg->msg.v5_msg.send_table.offset);
 
for (i = 0; i < count; i++)
-   net_device_ctx->tx_send_table[i] = tab[i];
+   net_device_ctx->tx_table[i] = tab[i];
 }
 
 static void netvsc_send_vf(struct net_device_context *net_device_ctx,
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 8b358d3fc057..f5d299b82bac 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -234,8 +234,8 @@ static inline int netvsc_get_tx_queue(struct net_device 
*ndev,
struct sock *sk = skb->sk;
int q_idx;
 
-   q_idx = ndc->tx_send_table[netvsc_get_hash(skb, ndc) &
-  (VRSS_SEND_TAB_SIZE - 1)];
+   q_idx = ndc->tx_table[netvsc_get_hash(skb, ndc) &
+ (VRSS_SEND_TAB_SIZE - 1)];
 
/* If queue index changed record the new value */
if (q_idx != old_idx &&
-- 
2.17.0

[PATCH net-stable 05/24] hv_netvsc: Set tx_table to equal weight after subchannels open

2018-05-14 Thread Stephen Hemminger

From: Haiyang Zhang 

commit a6fb6aa3cfa9047b62653dbcfc9bcde6e2272b41 upstream.

In some cases, like internal vSwitch, the host doesn't provide
send indirection table updates. This patch sets the table to be
equal weight after subchannels are all open. Otherwise, all workload
will be on one TX channel.

As tested, this patch has largely increased the throughput over
internal vSwitch.

Signed-off-by: Haiyang Zhang 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/rndis_filter.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index addf9f69c58c..0648eebda829 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1114,6 +1114,9 @@ void rndis_set_subchannel(struct work_struct *w)
netif_set_real_num_tx_queues(ndev, nvdev->num_chn);
netif_set_real_num_rx_queues(ndev, nvdev->num_chn);
 
+   for (i = 0; i < VRSS_SEND_TAB_SIZE; i++)
+   ndev_ctx->tx_table[i] = i % nvdev->num_chn;
+
rtnl_unlock();
return;
 
-- 
2.17.0

[PATCH net-stable 09/24] hv_netvsc: Use the num_online_cpus() for channel limit

2018-05-14 Thread Stephen Hemminger

From: Haiyang Zhang 

commit 25a39f7f975c3c26a0052fbf9b59201c06744332 upstream

Since we no longer localize channel/CPU affiliation within one NUMA
node, num_online_cpus() is used as the number of channel cap, instead of
the number of processors in a NUMA node.

This patch allows a bigger range for tuning the number of channels.

Signed-off-by: Haiyang Zhang 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/rndis_filter.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index 0c99d9926085..05109ed5377c 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1221,7 +1221,6 @@ struct netvsc_device *rndis_filter_device_add(struct 
hv_device *dev,
struct ndis_recv_scale_cap rsscap;
u32 rsscap_size = sizeof(struct ndis_recv_scale_cap);
u32 mtu, size;
-   const struct cpumask *node_cpu_mask;
u32 num_possible_rss_qs;
int i, ret;
 
@@ -1290,14 +1289,8 @@ struct netvsc_device *rndis_filter_device_add(struct 
hv_device *dev,
if (ret || rsscap.num_recv_que < 2)
goto out;
 
-   /*
-* We will limit the VRSS channels to the number CPUs in the NUMA node
-* the primary channel is currently bound to.
-*
-* This also guarantees that num_possible_rss_qs <= num_online_cpus
-*/
-   node_cpu_mask = cpumask_of_node(cpu_to_node(dev->channel->target_cpu));
-   num_possible_rss_qs = min_t(u32, cpumask_weight(node_cpu_mask),
+   /* This guarantees that num_possible_rss_qs <= num_online_cpus */
+   num_possible_rss_qs = min_t(u32, num_online_cpus(),
rsscap.num_recv_que);
 
net_device->max_chn = min_t(u32, VRSS_CHANNEL_MAX, num_possible_rss_qs);
-- 
2.17.0

[PATCH net-stable 10/24] hv_netvsc: avoid retry on send during shutdown

2018-05-14 Thread Stephen Hemminger

From: Stephen Hemminger 

commit 12f69661a49446840d742d8feb593ace022d9f66 upstream

Change the initialization order so that the device is ready to transmit
(ie connect vsp is completed) before setting the internal reference
to the device with RCU.

This avoids any races on initialization and prevents retry issues
on shutdown.

Signed-off-by: Stephen Hemminger 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/netvsc.c | 24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 22da6399b37a..839b9d6ecb41 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -850,13 +850,6 @@ int netvsc_send(struct net_device *ndev,
if (unlikely(!net_device || net_device->destroy))
return -ENODEV;
 
-   /* We may race with netvsc_connect_vsp()/netvsc_init_buf() and get
-* here before the negotiation with the host is finished and
-* send_section_map may not be allocated yet.
-*/
-   if (unlikely(!net_device->send_section_map))
-   return -EAGAIN;
-
nvchan = &net_device->chan_table[packet->q_idx];
packet->send_buf_index = NETVSC_INVALID_INDEX;
packet->cp_partial = false;
@@ -864,10 +857,8 @@ int netvsc_send(struct net_device *ndev,
/* Send control message directly without accessing msd (Multi-Send
 * Data) field which may be changed during data packet processing.
 */
-   if (!skb) {
-   cur_send = packet;
-   goto send_now;
-   }
+   if (!skb)
+   return netvsc_send_pkt(device, packet, net_device, pb, skb);
 
/* batch packets in send buffer if possible */
msdp = &nvchan->msd;
@@ -951,7 +942,6 @@ int netvsc_send(struct net_device *ndev,
}
}
 
-send_now:
if (cur_send)
ret = netvsc_send_pkt(device, cur_send, net_device, pb, skb);
 
@@ -1308,11 +1298,6 @@ struct netvsc_device *netvsc_device_add(struct hv_device 
*device,
 
napi_enable(&net_device->chan_table[0].napi);
 
-   /* Writing nvdev pointer unlocks netvsc_send(), make sure chn_table is
-* populated.
-*/
-   rcu_assign_pointer(net_device_ctx->nvdev, net_device);
-
/* Connect with the NetVsp */
ret = netvsc_connect_vsp(device, net_device, device_info);
if (ret != 0) {
@@ -1321,6 +1306,11 @@ struct netvsc_device *netvsc_device_add(struct hv_device 
*device,
goto close;
}
 
+   /* Writing nvdev pointer unlocks netvsc_send(), make sure chn_table is
+* populated.
+*/
+   rcu_assign_pointer(net_device_ctx->nvdev, net_device);
+
return net_device;
 
 close:
-- 
2.17.0

[PATCH net-stable 02/24] hv_netvsc: Rename ind_table to rx_table

2018-05-14 Thread Stephen Hemminger

From: Haiyang Zhang 

commit 47371300dfc269dd8d150e5b872bdbbda98ba809 upstream.

Rename this variable because it is the Receive indirection
table.

Signed-off-by: Haiyang Zhang 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/hyperv_net.h   | 2 +-
 drivers/net/hyperv/netvsc_drv.c   | 4 ++--
 drivers/net/hyperv/rndis_filter.c | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 4f3afcf92a7c..8144b938460f 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -179,7 +179,7 @@ struct rndis_device {
 
u8 hw_mac_adr[ETH_ALEN];
u8 rss_key[NETVSC_HASH_KEYLEN];
-   u16 ind_table[ITAB_NUM];
+   u16 rx_table[ITAB_NUM];
 };
 
 
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index e40be71df9d6..8b358d3fc057 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1378,7 +1378,7 @@ static int netvsc_get_rxfh(struct net_device *dev, u32 
*indir, u8 *key,
rndis_dev = ndev->extension;
if (indir) {
for (i = 0; i < ITAB_NUM; i++)
-   indir[i] = rndis_dev->ind_table[i];
+   indir[i] = rndis_dev->rx_table[i];
}
 
if (key)
@@ -1408,7 +1408,7 @@ static int netvsc_set_rxfh(struct net_device *dev, const 
u32 *indir,
return -EINVAL;
 
for (i = 0; i < ITAB_NUM; i++)
-   rndis_dev->ind_table[i] = indir[i];
+   rndis_dev->rx_table[i] = indir[i];
}
 
if (!key) {
diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index 065b204d8e17..addf9f69c58c 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -759,7 +759,7 @@ int rndis_filter_set_rss_param(struct rndis_device *rdev,
/* Set indirection table entries */
itab = (u32 *)(rssp + 1);
for (i = 0; i < ITAB_NUM; i++)
-   itab[i] = rdev->ind_table[i];
+   itab[i] = rdev->rx_table[i];
 
/* Set hask key values */
keyp = (u8 *)((unsigned long)rssp + rssp->kashkey_offset);
@@ -1284,8 +1284,8 @@ struct netvsc_device *rndis_filter_device_add(struct 
hv_device *dev,
net_device->num_chn = min(net_device->max_chn, device_info->num_chn);
 
for (i = 0; i < ITAB_NUM; i++)
-   rndis_device->ind_table[i] = ethtool_rxfh_indir_default(i,
-   net_device->num_chn);
+   rndis_device->rx_table[i] = ethtool_rxfh_indir_default(
+   i, net_device->num_chn);
 
atomic_set(&net_device->open_chn, 1);
vmbus_set_sc_create_callback(dev->channel, netvsc_sc_open);
-- 
2.17.0

[PATCH net-stable 04/24] hv_netvsc: Add initialization of tx_table in netvsc_device_add()

2018-05-14 Thread Stephen Hemminger

From: Haiyang Zhang 

commit 6b0cbe315868d613123cf387052ccda5f09d49ea upstream.

tx_table is part of the private data of kernel net_device. It is only
zero-ed out when allocating net_device.

We may recreate netvsc_device w/o recreating net_device, so the private
netdev data, including tx_table, are not zeroed. It may contain channel
numbers for the older netvsc_device.

This patch adds initialization of tx_table each time we recreate
netvsc_device.

Signed-off-by: Haiyang Zhang 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/netvsc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 03b44ec805db..73999214d444 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1252,6 +1252,9 @@ struct netvsc_device *netvsc_device_add(struct hv_device 
*device,
if (!net_device)
return ERR_PTR(-ENOMEM);
 
+   for (i = 0; i < VRSS_SEND_TAB_SIZE; i++)
+   net_device_ctx->tx_table[i] = 0;
+
net_device->ring_size = ring_size;
 
/* Because the device uses NAPI, all the interrupt batching and
-- 
2.17.0

[PATCH net-stable 14/24] hv_netvsc: fix race in napi poll when rescheduling

2018-05-14 Thread Stephen Hemminger

From: Stephen Hemminger 

commit d64e38ae690e3337db0d38d9b149a193a1646c4b upstream

There is a race between napi_reschedule and re-enabling interrupts
which could lead to missed host interrrupts.  This occurs when
interrupts are re-enabled (hv_end_read) and vmbus irq callback
(netvsc_channel_cb) has already scheduled NAPI.

Signed-off-by: Stephen Hemminger 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/netvsc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index b7720b65f5d6..7e0eb99571af 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1205,9 +1205,10 @@ int netvsc_poll(struct napi_struct *napi, int budget)
if (send_recv_completions(ndev, net_device, nvchan) == 0 &&
work_done < budget &&
napi_complete_done(napi, work_done) &&
-   hv_end_read(&channel->inbound)) {
+   hv_end_read(&channel->inbound) &&
+   napi_schedule_prep(napi)) {
hv_begin_read(&channel->inbound);
-   napi_reschedule(napi);
+   __napi_schedule(napi);
}
 
/* Driver may overshoot since multiple packets per descriptor */
-- 
2.17.0

[PATCH net-stable 12/24] hv_netvsc: fix error unwind handling if vmbus_open fails

2018-05-14 Thread Stephen Hemminger

From: Stephen Hemminger 

commit fcfb4a00d1e514e8313277a01ef919de1113025b upstream

Need to delete NAPI association if vmbus_open fails.

Signed-off-by: Stephen Hemminger 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/netvsc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 839b9d6ecb41..b7720b65f5d6 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1288,7 +1288,6 @@ struct netvsc_device *netvsc_device_add(struct hv_device 
*device,
 net_device->chan_table);
 
if (ret != 0) {
-   netif_napi_del(&net_device->chan_table[0].napi);
netdev_err(ndev, "unable to open channel: %d\n", ret);
goto cleanup;
}
@@ -1321,6 +1320,7 @@ struct netvsc_device *netvsc_device_add(struct hv_device 
*device,
vmbus_close(device->channel);
 
 cleanup:
+   netif_napi_del(&net_device->chan_table[0].napi);
free_netvsc_device(&net_device->rcu);
 
return ERR_PTR(ret);
-- 
2.17.0

[PATCH net-stable 07/24] hv_netvsc: preserve hw_features on mtu/channels/ringparam changes

2018-05-14 Thread Stephen Hemminger

From: Vitaly Kuznetsov 

commit aefd80e874e98a864915df5b7d90824a4340b450 upstream.

rndis_filter_device_add() is called both from netvsc_probe() when we
initially create the device and from set channels/mtu/ringparam
routines where we basically remove the device and add it back.

hw_features is reset in rndis_filter_device_add() and filled with
host data. However, we lose all additional flags which are set outside
of the driver, e.g. register_netdevice() adds NETIF_F_SOFT_FEATURES and
many others.

Unfortunately, calls to rndis_{query_hwcaps(), _set_offload_params()}
calls cannot be avoided on every RNDIS reset: host expects us to set
required features explicitly. Moreover, in theory hardware capabilities
can change and we need to reflect the change in hw_features.

Reset net->hw_features bits according to host data in
rndis_netdev_set_hwcaps(), clear corresponding feature bits
from net->features in case some features went missing (will never happen
in real life I guess but let's be consistent).

Signed-off-by: Vitaly Kuznetsov 
Reviewed-by: Haiyang Zhang 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/hyperv_net.h   |   4 +
 drivers/net/hyperv/netvsc_drv.c   |   2 +-
 drivers/net/hyperv/rndis_filter.c | 136 +-
 3 files changed, 83 insertions(+), 59 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 94de7ca1e52e..a3f628c3c9ed 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -659,6 +659,10 @@ struct nvsp_message {
 #define NETVSC_RECEIVE_BUFFER_ID   0xcafe
 #define NETVSC_SEND_BUFFER_ID  0
 
+#define NETVSC_SUPPORTED_HW_FEATURES (NETIF_F_RXCSUM | NETIF_F_IP_CSUM | \
+ NETIF_F_TSO | NETIF_F_IPV6_CSUM | \
+ NETIF_F_TSO6)
+
 #define VRSS_SEND_TAB_SIZE 16  /* must be power of 2 */
 #define VRSS_CHANNEL_MAX 64
 #define VRSS_CHANNEL_DEFAULT 8
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index f5d299b82bac..cfaf433b0cda 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1955,7 +1955,7 @@ static int netvsc_probe(struct hv_device *dev,
 
memcpy(net->dev_addr, device_info.mac_adr, ETH_ALEN);
 
-   /* hw_features computed in rndis_filter_device_add */
+   /* hw_features computed in rndis_netdev_set_hwcaps() */
net->features = net->hw_features |
NETIF_F_HIGHDMA | NETIF_F_SG |
NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX;
diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index 0648eebda829..be57639bee29 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1131,69 +1131,20 @@ void rndis_set_subchannel(struct work_struct *w)
rtnl_unlock();
 }
 
-struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
- struct netvsc_device_info *device_info)
+static int rndis_netdev_set_hwcaps(struct rndis_device *rndis_device,
+  struct netvsc_device *nvdev)
 {
-   struct net_device *net = hv_get_drvdata(dev);
+   struct net_device *net = rndis_device->ndev;
struct net_device_context *net_device_ctx = netdev_priv(net);
-   struct netvsc_device *net_device;
-   struct rndis_device *rndis_device;
struct ndis_offload hwcaps;
struct ndis_offload_params offloads;
-   struct ndis_recv_scale_cap rsscap;
-   u32 rsscap_size = sizeof(struct ndis_recv_scale_cap);
unsigned int gso_max_size = GSO_MAX_SIZE;
-   u32 mtu, size;
-   const struct cpumask *node_cpu_mask;
-   u32 num_possible_rss_qs;
-   int i, ret;
-
-   rndis_device = get_rndis_device();
-   if (!rndis_device)
-   return ERR_PTR(-ENODEV);
-
-   /*
-* Let the inner driver handle this first to create the netvsc channel
-* NOTE! Once the channel is created, we may get a receive callback
-* (RndisFilterOnReceive()) before this call is completed
-*/
-   net_device = netvsc_device_add(dev, device_info);
-   if (IS_ERR(net_device)) {
-   kfree(rndis_device);
-   return net_device;
-   }
-
-   /* Initialize the rndis device */
-   net_device->max_chn = 1;
-   net_device->num_chn = 1;
-
-   net_device->extension = rndis_device;
-   rndis_device->ndev = net;
-
-   /* Send the rndis initialization message */
-   ret = rndis_filter_init_device(rndis_device, net_device);
-   if (ret != 0)
-   goto err_dev_remv;
-
-   /* Get the MTU from the host */
-   size = sizeof(u32);
-   ret = rndis_filter_query_device(rndis_device, net_device,
-   RNDIS_OID_GEN_MAXIMUM_FRAME_SIZE,
-   &mtu, &size);
-   if (ret == 0 &

[PATCH net-stable 08/24] hv_netvsc: empty current transmit aggregation if flow blocked

2018-05-14 Thread Stephen Hemminger

From: Stephen Hemminger 

commit cfd8afd986cdb59ea9adac873c5082498a1eb7c0 upstream

If the transmit queue is known full, then don't keep aggregating
data. And the cp_partial flag which indicates that the current
aggregation buffer is full can be folded in to avoid more
conditionals.

Signed-off-by: Stephen Hemminger 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/hyperv_net.h   |  2 +-
 drivers/net/hyperv/netvsc.c   | 36 ++-
 drivers/net/hyperv/netvsc_drv.c   |  2 +-
 drivers/net/hyperv/rndis_filter.c |  3 +--
 4 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index a3f628c3c9ed..fd51a329e36e 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -192,7 +192,7 @@ struct netvsc_device *netvsc_device_add(struct hv_device 
*device,
const struct netvsc_device_info *info);
 int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx);
 void netvsc_device_remove(struct hv_device *device);
-int netvsc_send(struct net_device_context *ndc,
+int netvsc_send(struct net_device *net,
struct hv_netvsc_packet *packet,
struct rndis_message *rndis_msg,
struct hv_page_buffer *page_buffer,
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 4bc8a1d529d9..22da6399b37a 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -700,13 +700,13 @@ static u32 netvsc_get_next_send_section(struct 
netvsc_device *net_device)
return NETVSC_INVALID_INDEX;
 }
 
-static u32 netvsc_copy_to_send_buf(struct netvsc_device *net_device,
-  unsigned int section_index,
-  u32 pend_size,
-  struct hv_netvsc_packet *packet,
-  struct rndis_message *rndis_msg,
-  struct hv_page_buffer *pb,
-  struct sk_buff *skb)
+static void netvsc_copy_to_send_buf(struct netvsc_device *net_device,
+   unsigned int section_index,
+   u32 pend_size,
+   struct hv_netvsc_packet *packet,
+   struct rndis_message *rndis_msg,
+   struct hv_page_buffer *pb,
+   bool xmit_more)
 {
char *start = net_device->send_buf;
char *dest = start + (section_index * net_device->send_section_size)
@@ -719,7 +719,8 @@ static u32 netvsc_copy_to_send_buf(struct netvsc_device 
*net_device,
packet->page_buf_cnt;
 
/* Add padding */
-   if (skb->xmit_more && remain && !packet->cp_partial) {
+   remain = packet->total_data_buflen & (net_device->pkt_align - 1);
+   if (xmit_more && remain) {
padding = net_device->pkt_align - remain;
rndis_msg->msg_len += padding;
packet->total_data_buflen += padding;
@@ -739,8 +740,6 @@ static u32 netvsc_copy_to_send_buf(struct netvsc_device 
*net_device,
memset(dest, 0, padding);
msg_size += padding;
}
-
-   return msg_size;
 }
 
 static inline int netvsc_send_pkt(
@@ -828,12 +827,13 @@ static inline void move_pkt_msd(struct hv_netvsc_packet 
**msd_send,
 }
 
 /* RCU already held by caller */
-int netvsc_send(struct net_device_context *ndev_ctx,
+int netvsc_send(struct net_device *ndev,
struct hv_netvsc_packet *packet,
struct rndis_message *rndis_msg,
struct hv_page_buffer *pb,
struct sk_buff *skb)
 {
+   struct net_device_context *ndev_ctx = netdev_priv(ndev);
struct netvsc_device *net_device
= rcu_dereference_bh(ndev_ctx->nvdev);
struct hv_device *device = ndev_ctx->device_ctx;
@@ -844,8 +844,7 @@ int netvsc_send(struct net_device_context *ndev_ctx,
struct multi_send_data *msdp;
struct hv_netvsc_packet *msd_send = NULL, *cur_send = NULL;
struct sk_buff *msd_skb = NULL;
-   bool try_batch;
-   bool xmit_more = (skb != NULL) ? skb->xmit_more : false;
+   bool try_batch, xmit_more;
 
/* If device is rescinded, return error and packet will get dropped. */
if (unlikely(!net_device || net_device->destroy))
@@ -896,10 +895,17 @@ int netvsc_send(struct net_device_context *ndev_ctx,
}
}
 
+   /* Keep aggregating only if stack says more data is coming
+* and not doing mixed modes send and not flow blocked
+*/
+   xmit_more = skb->xmit_more &&
+   !packet->cp_partial &&
+   !netif_xmit_stopped(netdev_get_tx_queue(ndev, packet->q_idx));
+
if (section_index != NETVSC_INVALID_INDEX) {
netvsc_copy_to_send_

[PATCH net-stable 21/24] hv_netvsc: Split netvsc_revoke_buf() and netvsc_teardown_gpadl()

2018-05-14 Thread Stephen Hemminger

From: Mohammed Gamal 

commit 7992894c305eaf504d005529637ff8283d0a849d upstream

Split each of the functions into two for each of send/recv buffers.
This will be needed in order to implement a fine-grained messaging
sequence to the host so that we accommodate the requirements of
different Windows versions

Fixes: 0ef58b0a05c12 ("hv_netvsc: change GPAD teardown order on older versions")
Signed-off-by: Mohammed Gamal 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/netvsc.c | 46 +++--
 1 file changed, 34 insertions(+), 12 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index aa9b7a912c31..25fcba506ac5 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -105,11 +105,11 @@ static void free_netvsc_device_rcu(struct netvsc_device 
*nvdev)
call_rcu(&nvdev->rcu, free_netvsc_device);
 }
 
-static void netvsc_revoke_buf(struct hv_device *device,
- struct netvsc_device *net_device)
+static void netvsc_revoke_recv_buf(struct hv_device *device,
+  struct netvsc_device *net_device)
 {
-   struct nvsp_message *revoke_packet;
struct net_device *ndev = hv_get_drvdata(device);
+   struct nvsp_message *revoke_packet;
int ret;
 
/*
@@ -151,6 +151,14 @@ static void netvsc_revoke_buf(struct hv_device *device,
}
net_device->recv_section_cnt = 0;
}
+}
+
+static void netvsc_revoke_send_buf(struct hv_device *device,
+  struct netvsc_device *net_device)
+{
+   struct net_device *ndev = hv_get_drvdata(device);
+   struct nvsp_message *revoke_packet;
+   int ret;
 
/* Deal with the send buffer we may have setup.
 * If we got a  send section size, it means we received a
@@ -194,8 +202,8 @@ static void netvsc_revoke_buf(struct hv_device *device,
}
 }
 
-static void netvsc_teardown_gpadl(struct hv_device *device,
- struct netvsc_device *net_device)
+static void netvsc_teardown_recv_gpadl(struct hv_device *device,
+  struct netvsc_device *net_device)
 {
struct net_device *ndev = hv_get_drvdata(device);
int ret;
@@ -214,6 +222,13 @@ static void netvsc_teardown_gpadl(struct hv_device *device,
}
net_device->recv_buf_gpadl_handle = 0;
}
+}
+
+static void netvsc_teardown_send_gpadl(struct hv_device *device,
+  struct netvsc_device *net_device)
+{
+   struct net_device *ndev = hv_get_drvdata(device);
+   int ret;
 
if (net_device->send_buf_gpadl_handle) {
ret = vmbus_teardown_gpadl(device->channel,
@@ -423,8 +438,10 @@ static int netvsc_init_buf(struct hv_device *device,
goto exit;
 
 cleanup:
-   netvsc_revoke_buf(device, net_device);
-   netvsc_teardown_gpadl(device, net_device);
+   netvsc_revoke_recv_buf(device, net_device);
+   netvsc_revoke_send_buf(device, net_device);
+   netvsc_teardown_recv_gpadl(device, net_device);
+   netvsc_teardown_send_gpadl(device, net_device);
 
 exit:
return ret;
@@ -554,7 +571,8 @@ void netvsc_device_remove(struct hv_device *device)
= rtnl_dereference(net_device_ctx->nvdev);
int i;
 
-   netvsc_revoke_buf(device, net_device);
+   netvsc_revoke_recv_buf(device, net_device);
+   netvsc_revoke_send_buf(device, net_device);
 
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
 
@@ -569,14 +587,18 @@ void netvsc_device_remove(struct hv_device *device)
netdev_dbg(ndev, "net device safe to remove\n");
 
/* older versions require that buffer be revoked before close */
-   if (vmbus_proto_version < VERSION_WIN10)
-   netvsc_teardown_gpadl(device, net_device);
+   if (vmbus_proto_version < VERSION_WIN10) {
+   netvsc_teardown_recv_gpadl(device, net_device);
+   netvsc_teardown_send_gpadl(device, net_device);
+   }
 
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
-   if (vmbus_proto_version >= VERSION_WIN10)
-   netvsc_teardown_gpadl(device, net_device);
+   if (vmbus_proto_version >= VERSION_WIN10) {
+   netvsc_teardown_recv_gpadl(device, net_device);
+   netvsc_teardown_send_gpadl(device, net_device);
+   }
 
/* Release all resources */
free_netvsc_device_rcu(net_device);
-- 
2.17.0

[PATCH net-stable 17/24] hv_netvsc: use RCU to fix concurrent rx and queue changes

2018-05-14 Thread Stephen Hemminger

From: Stephen Hemminger 

commit 02400fcee2542ee334a2394e0d9f6efd969fe782 upstream

The receive processing may continue to happen while the
internal network device state is in RCU grace period.
The internal RNDIS structure is associated with the
internal netvsc_device structure; both have the same
RCU lifetime.

Defer freeing all associated parts until after grace
period.

Fixes: 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split")
Signed-off-by: Stephen Hemminger 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/netvsc.c   | 17 --
 drivers/net/hyperv/rndis_filter.c | 39 ++-
 2 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index a8789dc4fc9b..03328a60377e 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -89,6 +89,11 @@ static void free_netvsc_device(struct rcu_head *head)
= container_of(head, struct netvsc_device, rcu);
int i;
 
+   kfree(nvdev->extension);
+   vfree(nvdev->recv_buf);
+   vfree(nvdev->send_buf);
+   kfree(nvdev->send_section_map);
+
for (i = 0; i < VRSS_CHANNEL_MAX; i++)
vfree(nvdev->chan_table[i].mrc.slots);
 
@@ -210,12 +215,6 @@ static void netvsc_teardown_gpadl(struct hv_device *device,
net_device->recv_buf_gpadl_handle = 0;
}
 
-   if (net_device->recv_buf) {
-   /* Free up the receive buffer */
-   vfree(net_device->recv_buf);
-   net_device->recv_buf = NULL;
-   }
-
if (net_device->send_buf_gpadl_handle) {
ret = vmbus_teardown_gpadl(device->channel,
   net_device->send_buf_gpadl_handle);
@@ -230,12 +229,6 @@ static void netvsc_teardown_gpadl(struct hv_device *device,
}
net_device->send_buf_gpadl_handle = 0;
}
-   if (net_device->send_buf) {
-   /* Free up the send buffer */
-   vfree(net_device->send_buf);
-   net_device->send_buf = NULL;
-   }
-   kfree(net_device->send_section_map);
 }
 
 int netvsc_alloc_recv_comp_ring(struct netvsc_device *net_device, u32 q_idx)
diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index 5a312b2d5a7b..056499014a1f 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -266,13 +266,23 @@ static void rndis_set_link_state(struct rndis_device 
*rdev,
}
 }
 
-static void rndis_filter_receive_response(struct rndis_device *dev,
-  struct rndis_message *resp)
+static void rndis_filter_receive_response(struct net_device *ndev,
+ struct netvsc_device *nvdev,
+ const struct rndis_message *resp)
 {
+   struct rndis_device *dev = nvdev->extension;
struct rndis_request *request = NULL;
bool found = false;
unsigned long flags;
-   struct net_device *ndev = dev->ndev;
+
+   /* This should never happen, it means control message
+* response received after device removed.
+*/
+   if (dev->state == RNDIS_DEV_UNINITIALIZED) {
+   netdev_err(ndev,
+  "got rndis message uninitialized\n");
+   return;
+   }
 
spin_lock_irqsave(&dev->request_lock, flags);
list_for_each_entry(request, &dev->req_list, list_ent) {
@@ -353,7 +363,7 @@ static inline void *rndis_get_ppi(struct rndis_packet 
*rpkt, u32 type)
 }
 
 static int rndis_filter_receive_data(struct net_device *ndev,
-struct rndis_device *dev,
+struct netvsc_device *nvdev,
 struct rndis_message *msg,
 struct vmbus_channel *channel,
 void *data, u32 data_buflen)
@@ -373,7 +383,7 @@ static int rndis_filter_receive_data(struct net_device 
*ndev,
 * should be the data packet size plus the trailer padding size
 */
if (unlikely(data_buflen < rndis_pkt->data_len)) {
-   netdev_err(dev->ndev, "rndis message buffer "
+   netdev_err(ndev, "rndis message buffer "
   "overflow detected (got %u, min %u)"
   "...dropping this message!\n",
   data_buflen, rndis_pkt->data_len);
@@ -401,34 +411,20 @@ int rndis_filter_receive(struct net_device *ndev,
 void *data, u32 buflen)
 {
struct net_device_context *net_device_ctx = netdev_priv(ndev);
-   struct rndis_device *rndis_dev = net_dev->extension;
struct rndis_message *rndis_msg = data;
 
-   /* Make sure the rndis device state is initialized */
-   if (unlikely(!rndis_dev)) {
-   netif_er

[PATCH net-stable 18/24] hv_netvsc: change GPAD teardown order on older versions

2018-05-14 Thread Stephen Hemminger

From: Stephen Hemminger 

commit 0ef58b0a05c127762f975c3dfe8b922e4aa87a29 upstream

On older versions of Windows, the host ignores messages after
vmbus channel is closed.

Workaround this by doing what Windows does and send the teardown
before close on older versions of NVSP protocol.

Reported-by: Mohammed Gamal 
Fixes: 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split")
Signed-off-by: Stephen Hemminger 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/netvsc.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 03328a60377e..3f3c6489ade2 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -570,10 +570,15 @@ void netvsc_device_remove(struct hv_device *device)
 */
netdev_dbg(ndev, "net device safe to remove\n");
 
+   /* older versions require that buffer be revoked before close */
+   if (net_device->nvsp_version < NVSP_PROTOCOL_VERSION_4)
+   netvsc_teardown_gpadl(device, net_device);
+
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
-   netvsc_teardown_gpadl(device, net_device);
+   if (net_device->nvsp_version >= NVSP_PROTOCOL_VERSION_4)
+   netvsc_teardown_gpadl(device, net_device);
 
/* Release all resources */
free_netvsc_device_rcu(net_device);
-- 
2.17.0

[PATCH net-stable 22/24] hv_netvsc: Ensure correct teardown message sequence order

2018-05-14 Thread Stephen Hemminger

From: Mohammed Gamal 

commit a56d99d714665591fed8527b90eef21530ea61e0 upstream

Prior to commit 0cf737808ae7 ("hv_netvsc: netvsc_teardown_gpadl() split")
the call sequence in netvsc_device_remove() was as follows (as
implemented in netvsc_destroy_buf()):
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Teardown receive buffer GPADL
3- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
4- Teardown send buffer GPADL
5- Close vmbus

This didn't work for WS2016 hosts. Commit 0cf737808ae7
("hv_netvsc: netvsc_teardown_gpadl() split") rearranged the
teardown sequence as follows:
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
3- Close vmbus
4- Teardown receive buffer GPADL
5- Teardown send buffer GPADL

That worked well for WS2016 hosts, but it prevented guests on older hosts from
shutting down after changing network settings. Commit 0ef58b0a05c1
("hv_netvsc: change GPAD teardown order on older versions") ensured the
following message sequence for older hosts
1- Send NVSP_MSG1_TYPE_REVOKE_RECV_BUF message
2- Send NVSP_MSG1_TYPE_REVOKE_SEND_BUF message
3- Teardown receive buffer GPADL
4- Teardown send buffer GPADL
5- Close vmbus

However, with this sequence calling `ip link set eth0 mtu 1000` hangs and the
process becomes uninterruptible. On futher analysis it turns out that on tearing
down the receive buffer GPADL the kernel is waiting indefinitely
in vmbus_teardown_gpadl() for a completion to be signaled.

Here is a snippet of where this occurs:
int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
{
struct vmbus_channel_gpadl_teardown *msg;
struct vmbus_channel_msginfo *info;
unsigned long flags;
int ret;

info = kmalloc(sizeof(*info) +
   sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL);
if (!info)
return -ENOMEM;

init_completion(&info->waitevent);
info->waiting_channel = channel;
[]
ret = vmbus_post_msg(msg, sizeof(struct vmbus_channel_gpadl_teardown),
 true);

if (ret)
goto post_msg_err;

wait_for_completion(&info->waitevent);
[]
}

The completion is signaled from vmbus_ongpadl_torndown(), which gets called when
the corresponding message is received from the host, which apparently never 
happens
in that case.
This patch works around the issue by restoring the first mentioned message 
sequence
for older hosts

Fixes: 0ef58b0a05c1 ("hv_netvsc: change GPAD teardown order on older versions")
Signed-off-by: Mohammed Gamal 
Signed-off-by: David S. Miller 
---
 drivers/net/hyperv/netvsc.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 25fcba506ac5..99be63eacaeb 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -571,8 +571,17 @@ void netvsc_device_remove(struct hv_device *device)
= rtnl_dereference(net_device_ctx->nvdev);
int i;
 
+   /*
+* Revoke receive buffer. If host is pre-Win2016 then tear down
+* receive buffer GPADL. Do the same for send buffer.
+*/
netvsc_revoke_recv_buf(device, net_device);
+   if (vmbus_proto_version < VERSION_WIN10)
+   netvsc_teardown_recv_gpadl(device, net_device);
+
netvsc_revoke_send_buf(device, net_device);
+   if (vmbus_proto_version < VERSION_WIN10)
+   netvsc_teardown_send_gpadl(device, net_device);
 
RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
 
@@ -586,15 +595,13 @@ void netvsc_device_remove(struct hv_device *device)
 */
netdev_dbg(ndev, "net device safe to remove\n");
 
-   /* older versions require that buffer be revoked before close */
-   if (vmbus_proto_version < VERSION_WIN10) {
-   netvsc_teardown_recv_gpadl(device, net_device);
-   netvsc_teardown_send_gpadl(device, net_device);
-   }
-
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
+   /*
+* If host is Win2016 or higher then we do the GPADL tear down
+* here after VMBus is closed.
+   */
if (vmbus_proto_version >= VERSION_WIN10) {
netvsc_teardown_recv_gpadl(device, net_device);
netvsc_teardown_send_gpadl(device, net_device);
-- 
2.17.0

1 2 3 4 >

1 - 100 of 374 matches

Mail list logo