Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks

2018-04-24 Thread Eric W. Biederman
Christian Brauner  writes:

> Now that it's possible to have a different set of uevents in different
> network namespaces, per-network namespace uevent sequence numbers are
> introduced. This increases performance as locking is now restricted to the
> network namespace affected by the uevent rather than locking everything.
> Testing revealed significant performance improvements. For details see
> "Testing" below.

Maybe.  Your locking is wrong, and a few other things are wrong.  see
below.

> Since commit 692ec06 ("netns: send uevent messages") network namespaces not
> owned by the initial user namespace can be sent uevents from a sufficiently
> privileged userspace process.
> In order to send a uevent into a network namespace not owned by the initial
> user namespace we currently still need to take the *global mutex* that
> locks the uevent socket list even though the list *only contains network
> namespaces owned by the initial user namespace*. This needs to be done
> because the uevent counter is a global variable. Taking the global lock is
> performance sensitive since a user on the host can spawn a pool of n
> processes that each create their own new user and network namespaces and then
> go on to inject uevents in parallel into the network namespace of all of
> these processes. This can have a significant performance impact for the
> host's udevd since it means that there can be a lot of delay between a
> device being added and the corresponding uevent being sent out and
> available for processing by udevd. It also means that each network
> namespace not owned by the initial user namespace which userspace has sent
> a uevent to will need to wait until the lock becomes available.
>
> Implementation:
> This patch gives each network namespace its own uevent sequence number.
> Each network namespace not owned by the initial user namespace receives its
> own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
> so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
> file it is clearly documented which lock has to be taken. All network
> namespaces owned by the initial user namespace will still share the same
> lock since they are all served sequentially via the uevent socket list.
> This decouples the locking and ensures that the host retrieves uevents as
> fast as possible even if there are a lot of uevents injected into network
> namespaces not owned by the initial user namespace.  In addition, each
> network namespace not owned by the initial user namespace does not have to
> wait on any other network namespace not sharing the same user namespace.
>
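For orientation, here is a minimal sketch of the idea being proposed; the field
and helper names are illustrative only, not the actual patch under review:

/* Illustrative sketch -- not the patch itself. */
struct uevent_sock {
	struct list_head list;	/* on uevent_sock_list (init-user-ns netns only) */
	struct sock *sk;
	struct mutex lock;	/* serializes uevents for one non-init netns */
};

/* struct net would carry its own counter, e.g.:  u64 uevent_seqnum; */

static u64 uevent_next_seqnum(struct net *net)
{
	/*
	 * Callers would hold either the global uevent_sock_mutex (namespaces
	 * owned by init_user_ns, served via uevent_sock_list) or the
	 * per-namespace uevent_sock mutex (all other namespaces).
	 */
	return ++net->uevent_seqnum;
}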
> Testing:
> Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
> with decoupled locking and one without. To ensure that testing made sense
> both kernels carried the patch to remove network namespaces not owned by
> the initial user namespace from the uevent socket list.
> Three tests were constructed. All of them showed significant performance
> improvements with per-netns uevent sequence numbers and decoupled locking.
>
>  # Testcase 1:
>Only Injecting Uevents into network namespaces not owned by the initial
>user namespace.
>- created 1000 new user namespace + network namespace pairs
>- opened a uevent listener in each of those namespace pairs
>- injected uevents into each of those network namespaces 10,000 times
>  meaning 10,000,000 (10 million) uevents were injected. (The high
>  number of uevent injections should get rid of a lot of jitter.)
>  The injection was done by fork()ing 1000 uevent injectors in a simple
>  for-loop to ensure that uevents were injected in parallel.
>- mean transaction time was calculated:
>  - *without* uevent sequence number namespacing: 67 μs
>  - *with* uevent sequence number namespacing: 55 μs
>  - makes a difference of: 12 μs
>- a t-test was performed on the two data vectors which revealed
>  significant performance improvements:
>  Welch Two Sample t-test
>  data:  x1 and y1
>  t = 405.16, df = 18883000, p-value < 2.2e-16
>  alternative hypothesis: true difference in means is not equal to 0
>  95 percent confidence interval:
>  12.14949 12.26761
>  sample estimates:
>  mean of x mean of y
>  68.48594  56.27739
>
>  # Testcase 2:
>Injecting Uevents into network namespaces not owned by the initial user
>namespace and network namespaces owned by the initial user namespace.
>- created 500 new user namespace + network namespace pairs
>- created 500 new network namespace pairs
>- opened a uevent listener in each of those namespace pairs
>- injected uevents into each of those network namespaces 10,000 times
>  meaning 10,000,000 (10 million) uevents were injected. (The high
>  number of uevent injections should get rid of a lot of jitter.)
>  The injection was done by fork()ing 1000 uevent injectors in a 

Re: [PATCH bpf-next v5 0/2] bpf: allow map helpers access to map values directly

2018-04-24 Thread Daniel Borkmann
On 04/24/2018 03:06 PM, Paul Chaignon wrote:
> Currently, helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE
> can only access stack and packet memory.  This patchset allows these
> helpers to directly access map values by passing registers of type
> PTR_TO_MAP_VALUE.
> 
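As a rough illustration of what this enables (a sketch in the style of the
selftests of that era, not one of the added test cases; map names and sizes are
made up): a pointer returned by one map lookup can now be passed directly as
the key argument of a second lookup.

/* Sketch only -- illustrative program, not part of the patchset. */
#include <linux/bpf.h>
#include "bpf_helpers.h"	/* selftests-local helper header of that era */

struct bpf_map_def SEC("maps") map_in = {
	.type		= BPF_MAP_TYPE_ARRAY,
	.key_size	= sizeof(__u32),
	.value_size	= sizeof(__u32),
	.max_entries	= 1,
};

struct bpf_map_def SEC("maps") map_out = {
	.type		= BPF_MAP_TYPE_HASH,
	.key_size	= sizeof(__u32),
	.value_size	= sizeof(__u64),
	.max_entries	= 16,
};

SEC("socket")
int use_value_as_key(struct __sk_buff *skb)
{
	__u32 zero = 0;
	__u32 *inner_key;

	inner_key = bpf_map_lookup_elem(&map_in, &zero);
	if (!inner_key)
		return 0;
	/* inner_key is PTR_TO_MAP_VALUE: before this series the verifier
	 * only accepted stack or packet memory here. */
	bpf_map_lookup_elem(&map_out, inner_key);
	return 0;
}

char _license[] SEC("license") = "GPL";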
> The first patch changes the verifier; the second adds new test cases.
> 
> The first three versions of this patchset were sent on the iovisor-dev
> mailing list only.
> 
> Changelogs:
>   Changes in v5:
> - Refactor using check_helper_mem_access.
>   Changes in v4:
> - Rebase.
>   Changes in v3:
> - Bug fixes.
> - Negative test cases.
>   Changes in v2:
> - Additional test cases for adjusted maps.

Applied to bpf-next, thanks Paul!



Re: [PATCH v2] selftests: bpf: update .gitignore with missing file

2018-04-24 Thread Daniel Borkmann
On 04/24/2018 12:53 AM, Anders Roxell wrote:
> Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests")
> Signed-off-by: Anders Roxell 
> ---
> Rebased against bpf-next.

Applied to bpf-next, thanks Anders!


Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Eric W. Biederman

We already do this in practice in userspace.  It doesn't make much
sense to perform this delivery.  So we might as well make this optimization.

Christian Brauner  writes:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk a little. We have now reached the point where hotplug events for all
> devices that carry a namespace tag are filtered according to that
> namespace. Specifically, they are filtered whenever the namespace tag of
> the kobject does not match the namespace tag of the netlink socket. One
> example is network devices. Uevents for network devices only show up in
> the network namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will broadcast it into all
> network namespaces. This behavior stopped making sense when user namespaces
> were introduced.
>
> This patch restricts uevents to the initial user namespace for a couple of
> reasons that have been extensively discussed on the mailing list [1].
> - Thundering herd:
>   Broadcasting uevents into all network namespaces introduces significant
>   overhead.
>   All processes that listen to uevents running in non-initial user
>   namespaces will end up responding to uevents that will be meaningless to
>   them. Mainly, because non-initial user namespaces cannot easily manage
>   devices unless they have a privileged host-process helping them out. This
>   means that there will be a thundering herd of activity when there
>   shouldn't be any.
> - Uevents from non-root users are already filtered in userspace:
>   Uevents are filtered by userspace in a user namespace because the
>   received uid != 0. Instead the uid associated with the event will be
>   65534 == "nobody" because the global root uid is not mapped.
>   This means we can safely and without introducing regressions modify the
>   kernel to not send uevents into all network namespaces whose owning user
>   namespace is not the initial user namespace because we know that
>   userspace will ignore the message because of the uid anyway. I have
>   a) verified that it is true for every udev implementation out there b)
>   that this behavior has been present in all udev implementations from the
>   very beginning.
> - Removing needless overhead/Increasing performance:
>   Currently, the uevent socket for each network namespace is added to the
>   global variable uevent_sock_list. The list itself needs to be protected
>   by a mutex. So every time a uevent is generated the mutex is taken on the
>   list. The mutex is held *from the creation of the uevent (memory
>   allocation, string creation, etc.) until all uevent sockets have been
>   handled*. This is aggravated by the fact that for each uevent socket that
>   has listeners the mc_list must be walked as well which means we're
>   talking O(n^2) here. Given that a standard Linux workload usually has
>   quite a lot of network namespaces and - in the face of containers - a lot
>   of user namespaces this quickly becomes a performance problem (see
>   "Thundering herd" above). By just recording uevent sockets of network
>   namespaces that are owned by the initial user namespace we significantly
>   increase performance in this codepath.
> - Injecting uevents:
>   There's a valid argument that containers might be interested in receiving
>   device events especially if they are delegated to them by a privileged
>   userspace process. One prime example are SR-IOV enabled devices that are
>   explicitly designed to be handed off to other users such as VMs or
>   containers.
>   This use-case can now be correctly handled since
>   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
>   introduced the ability to send uevents from userspace. As such we can let
>   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
>   the network namespace of the netlink socket) userspace process make a
>   decision what uevents should be sent. This removes the need to blindly
>   broadcast uevents into all user namespaces and provides a performant and
>   safe solution to this problem.
> - Filtering logic:
>   This patch filters by *owning user namespace of the network namespace a
>   given task resides in* and not by user namespace of the task per se. This
>   means if the user namespace of a given task is unshared but the network
>   namespace is kept and is owned by the initial user namespace a listener
>   that is opening the uevent socket in that network namespace can still
>   listen to uevents.
>
> [1]: https://lkml.org/lkml/2018/4/4/739
> Signed-off-by: Christian Brauner 
Acked-by: "Eric W. Biederman" 
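The core of the change can be sketched as follows (illustrative only, not the
literal hunk): the uevent socket of a network namespace joins the globally
walked uevent_sock_list only when that namespace is owned by the initial user
namespace.

/* Sketch, not the literal patch. */
static int uevent_net_init(struct net *net)
{
	struct uevent_sock *ue_sk;
	struct netlink_kernel_cfg cfg = {
		.groups	= 1,
		.flags	= NL_CFG_F_NONROOT_RECV,
	};

	ue_sk = kzalloc(sizeof(*ue_sk), GFP_KERNEL);
	if (!ue_sk)
		return -ENOMEM;

	ue_sk->sk = netlink_kernel_create(net, NETLINK_KOBJECT_UEVENT, &cfg);
	if (!ue_sk->sk) {
		kfree(ue_sk);
		return -ENODEV;
	}

	/* Only init-user-ns-owned namespaces become broadcast targets. */
	if (net->user_ns == &init_user_ns) {
		mutex_lock(&uevent_sock_mutex);
		list_add_tail(&ue_sk->list, &uevent_sock_list);
		mutex_unlock(&uevent_sock_mutex);
	}
	return 0;
}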

> ---
> Changelog v1->v2:
> * patch unchanged
> Changelog v0->v1:
> * patch unchanged
> ---
>  lib/kobject_uev

Re: [PATCH bpf-next] tools/bpf: remove test_sock_addr from TEST_GEN_PROGS

2018-04-24 Thread Daniel Borkmann
On 04/24/2018 11:45 PM, Yonghong Song wrote:
> Since test_sock_addr is not supposed to run by itself,
> remove it from TEST_GEN_PROGS and add it to
> TEST_GEN_PROGS_EXTENDED. This way, run_tests will
> not run test_sock_addr. The corresponding test to run
> is test_sock_addr.sh.
> 
> Signed-off-by: Yonghong Song 

Applied to bpf-next, thanks Yonghong!


[PATCH] selftests: net: add in_netns.sh TEST_GEN_PROGS_EXTENDED

2018-04-24 Thread Anders Roxell
Script in_netns.sh is a utility function and not its own test so it
shouldn't be part of TEST_PROGS. in_netns.sh gets used by
run_afpackettests.
To install in_netns.sh without it being added to the main run_kselftest.sh
script, use the TEST_GEN_PROGS_EXTENDED variable.

Fixes: 5ff9c1a3dd92 ("selftests: net: add in_netns.sh to TEST_PROGS")
Signed-off-by: Anders Roxell 
---
 tools/testing/selftests/net/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index c3761c35f542..4a8cfe8071a7 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -5,7 +5,8 @@ CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
-TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh pmtu.sh
+TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh
+TEST_GEN_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
 TEST_GEN_FILES += tcp_mmap
-- 
2.11.0



Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks

2018-04-24 Thread Christian Brauner
On Tue, Apr 24, 2018 at 04:52:20PM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > Now that it's possible to have a different set of uevents in different
> > network namespaces, per-network namespace uevent sequence numbers are
> > introduced. This increases performance as locking is now restricted to the
> > network namespace affected by the uevent rather than locking everything.
> > Testing revealed significant performance improvements. For details see
> > "Testing" below.
> 
> Maybe.  Your locking is wrong, and a few other things are wrong.  see
> below.

Thanks for the review! Happy to rework this until it's in a mergeable shape.

> 
> > Since commit 692ec06 ("netns: send uevent messages") network namespaces not
> > owned by the initial user namespace can be sent uevents from a sufficiently
> > privileged userspace process.
> > In order to send a uevent into a network namespace not owned by the initial
> > user namespace we currently still need to take the *global mutex* that
> > locks the uevent socket list even though the list *only contains network
> > namespaces owned by the initial user namespace*. This needs to be done
> > because the uevent counter is a global variable. Taking the global lock is
> > performance sensitive since a user on the host can spawn a pool of n
> > processes that each create their own new user and network namespaces and then
> > go on to inject uevents in parallel into the network namespace of all of
> > these processes. This can have a significant performance impact for the
> > host's udevd since it means that there can be a lot of delay between a
> > device being added and the corresponding uevent being sent out and
> > available for processing by udevd. It also means that each network
> > namespace not owned by the initial user namespace which userspace has sent
> > a uevent to will need to wait until the lock becomes available.
> >
> > Implementation:
> > This patch gives each network namespace its own uevent sequence number.
> > Each network namespace not owned by the initial user namespace receives its
> > own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
> > so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
> > file it is clearly documented which lock has to be taken. All network
> > namespaces owned by the initial user namespace will still share the same
> > lock since they are all served sequentially via the uevent socket list.
> > This decouples the locking and ensures that the host retrieves uevents as
> > fast as possible even if there are a lot of uevents injected into network
> > namespaces not owned by the initial user namespace.  In addition, each
> > network namespace not owned by the initial user namespace does not have to
> > wait on any other network namespace not sharing the same user namespace.
> >
> > Testing:
> > Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
> > with decoupled locking and one without. To ensure that testing made sense
> > both kernels carried the patch to remove network namespaces not owned by
> > the initial user namespace from the uevent socket list.
> > Three tests were constructed. All of them showed significant performance
> > improvements with per-netns uevent sequence numbers and decoupled locking.
> >
> >  # Testcase 1:
> >Only Injecting Uevents into network namespaces not owned by the initial
> >user namespace.
> >- created 1000 new user namespace + network namespace pairs
> >- opened a uevent listener in each of those namespace pairs
> >- injected uevents into each of those network namespaces 10,000 times
> >  meaning 10,000,000 (10 million) uevents were injected. (The high
> >  number of uevent injections should get rid of a lot of jitter.)
> >  The injection was done by fork()ing 1000 uevent injectors in a simple
> >  for-loop to ensure that uevents were injected in parallel.
> >- mean transaction time was calculated:
> >  - *without* uevent sequence number namespacing: 67 μs
> >  - *with* uevent sequence number namespacing: 55 μs
> >  - makes a difference of: 12 μs
> >- a t-test was performed on the two data vectors which revealed
> >  significant performance improvements:
> >  Welch Two Sample t-test
> >  data:  x1 and y1
> >  t = 405.16, df = 18883000, p-value < 2.2e-16
> >  alternative hypothesis: true difference in means is not equal to 0
> >  95 percent confidence interval:
> >  12.14949 12.26761
> >  sample estimates:
> >  mean of x mean of y
> >  68.48594  56.27739
> >
> >  # Testcase 2:
> >Injecting Uevents into network namespaces not owned by the initial user
> >namespace and network namespaces owned by the initial user namespace.
> >- created 500 new user namespace + network namespace pairs
> >- created 500 new network namespace pairs
> >- opened a uevent listener in each

Re: [bpf-next PATCH 0/4] selftests for BPF sockmap use cases

2018-04-24 Thread Daniel Borkmann
On 04/23/2018 11:30 PM, John Fastabend wrote:
> This series moves ./samples/sockmap into BPF selftests. There are a
> few good reasons to do this. First, by pushing this into selftests
> the tests will be run automatically. Second, sockmap was not really
> a sample of anything anymore, but rather a large set of tests.
> 
> Note: There are three recent fixes outstanding against bpf branch
> that can be detected occasionally by the automated tests here.
> 
> https://patchwork.ozlabs.org/patch/903138/
> https://patchwork.ozlabs.org/patch/903139/
> https://patchwork.ozlabs.org/patch/903140/
> 
> ---
> 
> John Fastabend (4):
>   bpf: sockmap, code sockmap_test in C
>   bpf: sockmap, add a set of tests to run by default
>   bpf: sockmap, add selftests
>   bpf: sockmap, remove samples program

Applied to bpf-next, thanks John!


Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV

2018-04-24 Thread Alexander Duyck
On Tue, Apr 24, 2018 at 2:51 PM, Bjorn Helgaas  wrote:
> On Sat, Apr 21, 2018 at 05:22:27PM -0700, Alexander Duyck wrote:
>> On Sat, Apr 21, 2018 at 1:34 PM, Bjorn Helgaas  wrote:
>
>> > For example, I'm not sure what you mean by "devices where the PF is
>> > not capable of managing VF resources."
>> >
>> > It *sounds* like you're saying the hardware works differently on some
>> > devices, but I don't think that's what you mean.  I think you're
>> > saying something about which drivers are used for the PF and the VF.
>>
>> That is sort of what I am saying.
>>
>> So for example with ixgbe there is functionality which is controlled
>> in the MMIO space of the PF that affects the functionality of the VFs
>> that are generated on the device. The PF has to rearrange the
>> resources such as queues and interrupts on the device before it can
>> enable SR-IOV, and it could alter those later to limit what the VF is
>> capable of doing.
>>
>> The model I am dealing with via this patch set has a PF that is not
>> much different than the VFs other than the fact that it has some
>> extended configuration space bits in place for SR-IOV, ARI, ACS, and
>> whatever other bits are needed in order to support spawning isolated
>> VFs.
>
> OK, thanks for the explanation, I think I understand what's going on
> now, correct me if I'm mistaken.  I added a hint about "PF" for Randy,
> too.
>
> These are on pci/virtualization for v4.18.

I reviewed them, and all of the changes to patches 1 & 2, both below
and in the tree, look good to me.

Thanks for taking care of all this.

- Alex

> commit 8effc395c209
> Author: Alexander Duyck 
> Date:   Sat Apr 21 15:23:09 2018 -0500
>
> PCI/IOV: Add pci_sriov_configure_simple()
>
> SR-IOV (Single Root I/O Virtualization) is an optional PCIe capability (see
> PCIe r4.0, sec 9).  A PCIe Function with the SR-IOV capability is referred
> to as a PF (Physical Function).  If SR-IOV is enabled on the PF, several
> VFs (Virtual Functions) may be created.  The VFs can be individually
> assigned to virtual machines, which allows them to share a single hardware
> device while being isolated from each other.
>
> Some SR-IOV devices have resources such as queues and interrupts that must
> be set up in the PF before enabling the VFs, so they require a PF driver 
> to
> do that.
>
> Other SR-IOV devices don't require any PF setup before enabling VFs.  Add a
> pci_sriov_configure_simple() interface so PF drivers for such devices can
> use it without repeating the VF-enabling code.
>
> Tested-by: Mark Rustad 
> Signed-off-by: Alexander Duyck 
> [bhelgaas: changelog, comment]
> Signed-off-by: Bjorn Helgaas 
> Reviewed-by: Greg Rose 
> Reviewed-by: Christoph Hellwig 
>
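For illustration, this is roughly how a PF driver would hook the helper up (a
sketch; the driver name and PCI IDs below are placeholders, not a real device):

/* Sketch of a PF driver relying on pci_sriov_configure_simple(). */
#include <linux/module.h>
#include <linux/pci.h>

static int example_pf_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
	/* Nothing to set up: the PF exists only so that VFs can be enabled. */
	return 0;
}

static const struct pci_device_id example_pf_ids[] = {
	{ PCI_DEVICE(0x1234, 0x5678) },	/* placeholder vendor/device */
	{ /* sentinel */ }
};
MODULE_DEVICE_TABLE(pci, example_pf_ids);

static struct pci_driver example_pf_driver = {
	.name			= "example-pf",
	.id_table		= example_pf_ids,
	.probe			= example_pf_probe,
	/* "echo N > sriov_numvfs" in sysfs ends up here. */
	.sriov_configure	= pci_sriov_configure_simple,
};
module_pci_driver(example_pf_driver);
MODULE_LICENSE("GPL");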
> commit a8ccf8a3
> Author: Alexander Duyck 
> Date:   Tue Apr 24 16:47:16 2018 -0500
>
> PCI/IOV: Add pci-pf-stub driver for PFs that only enable VFs
>
> Some SR-IOV PF devices provide no functionality other than acting as a
> means of enabling VFs.  For these devices, we want to enable the VFs and
> assign them to guest virtual machines, but there's no need to have a driver
> for the PF itself.
>
> Add a new pci-pf-stub driver to claim those PF devices and provide the
> generic VF enable functionality.  An administrator can use the sysfs
> "sriov_numvfs" file to enable VFs, then assign them to guests.
>
> For now I only have one example ID provided by Amazon in terms of devices
> that require this functionality.  The general idea is that in the future we
> will see other devices added as vendors come up with devices where the PF
> is more or less just a lightweight shim used to allocate VFs.
>
> Signed-off-by: Alexander Duyck 
> [bhelgaas: changelog]
> Signed-off-by: Bjorn Helgaas 
> Reviewed-by: Greg Rose 
> Reviewed-by: Christoph Hellwig 
>
> commit 115ddc491922
> Author: Alexander Duyck 
> Date:   Tue Apr 24 16:47:22 2018 -0500
>
> net: ena: Use pci_sriov_configure_simple() to enable VFs
>
> Instead of implementing our own version of a SR-IOV configuration stub in
> the ena driver, use the existing pci_sriov_configure_simple() function.
>
> Signed-off-by: Alexander Duyck 
> Signed-off-by: Bjorn Helgaas 
> Reviewed-by: Greg Rose 
> Reviewed-by: Christoph Hellwig 
>
> commit 74d986abc20b
> Author: Alexander Duyck 
> Date:   Tue Apr 24 16:47:27 2018 -0500
>
> nvme-pci: Use pci_sriov_configure_simple() to enable VFs
>
> Instead of implementing our own version of a SR-IOV configuration stub in
> the nvme driver, use the existing pci_sriov_configure_simple() function.
>
> Signed-off-by: Alexander Duyck 
> Signed-off-by: Bjorn Helgaas 
> Reviewed-by: Christoph Hellwig 


Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Eric W. Biederman

Bah.  This code is obviously correct and probably wrong.

How do we deliver uevents for network devices that are outside of the
initial user namespace?  The kernel still needs to deliver those.

The logic to figure out which network namespace a device needs to be
delivered to is present in kobj_bcast_filter.  That logic will almost
certainly need to be turned inside out.  Sigh, not as easy as I would
have hoped.

Eric

Christian Brauner  writes:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk a little. We have now reached the point where hotplug events for all
> devices that carry a namespace tag are filtered according to that
> namespace. Specifically, they are filtered whenever the namespace tag of
> the kobject does not match the namespace tag of the netlink socket. One
> example is network devices. Uevents for network devices only show up in
> the network namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will broadcast it into all
> network namespaces. This behavior stopped making sense when user namespaces
> were introduced.
>
> This patch restricts uevents to the initial user namespace for a couple of
> reasons that have been extensively discussed on the mailing list [1].
> - Thundering herd:
>   Broadcasting uevents into all network namespaces introduces significant
>   overhead.
>   All processes that listen to uevents running in non-initial user
>   namespaces will end up responding to uevents that will be meaningless to
>   them. Mainly, because non-initial user namespaces cannot easily manage
>   devices unless they have a privileged host-process helping them out. This
>   means that there will be a thundering herd of activity when there
>   shouldn't be any.
> - Uevents from non-root users are already filtered in userspace:
>   Uevents are filtered by userspace in a user namespace because the
>   received uid != 0. Instead the uid associated with the event will be
>   65534 == "nobody" because the global root uid is not mapped.
>   This means we can safely and without introducing regressions modify the
>   kernel to not send uevents into all network namespaces whose owning user
>   namespace is not the initial user namespace because we know that
>   userspace will ignore the message because of the uid anyway. I have
>   a) verified that it is true for every udev implementation out there b)
>   that this behavior has been present in all udev implementations from the
>   very beginning.
> - Removing needless overhead/Increasing performance:
>   Currently, the uevent socket for each network namespace is added to the
>   global variable uevent_sock_list. The list itself needs to be protected
>   by a mutex. So every time a uevent is generated the mutex is taken on the
>   list. The mutex is held *from the creation of the uevent (memory
>   allocation, string creation, etc.) until all uevent sockets have been
>   handled*. This is aggravated by the fact that for each uevent socket that
>   has listeners the mc_list must be walked as well which means we're
>   talking O(n^2) here. Given that a standard Linux workload usually has
>   quite a lot of network namespaces and - in the face of containers - a lot
>   of user namespaces this quickly becomes a performance problem (see
>   "Thundering herd" above). By just recording uevent sockets of network
>   namespaces that are owned by the initial user namespace we significantly
>   increase performance in this codepath.
> - Injecting uevents:
>   There's a valid argument that containers might be interested in receiving
>   device events especially if they are delegated to them by a privileged
>   userspace process. One prime example are SR-IOV enabled devices that are
>   explicitly designed to be handed off to other users such as VMs or
>   containers.
>   This use-case can now be correctly handled since
>   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
>   introduced the ability to send uevents from userspace. As such we can let
>   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
>   the network namespace of the netlink socket) userspace process make a
>   decision what uevents should be sent. This removes the need to blindly
>   broadcast uevents into all user namespaces and provides a performant and
>   safe solution to this problem.
> - Filtering logic:
>   This patch filters by *owning user namespace of the network namespace a
>   given task resides in* and not by user namespace of the task per se. This
>   means if the user namespace of a given task is unshared but the network
>   namespace is kept and is owned by the initial user namespace a listener
>   that is opening the uevent socket i

Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks

2018-04-24 Thread Eric W. Biederman
Christian Brauner  writes:

> On Tue, Apr 24, 2018 at 04:52:20PM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > Now that it's possible to have a different set of uevents in different
>> > network namespaces, per-network namespace uevent sequence numbers are
>> > introduced. This increases performance as locking is now restricted to the
>> > network namespace affected by the uevent rather than locking everything.
>> > Testing revealed significant performance improvements. For details see
>> > "Testing" below.
>> 
>> Maybe.  Your locking is wrong, and a few other things are wrong.  see
>> below.
>
> Thanks for the review! Happy to rework this until it's in a mergeable shape.
>
>> 
>> > Since commit 692ec06 ("netns: send uevent messages") network namespaces not
>> > owned by the initial user namespace can be sent uevents from a sufficiently
>> > privileged userspace process.
>> > In order to send a uevent into a network namespace not owned by the initial
>> > user namespace we currently still need to take the *global mutex* that
>> > locks the uevent socket list even though the list *only contains network
>> > namespaces owned by the initial user namespace*. This needs to be done
>> > because the uevent counter is a global variable. Taking the global lock is
>> > performance sensitive since a user on the host can spawn a pool of n
>> > processes that each create their own new user and network namespaces and then
>> > go on to inject uevents in parallel into the network namespace of all of
>> > these processes. This can have a significant performance impact for the
>> > host's udevd since it means that there can be a lot of delay between a
>> > device being added and the corresponding uevent being sent out and
>> > available for processing by udevd. It also means that each network
>> > namespace not owned by the initial user namespace which userspace has sent
>> > a uevent to will need to wait until the lock becomes available.
>> >
>> > Implementation:
>> > This patch gives each network namespace its own uevent sequence number.
>> > Each network namespace not owned by the initial user namespace receives its
>> > own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
>> > so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
>> > file it is clearly documented which lock has to be taken. All network
>> > namespaces owned by the initial user namespace will still share the same
>> > lock since they are all served sequentially via the uevent socket list.
>> > This decouples the locking and ensures that the host retrieves uevents as
>> > fast as possible even if there are a lot of uevents injected into network
>> > namespaces not owned by the initial user namespace.  In addition, each
>> > network namespace not owned by the initial user namespace does not have to
>> > wait on any other network namespace not sharing the same user namespace.
>> >
>> > Testing:
>> > Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
>> > with decoupled locking and one without. To ensure that testing made sense
>> > both kernels carried the patch to remove network namespaces not owned by
>> > the initial user namespace from the uevent socket list.
>> > Three tests were constructed. All of them showed significant performance
>> > improvements with per-netns uevent sequence numbers and decoupled locking.
>> >
>> >  # Testcase 1:
>> >Only Injecting Uevents into network namespaces not owned by the initial
>> >user namespace.
>> >- created 1000 new user namespace + network namespace pairs
>> >- opened a uevent listener in each of those namespace pairs
>> >- injected uevents into each of those network namespaces 10,000 times
>> >  meaning 10,000,000 (10 million) uevents were injected. (The high
>> >  number of uevent injections should get rid of a lot of jitter.)
>> >  The injection was done by fork()ing 1000 uevent injectors in a simple
>> >  for-loop to ensure that uevents were injected in parallel.
>> >- mean transaction time was calculated:
>> >  - *without* uevent sequence number namespacing: 67 μs
>> >  - *with* uevent sequence number namespacing: 55 μs
>> >  - makes a difference of: 12 μs
>> >- a t-test was performed on the two data vectors which revealed
>> >  significant performance improvements:
>> >  Welch Two Sample t-test
>> >  data:  x1 and y1
>> >  t = 405.16, df = 18883000, p-value < 2.2e-16
>> >  alternative hypothesis: true difference in means is not equal to 0
>> >  95 percent confidence interval:
>> >  12.14949 12.26761
>> >  sample estimates:
>> >  mean of x mean of y
>> >  68.48594  56.27739
>> >
>> >  # Testcase 2:
>> >Injecting Uevents into network namespaces not owned by the initial user
>> >namespace and network namespaces owned by the initial user namespace.
>> >- created 500 new user namespace 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Christian Brauner
On Tue, Apr 24, 2018 at 05:40:07PM -0500, Eric W. Biederman wrote:
> 
> Bah.  This code is obviously correct and probably wrong.
> 
> How do we deliver uevents for network devices that are outside of the
> initial user namespace?  The kernel still needs to deliver those.
> 
> The logic to figure out which network namespace a device needs to be
> delivered to is present in kobj_bcast_filter. That logic will almost
> certainly need to be turned inside out. Sigh, not as easy as I would
> have hoped.

That's why my initial patch [1] added additional filtering logic to
kobj_bcast_filter(). But since we care about performance improvements as
well I can come up with a patch that moves this logic out of
kobj_bcast_filter().

Christian
[1]: https://www.spinics.net/lists/netdev/msg494487.html

> 
> Eric
> 
> Christian Brauner  writes:
> > commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >
> > enabled sending hotplug events into all network namespaces back in 2010.
> > Over time the set of uevents that get sent into all network namespaces has
> > shrunk a little. We have now reached the point where hotplug events for all
> > devices that carry a namespace tag are filtered according to that
> > namespace. Specifically, they are filtered whenever the namespace tag of
> > the kobject does not match the namespace tag of the netlink socket. One
> > example is network devices. Uevents for network devices only show up in
> > the network namespaces these devices are moved to or created in.
> >
> > However, any uevent for a kobject that does not have a namespace tag
> > associated with it will not be filtered and we will broadcast it into all
> > network namespaces. This behavior stopped making sense when user namespaces
> > were introduced.
> >
> > This patch restricts uevents to the initial user namespace for a couple of
> > reasons that have been extensively discussed on the mailing list [1].
> > - Thundering herd:
> >   Broadcasting uevents into all network namespaces introduces significant
> >   overhead.
> >   All processes that listen to uevents running in non-initial user
> >   namespaces will end up responding to uevents that will be meaningless to
> >   them. Mainly, because non-initial user namespaces cannot easily manage
> >   devices unless they have a privileged host-process helping them out. This
> >   means that there will be a thundering herd of activity when there
> >   shouldn't be any.
> > - Uevents from non-root users are already filtered in userspace:
> >   Uevents are filtered by userspace in a user namespace because the
> >   received uid != 0. Instead the uid associated with the event will be
> >   65534 == "nobody" because the global root uid is not mapped.
> >   This means we can safely and without introducing regressions modify the
> >   kernel to not send uevents into all network namespaces whose owning user
> >   namespace is not the initial user namespace because we know that
> >   userspace will ignore the message because of the uid anyway. I have
> >   a) verified that it is true for every udev implementation out there b)
> >   that this behavior has been present in all udev implementations from the
> >   very beginning.
> > - Removing needless overhead/Increasing performance:
> >   Currently, the uevent socket for each network namespace is added to the
> >   global variable uevent_sock_list. The list itself needs to be protected
> >   by a mutex. So every time a uevent is generated the mutex is taken on the
> >   list. The mutex is held *from the creation of the uevent (memory
> >   allocation, string creation, etc.) until all uevent sockets have been
> >   handled*. This is aggravated by the fact that for each uevent socket that
> >   has listeners the mc_list must be walked as well which means we're
> >   talking O(n^2) here. Given that a standard Linux workload usually has
> >   quite a lot of network namespaces and - in the face of containers - a lot
> >   of user namespaces this quickly becomes a performance problem (see
> >   "Thundering herd" above). By just recording uevent sockets of network
> >   namespaces that are owned by the initial user namespace we significantly
> >   increase performance in this codepath.
> > - Injecting uevents:
> >   There's a valid argument that containers might be interested in receiving
> >   device events especially if they are delegated to them by a privileged
> >   userspace process. One prime example are SR-IOV enabled devices that are
> >   explicitly designed to be handed off to other users such as VMs or
> >   containers.
> >   This use-case can now be correctly handled since
> >   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
> >   introduced the ability to send uevents from userspace. As such we can let
> >   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
> >   the network namespace of the netlink socket) userspace process make a
> >   decision what uevents should

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Eric W. Biederman
Christian Brauner  writes:

> On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  wrote:
>
>  Bah. This code is obviously correct and probably wrong.
>
>  How do we deliver uevents for network devices that are outside of the
>  initial user namespace? The kernel still needs to deliver those.
>
>  The logic to figure out which network namespace a device needs to be
>  delivered to is present in kobj_bcast_filter. That logic will almost
>  certainly need to be turned inside out. Sigh, not as easy as I would
>  have hoped.
>
> My first patch that we discussed put additional filtering logic into 
> kobj_bcast_filter for that very reason. But I can move that logic
> out and come up with a new patch.

I may have misunderstood.

I heard and am still hearing additional filtering to reduce the places
the packet is delivered.

I am saying something needs to change to increase the number of places
the packet is delivered.

For the special class of devices that kobj_bcast_filter would apply to,
those need to be delivered to network namespaces that are no longer on
uevent_sock_list.

So the code fundamentally needs to split into two paths.  Ordinary
devices that use uevent_sock_list.  Network devices that are just
delivered in their own network namespace.

netlink_broadcast_filtered gets to go away completely.
The logic of figuring out the network namespace though becomes trickier.

Now it may make sense to have all of that as an additional patch on top
of this one or perhaps a precursor patch that addresses the problem.  We
will unfortunately drop those uevents today because their uids are not
valid.  But they are not delivered anywhere else so to allow them to be
received we need to fix them.
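In rough code, the two paths Eric is describing might look like the sketch
below; deliver_to_tagged_netns() and deliver_to_uevent_sock_list() are made-up
helper names, not functions that exist in lib/kobject_uevent.c:

/* Illustrative sketch of the split delivery, not a tested patch. */
static int uevent_deliver(struct kobject *kobj, struct sk_buff *skb)
{
	const struct kobj_ns_type_operations *ops = kobj_ns_ops(kobj);

	if (ops && ops->netlink_ns && kobj->ktype->namespace) {
		/*
		 * Tagged kobjects (e.g. network devices): deliver only into
		 * the one network namespace the device lives in, even when
		 * that namespace is not owned by init_user_ns.
		 */
		const void *ns = kobj->ktype->namespace(kobj);

		return deliver_to_tagged_netns(ns, skb);
	}

	/*
	 * Untagged kobjects: walk uevent_sock_list, which after patch 1/2
	 * only contains namespaces owned by the initial user namespace.
	 */
	return deliver_to_uevent_sock_list(skb);
}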

Eric

>
>  Christian Brauner  writes:
>  > commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>  >
>  > enabled sending hotplug events into all network namespaces back in 2010.
>  > Over time the set of uevents that get sent into all network namespaces has
>  > shrunk a little. We have now reached the point where hotplug events for all
>  > devices that carry a namespace tag are filtered according to that
>  > namespace. Specifically, they are filtered whenever the namespace tag of
>  > the kobject does not match the namespace tag of the netlink socket. One
>  > example is network devices. Uevents for network devices only show up in
>  > the network namespaces these devices are moved to or created in.
>  >
>  > However, any uevent for a kobject that does not have a namespace tag
>  > associated with it will not be filtered and we will broadcast it into all
>  > network namespaces. This behavior stopped making sense when user namespaces
>  > were introduced.
>  >
>  > This patch restricts uevents to the initial user namespace for a couple of
>  > reasons that have been extensively discussed on the mailing list [1].
>  > - Thundering herd:
>  > Broadcasting uevents into all network namespaces introduces significant
>  > overhead.
>  > All processes that listen to uevents running in non-initial user
>  > namespaces will end up responding to uevents that will be meaningless to
>  > them. Mainly, because non-initial user namespaces cannot easily manage
>  > devices unless they have a privileged host-process helping them out. This
>  > means that there will be a thundering herd of activity when there
>  > shouldn't be any.
>  > - Uevents from non-root users are already filtered in userspace:
>  > Uevents are filtered by userspace in a user namespace because the
>  > received uid != 0. Instead the uid associated with the event will be
>  > 65534 == "nobody" because the global root uid is not mapped.
>  > This means we can safely and without introducing regressions modify the
>  > kernel to not send uevents into all network namespaces whose owning user
>  > namespace is not the initial user namespace because we know that
>  > userspace will ignore the message because of the uid anyway. I have
>  > a) verified that it is true for every udev implementation out there b)
>  > that this behavior has been present in all udev implementations from the
>  > very beginning.
>  > - Removing needless overhead/Increasing performance:
>  > Currently, the uevent socket for each network namespace is added to the
>  > global variable uevent_sock_list. The list itself needs to be protected
>  > by a mutex. So every time a uevent is generated the mutex is taken on the
>  > list. The mutex is held *from the creation of the uevent (memory
>  > allocation, string creation, etc.) until all uevent sockets have been
>  > handled*. This is aggravated by the fact that for each uevent socket that
>  > has listeners the mc_list must be walked as well which means we're
>  > talking O(n^2) here. Given that a standard Linux workload usually has
>  > quite a lot of network namespaces and - in the face of containers - a lot
>  > of user namespaces this quickly becomes a performance problem (see
>  > "Thundering herd" above). By just recordin

Re: [PATCH] [PATCH bpf-next] samples/bpf/bpf_load.c: remove redundant ret, assignment in bpf_load_program()

2018-04-24 Thread Daniel Borkmann
On 04/24/2018 10:18 AM, shhuiw wrote:
> 
> Two redundant ret assignments removed:
> * 'ret = 1' before the logic 'if (data_maps)': any error jumps to
>   label 'done', so no 'ret = 1' is needed before the error jump.
> * After the '/* load programs */' part, if everything goes well, then
>   the BPF code will be loaded and 'ret' set to 0 by load_and_attach().
>   If something goes wrong, 'ret' is set to non-0; the redundant 'ret = 0'
>   after the for clause causes the error to be skipped.
>   For example, if some BPF code cannot provide supported program types
>   in ELF SEC("unknown"), the for clause will not call load_and_attach()
>   to load the BPF code. 1 should be returned to callers instead of 0.
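The second point describes a common control-flow bug; here is a generic sketch
(made-up helpers, not the actual samples/bpf/bpf_load.c code) of why an
unconditional 'ret = 0' after the loop is harmful:

/* Generic sketch of the bug class -- helpers below are made up. */
#include <stdbool.h>

extern int nr_sections;
extern bool section_supported(int i);
extern int load_section(int i);

static int load_everything(void)
{
	int ret = 1;		/* assume failure until something loads */
	int i, loaded = 0;

	for (i = 0; i < nr_sections; i++) {
		if (!section_supported(i))
			continue;	/* e.g. an unknown ELF SEC() name */
		ret = load_section(i);
		if (ret)
			return ret;
		loaded++;
	}
	/* An unconditional "ret = 0;" here would report success even when
	 * the loop recognized and loaded nothing; keep ret == 1 instead. */
	return loaded ? 0 : ret;
}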
> 
> Signed-off-by: Wang Sheng-Hui 

Your patch is corrupted, please use something like git-send-email(1).

Thanks,
Daniel


[bpf-next PATCH] bpf: reduce runtime of test_sockmap tests

2018-04-24 Thread John Fastabend
When test_sockmap was running outside of selftests and was not being
run by build bots it was reasonable to spend significant amount of
time running various tests. The number of tests is high because many
different I/O iterators are run.

However, now that test_sockmap is part of selftests, rather than
iterating through all I/O sides only test a minimal set of min/max
values along with a few "normal" I/O ops. Also remove the long
running tests. They can be run from other test frameworks on a regular
cadence.

This significantly reduces the runtime of test_sockmap.

Before:

$ time sudo ./test_sockmap  > /dev/null

real    4m47.521s
user    0m0.370s
sys     0m3.131s

After:

$ time sudo ./test_sockmap  > /dev/null

real    0m0.514s
user    0m0.104s
sys     0m0.430s

The CLI is still available for users that want to test the long
running tests that do the larger send/recv tests.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_sockmap.c |   33 ++--
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 6d63a1c..29c022d 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -344,8 +344,8 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
if (err < 0)
perror("recv start time: ");
while (s->bytes_recvd < total_bytes) {
-   timeout.tv_sec = 1;
-   timeout.tv_usec = 0;
+   timeout.tv_sec = 0;
+   timeout.tv_usec = 10;
 
/* FD sets */
FD_ZERO(&w);
@@ -903,12 +903,10 @@ static int test_exec(int cgrp, struct sockmap_options *opt)
 {
int err = __test_exec(cgrp, SENDMSG, opt);
 
-   sched_yield();
if (err)
goto out;
 
err = __test_exec(cgrp, SENDPAGE, opt);
-   sched_yield();
 out:
return err;
 }
@@ -928,19 +926,18 @@ static int test_loop(int cgrp)
opt.iov_length = 0;
opt.rate = 0;
 
-   for (r = 1; r < 100; r += 33) {
-   for (i = 1; i < 100; i += 33) {
-   for (l = 1; l < 100; l += 33) {
-   opt.rate = r;
-   opt.iov_count = i;
-   opt.iov_length = l;
-   err = test_exec(cgrp, &opt);
-   if (err)
-   goto out;
-   }
+   r = 1;
+   for (i = 1; i < 100; i += 33) {
+   for (l = 1; l < 100; l += 33) {
+   opt.rate = r;
+   opt.iov_count = i;
+   opt.iov_length = l;
+   err = test_exec(cgrp, &opt);
+   if (err)
+   goto out;
}
}
-
+   sched_yield();
 out:
return err;
 }
@@ -1031,6 +1028,7 @@ static int test_send(struct sockmap_options *opt, int cgrp)
if (err)
goto out;
 out:
+   sched_yield();
return err;
 }
 
@@ -1168,7 +1166,7 @@ static int test_start_end(int cgrp)
opt.iov_length = 100;
txmsg_cork = 1600;
 
-   for (i = 99; i <= 1600; i += 100) {
+   for (i = 99; i <= 1600; i += 500) {
txmsg_start = 0;
txmsg_end = i;
err = test_exec(cgrp, &opt);
@@ -1177,7 +1175,7 @@ static int test_start_end(int cgrp)
}
 
/* Test start/end with cork but pull data in middle */
-   for (i = 199; i <= 1600; i += 100) {
+   for (i = 199; i <= 1600; i += 500) {
txmsg_start = 100;
txmsg_end = i;
err = test_exec(cgrp, &opt);
@@ -1221,6 +1219,7 @@ static int test_start_end(int cgrp)
 out:
txmsg_start = 0;
txmsg_end = 0;
+   sched_yield();
return err;
 }
 



[PATCH v3 ipsec-next] xfrm: remove VLA usage in __xfrm6_sort()

2018-04-24 Thread Kees Cook
In the quest to remove all stack VLA usage from the kernel[1],
just use XFRM_MAX_DEPTH as already done for the "class" array. In one
case, it'll do this loop up to 5, the other caller up to 6.

[1] https://lkml.org/lkml/2018/3/7/621

Co-developed-by: Andreas Christoforou 
Signed-off-by: Kees Cook 
---
v3:
- adjust Subject and commit log (Steffen)
- use "= { }" instead of memset() (Stefano)
- reorder variables (Stefano)
v2:
- use XFRM_MAX_DEPTH for "count" array (Steffen and Mathias).
---
 net/ipv6/xfrm6_state.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/xfrm6_state.c b/net/ipv6/xfrm6_state.c
index 16f434791763..eeb44b64ae7f 100644
--- a/net/ipv6/xfrm6_state.c
+++ b/net/ipv6/xfrm6_state.c
@@ -60,9 +60,9 @@ xfrm6_init_temprop(struct xfrm_state *x, const struct xfrm_tmpl *tmpl,
 static int
 __xfrm6_sort(void **dst, void **src, int n, int (*cmp)(void *p), int maxclass)
 {
-   int i;
+   int count[XFRM_MAX_DEPTH] = { };
int class[XFRM_MAX_DEPTH];
-   int count[maxclass];
+   int i;
 
memset(count, 0, sizeof(count));
 
-- 
2.7.4


-- 
Kees Cook
Pixel Security


[PATCH v3] ath9k: dfs: Remove VLA usage

2018-04-24 Thread Kees Cook
In the quest to remove all stack VLA usage from the kernel[1], this
redefines FFT_NUM_SAMPLES as a #define instead of a const int, since a
const int still triggers gcc's VLA checking pass.

[1] https://lkml.org/lkml/2018/3/7/621

Co-developed-by: Andreas Christoforou 
Signed-off-by: Kees Cook 
---
v3: replace FFT_NUM_SAMPLES as a #define (Joe)
---
 drivers/net/wireless/ath/ath9k/dfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/ath9k/dfs.c b/drivers/net/wireless/ath/ath9k/dfs.c
index 6fee9a464cce..e6e56a925121 100644
--- a/drivers/net/wireless/ath/ath9k/dfs.c
+++ b/drivers/net/wireless/ath/ath9k/dfs.c
@@ -40,8 +40,8 @@ static const int BIN_DELTA_MIN= 1;
 static const int BIN_DELTA_MAX = 10;
 
 /* we need at least 3 deltas / 4 samples for a reliable chirp detection */
-#define NUM_DIFFS 3
-static const int FFT_NUM_SAMPLES   = (NUM_DIFFS + 1);
+#define NUM_DIFFS  3
+#define FFT_NUM_SAMPLES        (NUM_DIFFS + 1)
 
 /* Threshold for difference of delta peaks */
 static const int MAX_DIFF  = 2;
-- 
2.7.4


-- 
Kees Cook
Pixel Security


Re: [PATCH v2] bpf, x86_32: add eBPF JIT compiler for ia32

2018-04-24 Thread Daniel Borkmann
On 04/19/2018 05:54 PM, Wang YanQing wrote:
> The JIT compiler emits ia32 (32-bit x86) instructions. Currently, it supports
> eBPF only. Classic BPF is supported because of the conversion by the BPF core.
> 
> Almost all instructions from eBPF ISA supported except the following:
> BPF_ALU64 | BPF_DIV | BPF_K
> BPF_ALU64 | BPF_DIV | BPF_X
> BPF_ALU64 | BPF_MOD | BPF_K
> BPF_ALU64 | BPF_MOD | BPF_X
> BPF_STX | BPF_XADD | BPF_W
> BPF_STX | BPF_XADD | BPF_DW
> 
> It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL too.
> 
> ia32 has few general purpose registers, EAX|EDX|ECX|EBX|ESI|EDI,
> and of these six registers, we can't treat all of them as real
> general purpose registers in the JIT:
> MUL instructions need EAX:EDX, shift instructions need ECX.
> 
> So I decided to use the stack to emulate all eBPF 64-bit registers; this will
> simplify the implementation a lot, because we don't need to face the
> flexible memory addressing modes on ia32. For example, we don't need to
> write the below code pattern for one BPF_ADD instruction:
> 
> if (src is a register && dst is a register)
> {
>//one instruction encoding for ADD instruction
> } else if (only src is a register)
> {
>//another different instruction encoding for ADD instruction
> } else if (only dst is a register)
> {
>//another different instruction encoding for ADD instruction
> } else
> {
>//src and dst are all on stack.
>//move src or dst to temporary registers
> }
> 
> If the above if-else-else-else example isn't painful enough, try to think
> it through for the BPF_ALU64|BPF_*SHIFT* instructions, which we need many
> native instructions to emulate.
> 
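To make the trade-off concrete, here is a rough sketch of what one
stack-emulated 64-bit BPF_ADD could turn into; the emit helpers, the jit_ctx
type and the stack layout below are made up for illustration and are not taken
from the patch:

/* Illustrative only: every eBPF 64-bit register lives in an 8-byte stack
 * slot; bpf_off(reg) is that slot's offset from EBP. */
static void emit_alu64_add_reg(struct jit_ctx *ctx, int dst, int src)
{
	/* mov eax, [ebp + bpf_off(src)]      ; low 32 bits of src        */
	emit_load32(ctx, EAX, bpf_off(src));
	/* add [ebp + bpf_off(dst)], eax      ; low 32 bits, sets carry   */
	emit_add32_mem(ctx, bpf_off(dst), EAX);
	/* mov eax, [ebp + bpf_off(src) + 4]  ; high 32 bits of src       */
	emit_load32(ctx, EAX, bpf_off(src) + 4);
	/* adc [ebp + bpf_off(dst) + 4], eax  ; high 32 bits plus carry   */
	emit_adc32_mem(ctx, bpf_off(dst) + 4, EAX);
}

Because both operands always live in stack slots, there is a single code path
per instruction instead of the register/stack combinations sketched above.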
> Tested on my PC (Intel(R) Core(TM) i5-5200U CPU) and virtualbox.

Just out of plain curiosity, do you have a specific use case on the
JIT for x86_32? Are you targeting this to be used for e.g. Atom or
the like (sorry, don't really have a good overview where x86_32 is
still in use these days)?

> Testing results on i5-5200U:
> 
> 1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed]
> 2) test_progs: Summary: 81 PASSED, 2 FAILED.
>test_progs report "libbpf: incorrect bpf_call opcode" for
>test_l4lb_noinline and test_xdp_noinline, because there is
>no llvm-6.0 on my machine, and current implementation doesn't
>support BPF_PSEUDO_CALL, so I think we can ignore the two failed
>testcases.
> 3) test_lpm: OK
> 4) test_lru_map: OK
> 5) test_verifier: Summary: 823 PASSED, 5 FAILED
>test_verifier reports "invalid bpf_context access off=68 size=1/2/4/8"
>for all the 5 FAILED testcases with/without jit; we need to fix the
>failed testcases themselves instead of this jit.

Can you elaborate further on these? Looks like this definitely needs
fixing on 32 bit. Would be great to get a better understanding of the
underlying bug(s) and properly fix them.

> Above tests are all done with each of the following flag settings separately:
> 1:bpf_jit_enable=1 and bpf_jit_harden=0
> 2:bpf_jit_enable=1 and bpf_jit_harden=2
> 
> Below are some numbers for this jit implementation:
> Note:
>   I ran test_progs in kselftest 100 times continuously for every testcase;
>   the numbers are in the format total/times=avg.
>   The numbers that test_bpf reports show almost the same relation.
> 
> a:jit_enable=0 and jit_harden=0              b:jit_enable=1 and jit_harden=0
>   test_pkt_access:PASS:ipv4:15622/100=156      test_pkt_access:PASS:ipv4:10057/100=100
>   test_pkt_access:PASS:ipv6:9130/100=91        test_pkt_access:PASS:ipv6:5055/100=50
>   test_xdp:PASS:ipv4:240198/100=2401           test_xdp:PASS:ipv4:145945/100=1459
>   test_xdp:PASS:ipv6:137326/100=1373           test_xdp:PASS:ipv6:67337/100=673
>   test_l4lb:PASS:ipv4:61100/100=611            test_l4lb:PASS:ipv4:38137/100=381
>   test_l4lb:PASS:ipv6:101000/100=1010          test_l4lb:PASS:ipv6:57779/100=577
> 
> c:jit_enable=1 and jit_harden=2
>   test_pkt_access:PASS:ipv4:12650/100=126
>   test_pkt_access:PASS:ipv6:7074/100=70
>   test_xdp:PASS:ipv4:147211/100=1472
>   test_xdp:PASS:ipv6:85783/100=857
>   test_l4lb:PASS:ipv4:53222/100=532
>   test_l4lb:PASS:ipv6:76322/100=763
> 
> Yes, the numbers are pretty good when jit_harden is turned off. If we want to
> speed up jit_harden, then we need to move BPF_REG_AX to a *real* register
> instead of stack emulation, but when we do that, we need to face all the pain
> I describe above. We can do it in a next step.
> 
> See Documentation/networking/filter.txt for more information.
> 
> Signed-off-by: Wang YanQing 
[...]
> + /* call */
> + case BPF_JMP | BPF_CALL:
> + {
> + const u8 *r1 = bpf2ia32[BPF_REG_1];
> + const u8 *r2 = bpf2ia32[BPF_REG_2];
> + const u8 *r3 = bpf2ia32[BPF_REG_3];
> + const u8 *r4 = bpf2ia32[BPF_REG_4];
> + const u8 *r5 = bpf2ia32[BPF_REG_5];
> +
> + if (insn->src_reg == BPF_PSEUDO_CALL)
> + goto no

Re: [bpf-next PATCH] bpf: reduce runtime of test_sockmap tests

2018-04-24 Thread Daniel Borkmann
On 04/25/2018 01:28 AM, John Fastabend wrote:
> When test_sockmap was running outside of selftests and was not being
> run by build bots it was reasonable to spend significant amount of
> time running various tests. The number of tests is high because many
> different I/O iterators are run.
> 
> However, now that test_sockmap is part of selftests, rather than
> iterating through all I/O sides only test a minimal set of min/max
> values along with a few "normal" I/O ops. Also remove the long
> running tests. They can be run from other test frameworks on a regular
> cadence.
> 
> This significantly reduces the runtime of test_sockmap.
> 
> Before:
> 
> $ time sudo ./test_sockmap  > /dev/null
> 
> real    4m47.521s
> user    0m0.370s
> sys     0m3.131s
> 
> After:
> 
> $ time sudo ./test_sockmap  > /dev/null
> 
> real    0m0.514s
> user    0m0.104s
> sys     0m0.430s
> 
> The CLI is still available for users that want to test the long
> running tests that do the larger send/recv tests.
> 
> Signed-off-by: John Fastabend 

Applied to bpf-next, thanks John!


[PATCH net-next v2 6/7] net: dsa: Allow providing PHY statistics from CPU port

2018-04-24 Thread Florian Fainelli
Implement the same type of ethtool diversion that we have for
ETH_SS_STATS and make it work with ETH_SS_PHY_STATS. This allows
providing PHY level statistics for CPU ports that are directly
connected to a PHY device.

Signed-off-by: Florian Fainelli 
---
 include/net/dsa.h |  7 +++
 net/dsa/master.c  | 47 -
 net/dsa/port.c| 57 +++
 3 files changed, 106 insertions(+), 5 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 0bc0aad1b02e..462e9741b210 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -361,6 +361,8 @@ struct dsa_switch_ops {
void(*get_ethtool_stats)(struct dsa_switch *ds,
 int port, uint64_t *data);
int (*get_sset_count)(struct dsa_switch *ds, int port, int sset);
+   void(*get_ethtool_phy_stats)(struct dsa_switch *ds,
+int port, uint64_t *data);
 
/*
 * ethtool Wake-on-LAN
@@ -589,4 +591,9 @@ static inline int call_dsa_notifiers(unsigned long val, 
struct net_device *dev,
 #define BRCM_TAG_GET_PORT(v)   ((v) >> 8)
 #define BRCM_TAG_GET_QUEUE(v)  ((v) & 0xff)
 
+
+int dsa_port_get_phy_strings(struct dsa_port *dp, uint8_t *data);
+int dsa_port_get_ethtool_phy_stats(struct dsa_port *dp, uint64_t *data);
+int dsa_port_get_phy_sset_count(struct dsa_port *dp);
+
 #endif
diff --git a/net/dsa/master.c b/net/dsa/master.c
index 8d27687fd0ca..c90ee3227dea 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -31,6 +31,32 @@ static void dsa_master_get_ethtool_stats(struct net_device 
*dev,
ds->ops->get_ethtool_stats(ds, port, data + count);
 }
 
+static void dsa_master_get_ethtool_phy_stats(struct net_device *dev,
+struct ethtool_stats *stats,
+uint64_t *data)
+{
+   struct dsa_port *cpu_dp = dev->dsa_ptr;
+   const struct ethtool_ops *ops = cpu_dp->orig_ethtool_ops;
+   struct dsa_switch *ds = cpu_dp->ds;
+   int port = cpu_dp->index;
+   int count = 0;
+
+   if (dev->phydev && !ops->get_ethtool_phy_stats) {
+   count = phy_ethtool_get_sset_count(dev->phydev);
+   if (count >= 0)
+   phy_ethtool_get_stats(dev->phydev, stats, data);
+   } else if (ops->get_sset_count && ops->get_ethtool_phy_stats) {
+   count = ops->get_sset_count(dev, ETH_SS_PHY_STATS);
+   ops->get_ethtool_phy_stats(dev, stats, data);
+   }
+
+   if (count < 0)
+   count = 0;
+
+   if (ds->ops->get_ethtool_phy_stats)
+   ds->ops->get_ethtool_phy_stats(ds, port, data + count);
+}
+
 static int dsa_master_get_sset_count(struct net_device *dev, int sset)
 {
struct dsa_port *cpu_dp = dev->dsa_ptr;
@@ -38,11 +64,14 @@ static int dsa_master_get_sset_count(struct net_device 
*dev, int sset)
struct dsa_switch *ds = cpu_dp->ds;
int count = 0;
 
-   if (ops->get_sset_count) {
+   if (sset == ETH_SS_PHY_STATS && dev->phydev &&
+   !ops->get_ethtool_phy_stats)
+   count = phy_ethtool_get_sset_count(dev->phydev);
+   else if (ops->get_sset_count)
count = ops->get_sset_count(dev, sset);
-   if (count < 0)
-   count = 0;
-   }
+
+   if (count < 0)
+   count = 0;
 
if (ds->ops->get_sset_count)
count += ds->ops->get_sset_count(ds, cpu_dp->index, sset);
@@ -67,7 +96,14 @@ static void dsa_master_get_strings(struct net_device *dev, 
uint32_t stringset,
/* We do not want to be NULL-terminated, since this is a prefix */
pfx[sizeof(pfx) - 1] = '_';
 
-   if (ops->get_sset_count && ops->get_strings) {
+   if (stringset == ETH_SS_PHY_STATS && dev->phydev &&
+   !ops->get_ethtool_phy_stats) {
+   mcount = phy_ethtool_get_sset_count(dev->phydev);
+   if (mcount < 0)
+   mcount = 0;
+   else
+   phy_ethtool_get_strings(dev->phydev, data);
+   } else if (ops->get_sset_count && ops->get_strings) {
mcount = ops->get_sset_count(dev, stringset);
if (mcount < 0)
mcount = 0;
@@ -107,6 +143,7 @@ static int dsa_master_ethtool_setup(struct net_device *dev)
ops->get_sset_count = dsa_master_get_sset_count;
ops->get_ethtool_stats = dsa_master_get_ethtool_stats;
ops->get_strings = dsa_master_get_strings;
+   ops->get_ethtool_phy_stats = dsa_master_get_ethtool_phy_stats;
 
dev->ethtool_ops = ops;
 
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 5e2a88720a9a..2413beb995be 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -383,3 +383,60 @@ void dsa_port_link_unregister_of(struct dsa_port *dp)
else
dsa_port_

[PATCH net-next v2 5/7] net: dsa: Add helper function to obtain PHY device of a given port

2018-04-24 Thread Florian Fainelli
In preparation for having more call sites attempting to obtain a
reference against a PHY device corresponding to a particular port,
introduce a helper function for that purpose.

Signed-off-by: Florian Fainelli 
---
 net/dsa/port.c | 33 ++---
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index 7acc1169d75e..5e2a88720a9a 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -273,25 +273,38 @@ int dsa_port_vlan_del(struct dsa_port *dp,
return 0;
 }
 
-static int dsa_port_setup_phy_of(struct dsa_port *dp, bool enable)
+static struct phy_device *dsa_port_get_phy_device(struct dsa_port *dp)
 {
-   struct device_node *port_dn = dp->dn;
struct device_node *phy_dn;
-   struct dsa_switch *ds = dp->ds;
struct phy_device *phydev;
-   int port = dp->index;
-   int err = 0;
 
-   phy_dn = of_parse_phandle(port_dn, "phy-handle", 0);
+   phy_dn = of_parse_phandle(dp->dn, "phy-handle", 0);
if (!phy_dn)
-   return 0;
+   return NULL;
 
phydev = of_phy_find_device(phy_dn);
if (!phydev) {
-   err = -EPROBE_DEFER;
-   goto err_put_of;
+   of_node_put(phy_dn);
+   return ERR_PTR(-EPROBE_DEFER);
}
 
+   return phydev;
+}
+
+static int dsa_port_setup_phy_of(struct dsa_port *dp, bool enable)
+{
+   struct dsa_switch *ds = dp->ds;
+   struct phy_device *phydev;
+   int port = dp->index;
+   int err = 0;
+
+   phydev = dsa_port_get_phy_device(dp);
+   if (!phydev)
+   return 0;
+
+   if (IS_ERR(phydev))
+   return PTR_ERR(phydev);
+
if (enable) {
err = genphy_config_init(phydev);
if (err < 0)
@@ -317,8 +330,6 @@ static int dsa_port_setup_phy_of(struct dsa_port *dp, bool 
enable)
 
 err_put_dev:
put_device(&phydev->mdio.dev);
-err_put_of:
-   of_node_put(phy_dn);
return err;
 }
 
-- 
2.14.1



[PATCH net-next v2 7/7] net: dsa: b53: Add support for reading PHY statistics

2018-04-24 Thread Florian Fainelli
Allow the b53 driver to return PHY statistics when the CPU port used is
different from 5, 7 or 8, because those ports are typically PHY-less on most
devices. This is useful for debugging link problems between the switch
and an external host when using a non-standard CPU port number (e.g. 4).

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c | 76 ++--
 drivers/net/dsa/b53/b53_priv.h   |  1 +
 drivers/net/dsa/bcm_sf2.c|  1 +
 3 files changed, 68 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 726b2d8c6fe9..7faea467aca8 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -806,20 +806,39 @@ static unsigned int b53_get_mib_size(struct b53_device 
*dev)
return B53_MIBS_SIZE;
 }
 
+static struct phy_device *b53_get_phy_device(struct dsa_switch *ds, int port)
+{
+   /* These ports typically do not have built-in PHYs */
+   switch (port) {
+   case B53_CPU_PORT_25:
+   case 7:
+   case B53_CPU_PORT:
+   return NULL;
+   }
+
+   return mdiobus_get_phy(ds->slave_mii_bus, port);
+}
+
 void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
 uint8_t *data)
 {
struct b53_device *dev = ds->priv;
const struct b53_mib_desc *mibs = b53_get_mib(dev);
unsigned int mib_size = b53_get_mib_size(dev);
+   struct phy_device *phydev;
unsigned int i;
 
-   if (stringset != ETH_SS_STATS)
-   return;
+   if (stringset == ETH_SS_STATS) {
+   for (i = 0; i < mib_size; i++)
+   strlcpy(data + i * ETH_GSTRING_LEN,
+   mibs[i].name, ETH_GSTRING_LEN);
+   } else if (stringset == ETH_SS_PHY_STATS) {
+   phydev = b53_get_phy_device(ds, port);
+   if (!phydev)
+   return;
 
-   for (i = 0; i < mib_size; i++)
-   strlcpy(data + i * ETH_GSTRING_LEN,
-   mibs[i].name, ETH_GSTRING_LEN);
+   phy_ethtool_get_strings(phydev, data);
+   }
 }
 EXPORT_SYMBOL(b53_get_strings);
 
@@ -856,14 +875,34 @@ void b53_get_ethtool_stats(struct dsa_switch *ds, int 
port, uint64_t *data)
 }
 EXPORT_SYMBOL(b53_get_ethtool_stats);
 
+void b53_get_ethtool_phy_stats(struct dsa_switch *ds, int port, uint64_t *data)
+{
+   struct phy_device *phydev;
+
+   phydev = b53_get_phy_device(ds, port);
+   if (!phydev)
+   return;
+
+   phy_ethtool_get_stats(phydev, NULL, data);
+}
+EXPORT_SYMBOL(b53_get_ethtool_phy_stats);
+
 int b53_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
struct b53_device *dev = ds->priv;
+   struct phy_device *phydev;
 
-   if (sset != ETH_SS_STATS)
-   return 0;
+   if (sset == ETH_SS_STATS) {
+   return b53_get_mib_size(dev);
+   } else if (sset == ETH_SS_PHY_STATS) {
+   phydev = b53_get_phy_device(ds, port);
+   if (!phydev)
+   return 0;
 
-   return b53_get_mib_size(dev);
+   return phy_ethtool_get_sset_count(phydev);
+   }
+
+   return 0;
 }
 EXPORT_SYMBOL(b53_get_sset_count);
 
@@ -1484,7 +1523,7 @@ void b53_br_fast_age(struct dsa_switch *ds, int port)
 }
 EXPORT_SYMBOL(b53_br_fast_age);
 
-static bool b53_can_enable_brcm_tags(struct dsa_switch *ds, int port)
+static bool b53_possible_cpu_port(struct dsa_switch *ds, int port)
 {
/* Broadcom switches will accept enabling Broadcom tags on the
 * following ports: 5, 7 and 8, any other port is not supported
@@ -1496,10 +1535,19 @@ static bool b53_can_enable_brcm_tags(struct dsa_switch 
*ds, int port)
return true;
}
 
-   dev_warn(ds->dev, "Port %d is not Broadcom tag capable\n", port);
return false;
 }
 
+static bool b53_can_enable_brcm_tags(struct dsa_switch *ds, int port)
+{
+   bool ret = b53_possible_cpu_port(ds, port);
+
+   if (!ret)
+   dev_warn(ds->dev, "Port %d is not Broadcom tag capable\n",
+port);
+   return ret;
+}
+
 enum dsa_tag_protocol b53_get_tag_protocol(struct dsa_switch *ds, int port)
 {
struct b53_device *dev = ds->priv;
@@ -1657,6 +1705,7 @@ static const struct dsa_switch_ops b53_switch_ops = {
.get_strings= b53_get_strings,
.get_ethtool_stats  = b53_get_ethtool_stats,
.get_sset_count = b53_get_sset_count,
+   .get_ethtool_phy_stats  = b53_get_ethtool_phy_stats,
.phy_read   = b53_phy_read16,
.phy_write  = b53_phy_write16,
.adjust_link= b53_adjust_link,
@@ -1961,6 +2010,13 @@ static int b53_switch_init(struct b53_device *dev)
dev->num_ports = dev->cpu_port + 1;
dev->enabled_ports |= BIT(dev->cpu_port);
 
+   /* Include non s

[PATCH net-next v2 0/7] net: Extend availability of PHY statistics

2018-04-24 Thread Florian Fainelli

Hi all,

This patch series adds support for retrieving PHY statistics with DSA switches
when the CPU port uses a PHY to PHY connection (as opposed to MAC to MAC).
To get there a number of things are done:

- first we move the code dealing with PHY statistics outside of
  net/core/ethtool.c and create helper functions since the same code will
  be reused
- then we allow network device drivers to provide an ethtool_get_phy_stats
  callback when the standard PHY library helpers are not suitable
- we update the DSA functions dealing with ethtool operations to get passed a
  stringset instead of assuming ETH_SS_STATS like they currently do
- then we provide a set of standard helpers within DSA as a framework and add
  the plumbing to allow retrieving the PHY statistics of the CPU port(s)
- finally plug support for retrieving such PHY statistics with the b53 driver

Changes in v2:

- got actual testing when the DSA master network device has a PHY that
  already provides statistics (thanks Nikita!)

- fixed the kbuild error reported when CONFIG_PHYLIB=n

- removed the checking of ops which is redundant and not needed

Florian Fainelli (7):
  net: Move PHY statistics code into PHY library helpers
  net: Allow network devices to have PHY statistics
  net: dsa: Do not check for ethtool_ops validity
  net: dsa: Pass stringset to ethtool operations
  net: dsa: Add helper function to obtain PHY device of a given port
  net: dsa: Allow providing PHY statistics from CPU port
  net: dsa: b53: Add support for reading PHY statistics

 drivers/net/dsa/b53/b53_common.c   | 79 ++---
 drivers/net/dsa/b53/b53_priv.h |  6 ++-
 drivers/net/dsa/bcm_sf2.c  |  1 +
 drivers/net/dsa/dsa_loop.c | 11 -
 drivers/net/dsa/lan9303-core.c | 11 -
 drivers/net/dsa/microchip/ksz_common.c | 11 -
 drivers/net/dsa/mt7530.c   | 11 -
 drivers/net/dsa/mv88e6xxx/chip.c   | 10 +++-
 drivers/net/dsa/qca8k.c| 10 +++-
 drivers/net/phy/phy.c  | 48 ++
 include/linux/ethtool.h|  5 ++
 include/linux/phy.h| 20 
 include/net/dsa.h  | 12 -
 net/core/ethtool.c | 61 ---
 net/dsa/master.c   | 62 +++
 net/dsa/port.c | 90 +-
 net/dsa/slave.c|  5 +-
 17 files changed, 366 insertions(+), 87 deletions(-)

-- 
2.14.1



[PATCH net-next v2 4/7] net: dsa: Pass stringset to ethtool operations

2018-04-24 Thread Florian Fainelli
Up until now we largely assumed that we were interested in ETH_SS_STATS
type of strings for all ethtool operations; this is about to change with
the introduction of additional string sets, e.g. ETH_SS_PHY_STATS.
Update all functions to take an appropriate stringset argument and act
on it when it is different from ETH_SS_STATS, for now.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c   | 11 +--
 drivers/net/dsa/b53/b53_priv.h |  5 +++--
 drivers/net/dsa/dsa_loop.c | 11 +--
 drivers/net/dsa/lan9303-core.c | 11 +--
 drivers/net/dsa/microchip/ksz_common.c | 11 +--
 drivers/net/dsa/mt7530.c   | 11 +--
 drivers/net/dsa/mv88e6xxx/chip.c   | 10 --
 drivers/net/dsa/qca8k.c| 10 --
 include/net/dsa.h  |  5 +++--
 net/dsa/master.c   | 21 +
 net/dsa/slave.c|  5 +++--
 11 files changed, 83 insertions(+), 28 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 78616787f2a3..726b2d8c6fe9 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -806,13 +806,17 @@ static unsigned int b53_get_mib_size(struct b53_device 
*dev)
return B53_MIBS_SIZE;
 }
 
-void b53_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
+uint8_t *data)
 {
struct b53_device *dev = ds->priv;
const struct b53_mib_desc *mibs = b53_get_mib(dev);
unsigned int mib_size = b53_get_mib_size(dev);
unsigned int i;
 
+   if (stringset != ETH_SS_STATS)
+   return;
+
for (i = 0; i < mib_size; i++)
strlcpy(data + i * ETH_GSTRING_LEN,
mibs[i].name, ETH_GSTRING_LEN);
@@ -852,10 +856,13 @@ void b53_get_ethtool_stats(struct dsa_switch *ds, int 
port, uint64_t *data)
 }
 EXPORT_SYMBOL(b53_get_ethtool_stats);
 
-int b53_get_sset_count(struct dsa_switch *ds, int port)
+int b53_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
struct b53_device *dev = ds->priv;
 
+   if (sset != ETH_SS_STATS)
+   return 0;
+
return b53_get_mib_size(dev);
 }
 EXPORT_SYMBOL(b53_get_sset_count);
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index 1187ebd79287..b933d5cb5c2d 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -286,9 +286,10 @@ static inline int b53_switch_get_reset_gpio(struct 
b53_device *dev)
 /* Exported functions towards other drivers */
 void b53_imp_vlan_setup(struct dsa_switch *ds, int cpu_port);
 int b53_configure_vlan(struct dsa_switch *ds);
-void b53_get_strings(struct dsa_switch *ds, int port, uint8_t *data);
+void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
+uint8_t *data);
 void b53_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *data);
-int b53_get_sset_count(struct dsa_switch *ds, int port);
+int b53_get_sset_count(struct dsa_switch *ds, int port, int sset);
 int b53_br_join(struct dsa_switch *ds, int port, struct net_device *bridge);
 void b53_br_leave(struct dsa_switch *ds, int port, struct net_device *bridge);
 void b53_br_set_stp_state(struct dsa_switch *ds, int port, u8 state);
diff --git a/drivers/net/dsa/dsa_loop.c b/drivers/net/dsa/dsa_loop.c
index f77be9f85cb3..9354cc08d3fd 100644
--- a/drivers/net/dsa/dsa_loop.c
+++ b/drivers/net/dsa/dsa_loop.c
@@ -86,16 +86,23 @@ static int dsa_loop_setup(struct dsa_switch *ds)
return 0;
 }
 
-static int dsa_loop_get_sset_count(struct dsa_switch *ds, int port)
+static int dsa_loop_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
+   if (sset != ETH_SS_STATS)
+   return 0;
+
return __DSA_LOOP_CNT_MAX;
 }
 
-static void dsa_loop_get_strings(struct dsa_switch *ds, int port, uint8_t 
*data)
+static void dsa_loop_get_strings(struct dsa_switch *ds, int port,
+u32 stringset, uint8_t *data)
 {
struct dsa_loop_priv *ps = ds->priv;
unsigned int i;
 
+   if (stringset != ETH_SS_STATS)
+   return;
+
for (i = 0; i < __DSA_LOOP_CNT_MAX; i++)
memcpy(data + i * ETH_GSTRING_LEN,
   ps->ports[port].mib[i].name, ETH_GSTRING_LEN);
diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
index fefa454f3e56..b4f6e1a67dd9 100644
--- a/drivers/net/dsa/lan9303-core.c
+++ b/drivers/net/dsa/lan9303-core.c
@@ -977,10 +977,14 @@ static const struct lan9303_mib_desc lan9303_mib[] = {
{ .offset = LAN9303_MAC_TX_LATECOL_0, .name = "TxLateCol", },
 };
 
-static void lan9303_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+static void lan9303_get_strings(struct dsa_switch *ds, int port,
+ 

[PATCH net-next v2 3/7] net: dsa: Do not check for ethtool_ops validity

2018-04-24 Thread Florian Fainelli
This is completely redundant with what netdev_set_default_ethtool_ops()
does; we are always guaranteed to have a valid dev->ethtool_ops pointer.
However, within that structure, not all function pointers may be
populated, so we still have to check them individually.

Signed-off-by: Florian Fainelli 
---
 net/dsa/master.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/dsa/master.c b/net/dsa/master.c
index 90e6df0351eb..9ec16b39ed15 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -22,7 +22,7 @@ static void dsa_master_get_ethtool_stats(struct net_device 
*dev,
int port = cpu_dp->index;
int count = 0;
 
-   if (ops && ops->get_sset_count && ops->get_ethtool_stats) {
+   if (ops->get_sset_count && ops->get_ethtool_stats) {
count = ops->get_sset_count(dev, ETH_SS_STATS);
ops->get_ethtool_stats(dev, stats, data);
}
@@ -38,7 +38,7 @@ static int dsa_master_get_sset_count(struct net_device *dev, 
int sset)
struct dsa_switch *ds = cpu_dp->ds;
int count = 0;
 
-   if (ops && ops->get_sset_count)
+   if (ops->get_sset_count)
count += ops->get_sset_count(dev, sset);
 
if (sset == ETH_SS_STATS && ds->ops->get_sset_count)
@@ -64,7 +64,7 @@ static void dsa_master_get_strings(struct net_device *dev, 
uint32_t stringset,
/* We do not want to be NULL-terminated, since this is a prefix */
pfx[sizeof(pfx) - 1] = '_';
 
-   if (ops && ops->get_sset_count && ops->get_strings) {
+   if (ops->get_sset_count && ops->get_strings) {
mcount = ops->get_sset_count(dev, ETH_SS_STATS);
ops->get_strings(dev, stringset, data);
}
-- 
2.14.1



[PATCH net-next v2 1/7] net: Move PHY statistics code into PHY library helpers

2018-04-24 Thread Florian Fainelli
In order to make it possible for network device drivers that do not
necessarily have a phy_device attached to still report PHY statistics,
do a preliminary refactoring that creates helper functions
encapsulating the PHY device driver knowledge within PHYLIB.
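
As an illustration of how a consumer outside net/core/ethtool.c might use the
new helpers, here is a minimal sketch with a hypothetical driver "foo",
assuming a phy_device is attached to the netdev (the DSA patches later in
this series do essentially this):

static int foo_get_sset_count(struct net_device *dev, int sset)
{
	if (sset == ETH_SS_PHY_STATS && dev->phydev)
		return phy_ethtool_get_sset_count(dev->phydev);
	return -EOPNOTSUPP;
}

static void foo_get_strings(struct net_device *dev, u32 sset, u8 *data)
{
	if (sset == ETH_SS_PHY_STATS && dev->phydev)
		phy_ethtool_get_strings(dev->phydev, data);
}

static void foo_get_ethtool_phy_stats(struct net_device *dev,
				      struct ethtool_stats *stats, u64 *data)
{
	if (dev->phydev)
		phy_ethtool_get_stats(dev->phydev, stats, data);
}

phy_ethtool_get_sset_count() returns -EIO when no PHY driver is bound and
-EOPNOTSUPP when the driver does not implement statistics, so a caller only
needs to check for a negative count before fetching strings and stats.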

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/phy.c | 48 
 include/linux/phy.h   | 20 
 net/core/ethtool.c| 38 --
 3 files changed, 76 insertions(+), 30 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 05c1e8ef15e6..a98ed12c0009 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1277,3 +1277,51 @@ int phy_ethtool_nway_reset(struct net_device *ndev)
return phy_restart_aneg(phydev);
 }
 EXPORT_SYMBOL(phy_ethtool_nway_reset);
+
+int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data)
+{
+   if (!phydev->drv)
+   return -EIO;
+
+   mutex_lock(&phydev->lock);
+   phydev->drv->get_strings(phydev, data);
+   mutex_unlock(&phydev->lock);
+
+   return 0;
+}
+EXPORT_SYMBOL(phy_ethtool_get_strings);
+
+int phy_ethtool_get_sset_count(struct phy_device *phydev)
+{
+   int ret;
+
+   if (!phydev->drv)
+   return -EIO;
+
+   if (phydev->drv->get_sset_count &&
+   phydev->drv->get_strings &&
+   phydev->drv->get_stats) {
+   mutex_lock(&phydev->lock);
+   ret = phydev->drv->get_sset_count(phydev);
+   mutex_unlock(&phydev->lock);
+
+   return ret;
+   }
+
+   return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL(phy_ethtool_get_sset_count);
+
+int phy_ethtool_get_stats(struct phy_device *phydev,
+ struct ethtool_stats *stats, u64 *data)
+{
+   if (!phydev->drv)
+   return -EIO;
+
+   mutex_lock(&phydev->lock);
+   phydev->drv->get_stats(phydev, stats, data);
+   mutex_unlock(&phydev->lock);
+
+   return 0;
+}
+EXPORT_SYMBOL(phy_ethtool_get_stats);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index f0b5870a6d40..6ca81395c545 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1066,6 +1066,26 @@ int phy_ethtool_nway_reset(struct net_device *ndev);
 #if IS_ENABLED(CONFIG_PHYLIB)
 int __init mdio_bus_init(void);
 void mdio_bus_exit(void);
+int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data);
+int phy_ethtool_get_sset_count(struct phy_device *phydev);
+int phy_ethtool_get_stats(struct phy_device *phydev,
+ struct ethtool_stats *stats, u64 *data);
+#else
+static inline int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline int phy_ethtool_get_sset_count(struct phy_device *phydev)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline int phy_ethtool_get_stats(struct phy_device *phydev,
+ struct ethtool_stats *stats, u64 *data)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 
 extern struct bus_type mdio_bus_type;
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 03416e6dd5d7..f0d42e093c4a 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -210,23 +210,6 @@ static int ethtool_set_features(struct net_device *dev, 
void __user *useraddr)
return ret;
 }
 
-static int phy_get_sset_count(struct phy_device *phydev)
-{
-   int ret;
-
-   if (phydev->drv->get_sset_count &&
-   phydev->drv->get_strings &&
-   phydev->drv->get_stats) {
-   mutex_lock(&phydev->lock);
-   ret = phydev->drv->get_sset_count(phydev);
-   mutex_unlock(&phydev->lock);
-
-   return ret;
-   }
-
-   return -EOPNOTSUPP;
-}
-
 static int __ethtool_get_sset_count(struct net_device *dev, int sset)
 {
const struct ethtool_ops *ops = dev->ethtool_ops;
@@ -245,7 +228,7 @@ static int __ethtool_get_sset_count(struct net_device *dev, 
int sset)
 
if (sset == ETH_SS_PHY_STATS) {
if (dev->phydev)
-   return phy_get_sset_count(dev->phydev);
+   return phy_ethtool_get_sset_count(dev->phydev);
else
return -EOPNOTSUPP;
}
@@ -272,15 +255,10 @@ static void __ethtool_get_strings(struct net_device *dev,
else if (stringset == ETH_SS_PHY_TUNABLES)
memcpy(data, phy_tunable_strings, sizeof(phy_tunable_strings));
else if (stringset == ETH_SS_PHY_STATS) {
-   struct phy_device *phydev = dev->phydev;
-
-   if (phydev) {
-   mutex_lock(&phydev->lock);
-   phydev->drv->get_strings(phydev, data);
-   mutex_unlock(&phydev->lock);
-   } else {
+   if (dev->phydev)
+   phy_ethtool_get_strings(dev->phydev, data);
+   else
return;
-   }
} else
/* 

[PATCH net-next v2 2/7] net: Allow network devices to have PHY statistics

2018-04-24 Thread Florian Fainelli
Add a new callback: get_ethtool_phy_stats() which allows network device
drivers not making use of the PHY library to return PHY statistics.
Update ethtool_get_phy_stats(), __ethtool_get_sset_count() and
__ethtool_get_strings() accordingly to interrogate the network device
about ETH_SS_PHY_STATS.
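
For a driver that keeps its own PHY counters (and therefore cannot use the
PHY library helpers), the new callback would be wired up roughly as follows.
This is a hypothetical sketch, not taken from any in-tree driver; the
"foo" names and counters are made up:

static const char foo_phy_stat_names[][ETH_GSTRING_LEN] = {
	"phy_rx_errors",
	"phy_idle_errors",
};

static int foo_get_sset_count(struct net_device *dev, int sset)
{
	if (sset == ETH_SS_PHY_STATS)
		return ARRAY_SIZE(foo_phy_stat_names);
	return -EOPNOTSUPP;
}

static void foo_get_strings(struct net_device *dev, u32 sset, u8 *data)
{
	if (sset == ETH_SS_PHY_STATS)
		memcpy(data, foo_phy_stat_names, sizeof(foo_phy_stat_names));
}

static void foo_get_ethtool_phy_stats(struct net_device *dev,
				      struct ethtool_stats *stats, u64 *data)
{
	struct foo_priv *priv = netdev_priv(dev);	/* hypothetical private data */

	data[0] = priv->phy_rx_errors;
	data[1] = priv->phy_idle_errors;
}

static const struct ethtool_ops foo_ethtool_ops = {
	.get_sset_count		= foo_get_sset_count,
	.get_strings		= foo_get_strings,
	.get_ethtool_phy_stats	= foo_get_ethtool_phy_stats,
};

Since the core prefers ops->get_ethtool_phy_stats over the attached phydev
when both are present, such a driver stays in control of what is reported
for ETH_SS_PHY_STATS.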

Signed-off-by: Florian Fainelli 
---
 include/linux/ethtool.h |  5 +
 net/core/ethtool.c  | 39 +--
 2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index ebe41811ed34..9b19556f0156 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -310,6 +310,9 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 
*legacy_u32,
  * fields should be ignored (use %__ETHTOOL_LINK_MODE_MASK_NBITS
  * instead of the latter), any change to them will be overwritten
  * by kernel. Returns a negative error code or zero.
+ * @get_ethtool_phy_stats: Return extended statistics about the PHY device.
+ * This is only useful if the device maintains PHY statistics and
+ * cannot use the standard PHY library helpers.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
@@ -405,5 +408,7 @@ struct ethtool_ops {
  struct ethtool_fecparam *);
int (*set_fecparam)(struct net_device *,
  struct ethtool_fecparam *);
+   void(*get_ethtool_phy_stats)(struct net_device *,
+struct ethtool_stats *, u64 *);
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index f0d42e093c4a..4b8992ccf904 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -226,12 +226,9 @@ static int __ethtool_get_sset_count(struct net_device 
*dev, int sset)
if (sset == ETH_SS_PHY_TUNABLES)
return ARRAY_SIZE(phy_tunable_strings);
 
-   if (sset == ETH_SS_PHY_STATS) {
-   if (dev->phydev)
-   return phy_ethtool_get_sset_count(dev->phydev);
-   else
-   return -EOPNOTSUPP;
-   }
+   if (sset == ETH_SS_PHY_STATS && dev->phydev &&
+   !ops->get_ethtool_phy_stats)
+   return phy_ethtool_get_sset_count(dev->phydev);
 
if (ops->get_sset_count && ops->get_strings)
return ops->get_sset_count(dev, sset);
@@ -254,12 +251,10 @@ static void __ethtool_get_strings(struct net_device *dev,
memcpy(data, tunable_strings, sizeof(tunable_strings));
else if (stringset == ETH_SS_PHY_TUNABLES)
memcpy(data, phy_tunable_strings, sizeof(phy_tunable_strings));
-   else if (stringset == ETH_SS_PHY_STATS) {
-   if (dev->phydev)
-   phy_ethtool_get_strings(dev->phydev, data);
-   else
-   return;
-   } else
+   else if (stringset == ETH_SS_PHY_STATS && dev->phydev &&
+!ops->get_ethtool_phy_stats)
+   phy_ethtool_get_strings(dev->phydev, data);
+   else
/* ops->get_strings is valid because checked earlier */
ops->get_strings(dev, stringset, data);
 }
@@ -1971,15 +1966,19 @@ static int ethtool_get_stats(struct net_device *dev, 
void __user *useraddr)
 
 static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 {
-   struct ethtool_stats stats;
+   const struct ethtool_ops *ops = dev->ethtool_ops;
struct phy_device *phydev = dev->phydev;
+   struct ethtool_stats stats;
u64 *data;
int ret, n_stats;
 
-   if (!phydev)
+   if (!phydev && (!ops->get_ethtool_phy_stats || !ops->get_sset_count))
return -EOPNOTSUPP;
 
-   n_stats = phy_ethtool_get_sset_count(dev->phydev);
+   if (dev->phydev && !ops->get_ethtool_phy_stats)
+   n_stats = phy_ethtool_get_sset_count(dev->phydev);
+   else
+   n_stats = ops->get_sset_count(dev, ETH_SS_PHY_STATS);
if (n_stats < 0)
return n_stats;
if (n_stats > S32_MAX / sizeof(u64))
@@ -1994,9 +1993,13 @@ static int ethtool_get_phy_stats(struct net_device *dev, 
void __user *useraddr)
if (n_stats && !data)
return -ENOMEM;
 
-   ret = phy_ethtool_get_stats(dev->phydev, &stats, data);
-   if (ret < 0)
-   return ret;
+   if (dev->phydev && !ops->get_ethtool_phy_stats) {
+   ret = phy_ethtool_get_stats(dev->phydev, &stats, data);
+   if (ret < 0)
+   return ret;
+   } else {
+   ops->get_ethtool_phy_stats(dev, &stats, data);
+   }
 
ret = -EFAULT;
if (copy_to_user(useraddr, &stats, sizeof(stats)))
-- 
2.14.1



Re: [PATCH net-next v2 0/7] net: Extend availability of PHY statistics

2018-04-24 Thread Florian Fainelli
On 04/24/2018 05:27 PM, Florian Fainelli wrote:
> Hi all,
> 
> This patch series adds support for retrieving PHY statistics with DSA switches
> when the CPU port uses a PHY to PHY connection (as opposed to MAC to MAC).
> To get there a number of things are done:
> 
> - first we move the code dealing with PHY statistics outside of 
> net/core/ethtool.c
>   and create helper functions since the same code will be reused
> - then we allow network device drivers to provide an ethtool_get_phy_stats 
> callback
>   when the standard PHY library helpers are not suitable
> - we update the DSA functions dealing with ethtool operations to get passed a
>   stringset instead of assuming ETH_SS_STATS like they currently do
> - then we provide a set of standard helpers within DSA as a framework and add
>   the plumbing to allow retrieving the PHY statistics of the CPU port(s)
> - finally plug support for retrieving such PHY statistics with the b53 driver
> 
> Changes in v2:
> 
> - got actual testing when the DSA master network device has a PHY that
>   already provides statistics (thanks Nikita!)
> 
> - fixed the kbuild error reported when CONFIG_PHYLIB=n
> 
> - removed the checking of ops which is redundant and not needed

David, sorry looks like there will be a v3, the last patch introduces a
possible problem with bcm_sf2 which uses b53_common as a library, will
respin later.
-- 
Florian


Re: [RFC PATCH ghak32 V2 01/13] audit: add container id

2018-04-24 Thread Richard Guy Briggs
On 2018-04-24 15:01, Paul Moore wrote:
> On Mon, Apr 23, 2018 at 10:02 PM, Richard Guy Briggs  wrote:
> > On 2018-04-23 19:15, Paul Moore wrote:
> >> On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs  
> >> wrote:
> >> > On 2018-04-18 19:47, Paul Moore wrote:
> >> >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs  
> >> >> wrote:
> >> >> > Implement the proc fs write to set the audit container ID of a 
> >> >> > process,
> >> >> > emitting an AUDIT_CONTAINER record to document the event.
> >> >> >
> >> >> > This is a write from the container orchestrator task to a proc entry 
> >> >> > of
> >> >> > the form /proc/PID/containerid where PID is the process ID of the 
> >> >> > newly
> >> >> > created task that is to become the first task in a container, or an
> >> >> > additional task added to a container.
> >> >> >
> >> >> > The write expects up to a u64 value (unset: 18446744073709551615).
> >> >> >
> >> >> > This will produce a record such as this:
> >> >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 
> >> >> > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 
> >> >> > tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 
> >> >> > res=0
> >> >> >
> >> >> > The "op" field indicates an initial set.  The "pid" to "ses" fields 
> >> >> > are
> >> >> > the orchestrator while the "opid" field is the object's PID, the 
> >> >> > process
> >> >> > being "contained".  Old and new container ID values are given in the
> >> >> > "contid" fields, while res indicates its success.
> >> >> >
> >> >> > It is not permitted to self-set, unset or re-set the container ID.  A
> >> >> > child inherits its parent's container ID, but then can be set only 
> >> >> > once
> >> >> > after.
> >> >> >
> >> >> > See: https://github.com/linux-audit/audit-kernel/issues/32
> >> >> >
> >> >> > Signed-off-by: Richard Guy Briggs 
> >> >> > ---
> >> >> >  fs/proc/base.c | 37 
> >> >> >  include/linux/audit.h  | 16 +
> >> >> >  include/linux/init_task.h  |  4 ++-
> >> >> >  include/linux/sched.h  |  1 +
> >> >> >  include/uapi/linux/audit.h |  2 ++
> >> >> >  kernel/auditsc.c   | 84 
> >> >> > ++
> >> >> >  6 files changed, 143 insertions(+), 1 deletion(-)
> 
> ...
> 
> >> >> >  /* audit_rule_data supports filter rules with both integer and string
> >> >> >   * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> >> >> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> >> >> > index 4e0a4ac..29c8482 100644
> >> >> > --- a/kernel/auditsc.c
> >> >> > +++ b/kernel/auditsc.c
> >> >> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid)
> >> >> > return rc;
> >> >> >  }
> >> >> >
> >> >> > +static int audit_set_containerid_perm(struct task_struct *task, u64 
> >> >> > containerid)
> >> >> > +{
> >> >> > +   struct task_struct *parent;
> >> >> > +   u64 pcontainerid, ccontainerid;
> >> >> > +
> >> >> > +   /* Don't allow to set our own containerid */
> >> >> > +   if (current == task)
> >> >> > +   return -EPERM;
> >> >>
> >> >> Why not?  Is there some obvious security concern that I missing?
> >> >
> >> > We then lose the distinction in the AUDIT_CONTAINER record between the
> >> > initiating PID and the target PID.  This was outlined in the proposal.
> >>
> >> I just went back and reread the v3 proposal and I still don't see a
> >> good explanation of this.  Why is this bad?  What's the security
> >> concern?
> >
> > I don't remember, specifically.  Maybe this has been addressed by the
> > check for children/threads or identical parent container ID.  So, I'm
> > reluctantly willing to remove that check for now.
> 
> Okay.  For the record, if someone can explain to me why this
> restriction saves us from some terrible situation I'm all for leaving
> it.  I'm just opposed to restrictions without solid reasoning behind
> them.
> 
> >> > Having said that, I'm still not sure we have protected sufficiently from
> >> > a child turning around and setting it's parent's as yet unset or
> >> > inherited audit container ID.
> >>
> >> Yes, I believe we only want to let a task set the audit container for
> >> it's children (or itself/threads if we decide to allow that, see
> >> above).  There *has* to be a function to check to see if a task if a
> >> child of a given task ... right? ... although this is likely to be a
> >> pointer traversal and locking nightmare ... hmmm.
> >
> > Isn't that just (struct task_struct)parent == (struct
> > task_struct)child->parent (or ->real_parent)?
> >
> > And now that I say that, it is covered by the following patch's child
> > check, so as long as we keep that, we should be fine.
> 
> I was thinking of checking not just current's immediate children, but
> any of it's descendants as I believe that is what we want to limit,
> yes?  I just worry that it isn't really practical to perform that
> check.

Th

Re: [PATCH] net: phy: allow scanning busses with missing phys

2018-04-24 Thread Florian Fainelli
On 04/24/2018 11:01 AM, Andrew Lunn wrote:
> On Tue, Apr 24, 2018 at 09:37:09AM -0700, Florian Fainelli wrote:
>>
>>
>> On 04/24/2018 09:09 AM, Alexandre Belloni wrote:
>>> Some MDIO busses will error out when trying to read a phy address with no
>>> phy present at that address. In that case, probing the bus will fail
>>> because __mdiobus_register() is scanning the bus for all possible phys
>>> addresses.
>>>
>>> In case MII_PHYSID1 returns -EIO or -ENODEV, consider there is no phy at
>>> this address and set the phy ID to 0xffffffff which is then properly
>>> handled in get_phy_device().
>>
>> Humm, why not have your MDIO bus implementation do the scanning itself
>> in a reset() callback, which happens before probing the bus, and based
>> on the results, set phy_mask accordingly such that only PHYs present are
>> populated?
> 
> Hi Florian
> 
> Seems a bit odd have the driver do this, when the core could.

I could see arguments for doing it in either location. If this is done
in the driver, the driver is responsible for knowing what addresses will
respond, which does not seem too big of a stretch. If done by the core, we
relieve the driver from taking care of that. Either way is fine, really.

> 
>> My only concern with your change is that we are having a special
>> treatment for EIO and ENODEV, so we must make sure MDIO bus drivers are
>> all conforming to that.
>  
> I don't see how it could be a problem. It used to be that any error was a
> real error, and would stop the bus from loading. It now means there is
> nothing there. The only possible side effect is an mdio bus driver
> might remain loaded without any devices if all reads return
> ENODEV/EIO, whereas before it would probably never load. 

Right, though that does not seem to be a big problem.

> 
>   Andrew
> 


-- 
Florian


Re: [PATCH] net: phy: allow scanning busses with missing phys

2018-04-24 Thread Florian Fainelli
On 04/24/2018 09:09 AM, Alexandre Belloni wrote:
> Some MDIO busses will error out when trying to read a phy address with no
> phy present at that address. In that case, probing the bus will fail
> because __mdiobus_register() is scanning the bus for all possible phys
> addresses.
> 
> In case MII_PHYSID1 returns -EIO or -ENODEV, consider there is no phy at
> this address and set the phy ID to 0xffffffff which is then properly
> handled in get_phy_device().
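
(For readers following along without the patch body, which is not quoted in
this thread: the described change boils down to something like the following
in get_phy_id(), a sketch of the idea rather than the exact hunk:

	phy_reg = mdiobus_read(bus, addr, MII_PHYSID1);
	if (phy_reg < 0) {
		/* treat -EIO/-ENODEV as "no PHY here", not a fatal bus error */
		if (phy_reg == -EIO || phy_reg == -ENODEV) {
			*phy_id = 0xffffffff;
			return 0;
		}
		return -EIO;
	}

so that __mdiobus_register() can keep scanning the remaining addresses.)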
> 
> Suggested-by: Andrew Lunn 
> Signed-off-by: Alexandre Belloni 

Reviewed-by: Florian Fainelli 
-- 
Florian


[PATCH net-next] net: rules: Move l3mdev attribute validation to a helper

2018-04-24 Thread David Ahern
Move the check on FRA_L3MDEV attribute to helper to improve the
readability of fib_nl2rule. Update the extack messages to be
clear when the configuration option is disabled versus an invalid
value has been passed.

Signed-off-by: David Ahern 
---
 net/core/fib_rules.c | 34 --
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 2271c80fd967..126ffc5bc630 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -454,6 +454,27 @@ static struct fib_rule *rule_find(struct fib_rules_ops 
*ops,
return NULL;
 }
 
+#ifdef CONFIG_NET_L3_MASTER_DEV
+static int fib_nl2rule_l3mdev(struct nlattr *nla, struct fib_rule *nlrule,
+ struct netlink_ext_ack *extack)
+{
+   nlrule->l3mdev = nla_get_u8(nla);
+   if (nlrule->l3mdev != 1) {
+   NL_SET_ERR_MSG(extack, "Invalid l3mdev attribute");
+   return -1;
+   }
+
+   return 0;
+}
+#else
+static int fib_nl2rule_l3mdev(struct nlattr *nla, struct fib_rule *nlrule,
+ struct netlink_ext_ack *extack)
+{
+   NL_SET_ERR_MSG(extack, "l3mdev support is not enabled in kernel");
+   return -1;
+}
+#endif
+
 static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
   struct netlink_ext_ack *extack,
   struct fib_rules_ops *ops,
@@ -536,16 +557,9 @@ static int fib_nl2rule(struct sk_buff *skb, struct 
nlmsghdr *nlh,
nlrule->tun_id = nla_get_be64(tb[FRA_TUN_ID]);
 
err = -EINVAL;
-   if (tb[FRA_L3MDEV]) {
-#ifdef CONFIG_NET_L3_MASTER_DEV
-   nlrule->l3mdev = nla_get_u8(tb[FRA_L3MDEV]);
-   if (nlrule->l3mdev != 1)
-#endif
-   {
-   NL_SET_ERR_MSG(extack, "Invalid l3mdev");
-   goto errout_free;
-   }
-   }
+   if (tb[FRA_L3MDEV] &&
+   fib_nl2rule_l3mdev(tb[FRA_L3MDEV], nlrule, extack) < 0)
+   goto errout_free;
 
nlrule->action = frh->action;
nlrule->flags = frh->flags;
-- 
2.11.0



general protection fault in smc_set_keepalive

2018-04-24 Thread syzbot

Hello,

syzbot hit the following crash on net-next commit
9c20b9372fbaf6f7d4c05f5f925806a7928f0c73 (Tue Apr 24 03:08:41 2018 +)
net: fib_rules: fix l3mdev netlink attr processing
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=cf9012c597c8379d535c


So far this crash happened 2 times on net-next.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4775309383565312
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=4978230683500544
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=4770663504019456
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-2918904850634584293

compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+cf9012c597c8379d5...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.

If you forward the report, please keep this part and the footer.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4455 Comm: syz-executor060 Not tainted 4.17.0-rc1+ #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

RIP: 0010:smc_set_keepalive+0x4e/0xd0 net/smc/af_smc.c:59
RSP: 0018:8801ced8fa68 EFLAGS: 00010202
RAX: dc00 RBX:  RCX: 85d72bcb
RDX: 0004 RSI: 873f0a94 RDI: 0020
RBP: 8801ced8fa80 R08: 8801b67e44c0 R09: 0006
R10: 8801b67e44c0 R11:  R12: 8801b6bff7c0
R13: 0001 R14: 0003 R15: 8801aee2b540
FS:  009e4880() GS:8801dae0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 2040 CR3: 0001b6e75000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 sock_setsockopt+0x14e2/0x1fe0 net/core/sock.c:801
 __sys_setsockopt+0x2df/0x390 net/socket.c:1899
 __do_sys_setsockopt net/socket.c:1914 [inline]
 __se_sys_setsockopt net/socket.c:1911 [inline]
 __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43fcf9
RSP: 002b:7ffe62977a78 EFLAGS: 0217 ORIG_RAX: 0036
RAX: ffda RBX: 004002c8 RCX: 0043fcf9
RDX: 0009 RSI: 0001 RDI: 0003
RBP: 006ca018 R08: 0004 R09: 004002c8
R10: 2040 R11: 0217 R12: 00401620
R13: 004016b0 R14:  R15: 
Code: ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 78 48 8b 9b 50 04 00 00 48  
b8 00 00 00 00 00 fc ff df 48 8d 7b 20 48 89 fa 48 c1 ea 03 <80> 3c 02 00  
75 6b 48 b8 00 00 00 00 00 fc ff df 48 8b 5b 20 48

RIP: smc_set_keepalive+0x4e/0xd0 net/smc/af_smc.c:59 RSP: 8801ced8fa68
---[ end trace a76f9ed0fb111068 ]---


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged

into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.

Note: all commands must start from beginning of the line in the email body.


Re: [PATCH 2/5] ide: kill ide_toggle_bounce

2018-04-24 Thread Jens Axboe
On 4/24/18 12:16 PM, Christoph Hellwig wrote:
> ide_toggle_bounce did select various strange block bounce limits, including
> not bouncing at all as soon as an iommu is present in the system.  Given
> that the dma_map routines now handle any required bounce buffering except
> for ISA DMA, and the ide code already must handle either ISA DMA or highmem
> at least for iommu equipped systems we can get rid of the block layer
> bounce limit setting entirely.

Pretty sure I was the one to add this code, when highmem page IO was
enabled about two decades ago...

Outside of DMA, the issue was that the PIO code could not handle
highmem. That's not the case anymore, so this should be fine.

Reviewed-by: Jens Axboe 

-- 
Jens Axboe



[PATCH] [PATCH bpf-next] samples/bpf/bpf_load.c: remove redundant ret assignment in bpf_load_program()

2018-04-24 Thread Wang Sheng-Hui
Two redundant ret assignments are removed:
* 'ret = 1' before the logic 'if (data_maps)': if any error occurs we jump
  to the label 'done', and no 'ret = 1' is needed before the error jump.
* 'ret = 0' after the '/* load programs */' part: if everything goes well,
  the BPF code will be loaded and 'ret' is set to 0 by load_and_attach().
  If something goes wrong, 'ret' is set to non-0, and the redundant
  'ret = 0' after the for clause would cause the error to be skipped.
  For example, if some BPF code cannot provide supported program types
  in ELF SEC("unknown"), the for clause will not call load_and_attach()
  to load the BPF code; 1 should be returned to the caller instead of 0.

Signed-off-by: Wang Sheng-Hui 
---
 samples/bpf/bpf_load.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index bebe4188b4b3..feca497d6afd 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -549,7 +549,6 @@ static int do_load_bpf_file(const char *path, fixup_map_cb 
fixup_map)
if (nr_maps < 0) {
printf("Error: Failed loading ELF maps (errno:%d):%s\n",
   nr_maps, strerror(-nr_maps));
-   ret = 1;
goto done;
}
if (load_maps(map_data, nr_maps, fixup_map))
@@ -615,7 +614,6 @@ static int do_load_bpf_file(const char *path, fixup_map_cb 
fixup_map)
}
}
 
-   ret = 0;
 done:
close(fd);
return ret;
-- 
2.17.0





Re: [PATCH net-next v2 0/7] net: Extend availability of PHY statistics

2018-04-24 Thread David Miller
From: Florian Fainelli 
Date: Tue, 24 Apr 2018 17:35:09 -0700

> David, sorry looks like there will be a v3, the last patch introduces a
> possible problem with bcm_sf2 which uses b53_common as a library, will
> respin later.

Ok, no problem.


Re: [PATCH] net: ethernet: ave: fix ave_start_xmit()'s return type

2018-04-24 Thread Kunihiko Hayashi
Hi,

On Tue, 24 Apr 2018 15:17:25 +0200
Luc Van Oostenryck  wrote:

> The method ndo_start_xmit() is defined as returning a 'netdev_tx_t',
> which is a typedef for an enum type, but the implementation in this
> driver returns an 'int'.
> 
> Fix this by returning 'netdev_tx_t' in this driver too.
> 
> Signed-off-by: Luc Van Oostenryck 

Thanks for fixing.
I think it's preferable to add 'Fixes'.

Fixes: 4c270b55a5af ("net: ethernet: socionext: add AVE ethernet driver")
Acked-by: Kunihiko Hayashi 

Thank you,

> ---
>  drivers/net/ethernet/socionext/sni_ave.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/socionext/sni_ave.c 
> b/drivers/net/ethernet/socionext/sni_ave.c
> index 0b3b7a460..95625bca1 100644
> --- a/drivers/net/ethernet/socionext/sni_ave.c
> +++ b/drivers/net/ethernet/socionext/sni_ave.c
> @@ -1360,7 +1360,7 @@ static int ave_stop(struct net_device *ndev)
>   return 0;
>  }
>  
> -static int ave_start_xmit(struct sk_buff *skb, struct net_device *ndev)
> +static netdev_tx_t ave_start_xmit(struct sk_buff *skb, struct net_device 
> *ndev)
>  {
>   struct ave_private *priv = netdev_priv(ndev);
>   u32 proc_idx, done_idx, ndesc, cmdsts;
> -- 
> 2.17.0

---
Best Regards,
Kunihiko Hayashi




[PATCH net-next 0/4] nfp: flower tc block support and nfp PCI updates

2018-04-24 Thread Jakub Kicinski
This series improves the nfp PCIe code by making use of the new
pcie_print_link_status() helper and by resetting NFP locks when
the driver loads.  This can help us avoid lockups after the host
crashes and is rebooted with a PCIe reset, or when a kdump kernel
is loaded.

The flower changes come from John, he says:

This patchset fixes offload issues when multiple repr netdevs are bound to
a tc block and filter rules are added. Previously the rule would be passed to
the reprs and would be rejected in all but the first as the cookie value
will indicate a duplicate. The first patch extends the flow lookup
function to consider both host context and ingress netdev along with the
cookie value. This means that a rule with a given cookie can exist
multiple times assuming the ingress netdev is different. The host context
ensures that stats from fw are associated with the correct instance of the
rule.

The second patch protects against rejecting add/del/stat messages when a
rule has a repr as both an ingress port and an egress dev. In such cases a
callback can be triggered twice (once for ingress and once for egress)
and can lead to duplicate rule detection or incorrect double calls.


Jakub Kicinski (2):
  nfp: reset local locks on init
  nfp: print PCIe link bandwidth on probe

John Hurley (2):
  nfp: flower: support offloading multiple rules with same cookie
  nfp: flower: ignore duplicate cb requests for same rule

 drivers/net/ethernet/netronome/nfp/flower/main.h   |  9 +++-
 .../net/ethernet/netronome/nfp/flower/metadata.c   | 20 +---
 .../net/ethernet/netronome/nfp/flower/offload.c| 50 ++
 drivers/net/ethernet/netronome/nfp/nfp_main.c  |  5 ++
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h   |  2 +
 .../ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c  |  1 +
 .../net/ethernet/netronome/nfp/nfpcore/nfp_cpp.h   |  2 +
 .../net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c | 45 +
 .../ethernet/netronome/nfp/nfpcore/nfp_resource.c  | 59 ++
 9 files changed, 173 insertions(+), 20 deletions(-)

-- 
2.16.2



[PATCH net-next 1/4] nfp: reset local locks on init

2018-04-24 Thread Jakub Kicinski
NFP locks record the owner when held; for PCIe devices the owner
ID will be the PCIe link number.  When the driver loads it should
scan known locks and, if they indicate that they are held by the
local endpoint but the driver doesn't hold them, release them.

Locks can be left taken, for instance, when the kernel gets
kexec-ed or after a crash.  Management FW tries to clean up stale
locks too, but it currently depends on the PCIe link going down,
which doesn't always happen.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfp_main.c  |  5 ++
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h   |  2 +
 .../net/ethernet/netronome/nfp/nfpcore/nfp_cpp.h   |  2 +
 .../net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c | 45 +
 .../ethernet/netronome/nfp/nfpcore/nfp_resource.c  | 59 ++
 5 files changed, 113 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index 6a3e3231e111..c8aef9508109 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -489,6 +489,10 @@ static int nfp_pci_probe(struct pci_dev *pdev,
goto err_disable_msix;
}
 
+   err = nfp_resource_table_init(pf->cpp);
+   if (err)
+   goto err_cpp_free;
+
pf->hwinfo = nfp_hwinfo_read(pf->cpp);
 
dev_info(&pdev->dev, "Assembly: %s%s%s-%s CPLD: %s\n",
@@ -551,6 +555,7 @@ static int nfp_pci_probe(struct pci_dev *pdev,
vfree(pf->dumpspec);
 err_hwinfo_free:
kfree(pf->hwinfo);
+err_cpp_free:
nfp_cpp_free(pf->cpp);
 err_disable_msix:
destroy_workqueue(pf->wq);
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h
index ced62d112aa2..f44d0a857314 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h
@@ -94,6 +94,8 @@ int nfp_nsp_read_sensors(struct nfp_nsp *state, unsigned int 
sensor_mask,
 /* MAC Statistics Accumulator */
 #define NFP_RESOURCE_MAC_STATISTICS"mac.stat"
 
+int nfp_resource_table_init(struct nfp_cpp *cpp);
+
 struct nfp_resource *
 nfp_resource_acquire(struct nfp_cpp *cpp, const char *name);
 
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpp.h 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpp.h
index c8f2c064cce3..4e19add1c539 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpp.h
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpp.h
@@ -295,6 +295,8 @@ void nfp_cpp_mutex_free(struct nfp_cpp_mutex *mutex);
 int nfp_cpp_mutex_lock(struct nfp_cpp_mutex *mutex);
 int nfp_cpp_mutex_unlock(struct nfp_cpp_mutex *mutex);
 int nfp_cpp_mutex_trylock(struct nfp_cpp_mutex *mutex);
+int nfp_cpp_mutex_reclaim(struct nfp_cpp *cpp, int target,
+ unsigned long long address);
 
 /**
  * nfp_cppcore_pcie_unit() - Get PCI Unit of a CPP handle
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c
index cb28ac03e4ca..c88bf673cb76 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c
@@ -59,6 +59,11 @@ static u32 nfp_mutex_unlocked(u16 interface)
return (u32)interface << 16 | 0x;
 }
 
+static u32 nfp_mutex_owner(u32 val)
+{
+   return val >> 16;
+}
+
 static bool nfp_mutex_is_locked(u32 val)
 {
return (val & 0x) == 0x000f;
@@ -351,3 +356,43 @@ int nfp_cpp_mutex_trylock(struct nfp_cpp_mutex *mutex)
 
return nfp_mutex_is_locked(tmp) ? -EBUSY : -EINVAL;
 }
+
+/**
+ * nfp_cpp_mutex_reclaim() - Unlock mutex if held by local endpoint
+ * @cpp:   NFP CPP handle
+ * @target:NFP CPP target ID (ie NFP_CPP_TARGET_CLS or NFP_CPP_TARGET_MU)
+ * @address:   Offset into the address space of the NFP CPP target ID
+ *
+ * Release lock if held by local system.  Extreme care is advised, call only
+ * when no local lock users can exist.
+ *
+ * Return:  0 if the lock was OK, 1 if locked by us, -errno on invalid 
mutex
+ */
+int nfp_cpp_mutex_reclaim(struct nfp_cpp *cpp, int target,
+ unsigned long long address)
+{
+   const u32 mur = NFP_CPP_ID(target, 3, 0);   /* atomic_read */
+   const u32 muw = NFP_CPP_ID(target, 4, 0);   /* atomic_write */
+   u16 interface = nfp_cpp_interface(cpp);
+   int err;
+   u32 tmp;
+
+   err = nfp_cpp_mutex_validate(interface, &target, address);
+   if (err)
+   return err;
+
+   /* Check lock */
+   err = nfp_cpp_readl(cpp, mur, address, &tmp);
+   if (err < 0)
+   return err;
+
+   if (nfp_mutex_is_unlocked(tmp) || nfp_mutex_owner(tmp) != interface)
+   return 0;
+
+   /* Bust the lock */
+   err = nfp_cpp_writel(cpp, muw, address, nfp_mutex_unlocked(interface));
+   

[PATCH net-next 4/4] nfp: flower: ignore duplicate cb requests for same rule

2018-04-24 Thread Jakub Kicinski
From: John Hurley 

If a flower rule has a repr both as ingress and egress port then 2
callbacks may be generated for the same rule request.

Add an indicator to each flow as to whether or not it was added from an
ingress registered cb. If so then ignore add/del/stat requests to it from
an egress cb.

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  1 +
 .../net/ethernet/netronome/nfp/flower/offload.c| 23 +++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h 
b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 9e6804bc9b40..733ff53cc601 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -194,6 +194,7 @@ struct nfp_fl_payload {
char *unmasked_data;
char *mask_data;
char *action_data;
+   bool ingress_offload;
 };
 
 struct nfp_fl_stats_frame {
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index bdc82e11a31e..70ec9d821b91 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -345,7 +345,7 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
 }
 
 static struct nfp_fl_payload *
-nfp_flower_allocate_new(struct nfp_fl_key_ls *key_layer)
+nfp_flower_allocate_new(struct nfp_fl_key_ls *key_layer, bool egress)
 {
struct nfp_fl_payload *flow_pay;
 
@@ -371,6 +371,8 @@ nfp_flower_allocate_new(struct nfp_fl_key_ls *key_layer)
flow_pay->meta.flags = 0;
spin_lock_init(&flow_pay->lock);
 
+   flow_pay->ingress_offload = !egress;
+
return flow_pay;
 
 err_free_mask:
@@ -402,8 +404,20 @@ nfp_flower_add_offload(struct nfp_app *app, struct 
net_device *netdev,
struct nfp_flower_priv *priv = app->priv;
struct nfp_fl_payload *flow_pay;
struct nfp_fl_key_ls *key_layer;
+   struct net_device *ingr_dev;
int err;
 
+   ingr_dev = egress ? NULL : netdev;
+   flow_pay = nfp_flower_search_fl_table(app, flow->cookie, ingr_dev,
+ NFP_FL_STATS_CTX_DONT_CARE);
+   if (flow_pay) {
+   /* Ignore as duplicate if it has been added by different cb. */
+   if (flow_pay->ingress_offload && egress)
+   return 0;
+   else
+   return -EOPNOTSUPP;
+   }
+
key_layer = kmalloc(sizeof(*key_layer), GFP_KERNEL);
if (!key_layer)
return -ENOMEM;
@@ -413,7 +427,7 @@ nfp_flower_add_offload(struct nfp_app *app, struct 
net_device *netdev,
if (err)
goto err_free_key_ls;
 
-   flow_pay = nfp_flower_allocate_new(key_layer);
+   flow_pay = nfp_flower_allocate_new(key_layer, egress);
if (!flow_pay) {
err = -ENOMEM;
goto err_free_key_ls;
@@ -485,7 +499,7 @@ nfp_flower_del_offload(struct nfp_app *app, struct 
net_device *netdev,
nfp_flow = nfp_flower_search_fl_table(app, flow->cookie, ingr_dev,
  NFP_FL_STATS_CTX_DONT_CARE);
if (!nfp_flow)
-   return -ENOENT;
+   return egress ? 0 : -ENOENT;
 
err = nfp_modify_flow_metadata(app, nfp_flow);
if (err)
@@ -534,6 +548,9 @@ nfp_flower_get_stats(struct nfp_app *app, struct net_device 
*netdev,
if (!nfp_flow)
return -EINVAL;
 
+   if (nfp_flow->ingress_offload && egress)
+   return 0;
+
spin_lock_bh(&nfp_flow->lock);
tcf_exts_stats_update(flow->exts, nfp_flow->stats.bytes,
  nfp_flow->stats.pkts, nfp_flow->stats.used);
-- 
2.16.2



[PATCH net-next 3/4] nfp: flower: support offloading multiple rules with same cookie

2018-04-24 Thread Jakub Kicinski
From: John Hurley 

When multiple netdevs are attached to a tc offload block and register for
callbacks, a rule added to the block will be propagated to all netdevs.
Previously these were detected as duplicates (based on cookie) and
rejected. Modify the rule nfp lookup function to optionally include an
ingress netdev and a host context along with the cookie value when
searching for a rule. When a new rule is passed to the driver, the netdev
the rule is to be attached to is considered when searching for duplicates.
When a stats update is received from HW, the host context is used
alongside the cookie to map to the correct host rule.

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  8 +--
 .../net/ethernet/netronome/nfp/flower/metadata.c   | 20 +---
 .../net/ethernet/netronome/nfp/flower/offload.c| 27 --
 3 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h 
b/drivers/net/ethernet/netronome/nfp/flower/main.h
index c67e1b54c614..9e6804bc9b40 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -47,6 +47,7 @@
 struct net_device;
 struct nfp_app;
 
+#define NFP_FL_STATS_CTX_DONT_CARE cpu_to_be32(0xffffffff)
 #define NFP_FL_STATS_ENTRY_RS  BIT(20)
 #define NFP_FL_STATS_ELEM_RS   4
 #define NFP_FL_REPEATED_HASH_MAX   BIT(17)
@@ -189,6 +190,7 @@ struct nfp_fl_payload {
spinlock_t lock; /* lock stats */
struct nfp_fl_stats stats;
__be32 nfp_tun_ipv4_addr;
+   struct net_device *ingress_dev;
char *unmasked_data;
char *mask_data;
char *action_data;
@@ -216,12 +218,14 @@ int nfp_flower_compile_action(struct 
tc_cls_flower_offload *flow,
  struct nfp_fl_payload *nfp_flow);
 int nfp_compile_flow_metadata(struct nfp_app *app,
  struct tc_cls_flower_offload *flow,
- struct nfp_fl_payload *nfp_flow);
+ struct nfp_fl_payload *nfp_flow,
+ struct net_device *netdev);
 int nfp_modify_flow_metadata(struct nfp_app *app,
 struct nfp_fl_payload *nfp_flow);
 
 struct nfp_fl_payload *
-nfp_flower_search_fl_table(struct nfp_app *app, unsigned long 
tc_flower_cookie);
+nfp_flower_search_fl_table(struct nfp_app *app, unsigned long tc_flower_cookie,
+  struct net_device *netdev, __be32 host_ctx);
 struct nfp_fl_payload *
 nfp_flower_remove_fl_table(struct nfp_app *app, unsigned long 
tc_flower_cookie);
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/metadata.c 
b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
index db977cf8e933..21668aa435e8 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/metadata.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
@@ -99,14 +99,18 @@ static int nfp_get_stats_entry(struct nfp_app *app, u32 
*stats_context_id)
 
 /* Must be called with either RTNL or rcu_read_lock */
 struct nfp_fl_payload *
-nfp_flower_search_fl_table(struct nfp_app *app, unsigned long tc_flower_cookie)
+nfp_flower_search_fl_table(struct nfp_app *app, unsigned long tc_flower_cookie,
+  struct net_device *netdev, __be32 host_ctx)
 {
struct nfp_flower_priv *priv = app->priv;
struct nfp_fl_payload *flower_entry;
 
hash_for_each_possible_rcu(priv->flow_table, flower_entry, link,
   tc_flower_cookie)
-   if (flower_entry->tc_flower_cookie == tc_flower_cookie)
+   if (flower_entry->tc_flower_cookie == tc_flower_cookie &&
+   (!netdev || flower_entry->ingress_dev == netdev) &&
+   (host_ctx == NFP_FL_STATS_CTX_DONT_CARE ||
+flower_entry->meta.host_ctx_id == host_ctx))
return flower_entry;
 
return NULL;
@@ -121,13 +125,11 @@ nfp_flower_update_stats(struct nfp_app *app, struct 
nfp_fl_stats_frame *stats)
flower_cookie = be64_to_cpu(stats->stats_cookie);
 
rcu_read_lock();
-   nfp_flow = nfp_flower_search_fl_table(app, flower_cookie);
+   nfp_flow = nfp_flower_search_fl_table(app, flower_cookie, NULL,
+ stats->stats_con_id);
if (!nfp_flow)
goto exit_rcu_unlock;
 
-   if (nfp_flow->meta.host_ctx_id != stats->stats_con_id)
-   goto exit_rcu_unlock;
-
spin_lock(&nfp_flow->lock);
nfp_flow->stats.pkts += be32_to_cpu(stats->pkt_count);
nfp_flow->stats.bytes += be64_to_cpu(stats->byte_count);
@@ -317,7 +319,8 @@ nfp_check_mask_remove(struct nfp_app *app, char *mask_data, 
u32 mask_len,
 
 int nfp_compile_flow_metadata(struct nfp_app *app,
  struct tc_cls_flower_offload *flow,
-  

[PATCH net-next 2/4] nfp: print PCIe link bandwidth on probe

2018-04-24 Thread Jakub Kicinski
To aid debugging of performance issues caused by limited PCIe
bandwidth, print the PCIe link information on probe.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c
index cd678323bacb..a0e336bd1d85 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c
@@ -1330,6 +1330,7 @@ struct nfp_cpp *nfp_cpp_from_nfp6000_pcie(struct pci_dev 
*pdev)
/*  Finished with card initialization. */
dev_info(&pdev->dev,
 "Netronome Flow Processor NFP4000/NFP6000 PCIe Card Probe\n");
+   pcie_print_link_status(pdev);
 
nfp = kzalloc(sizeof(*nfp), GFP_KERNEL);
if (!nfp) {
-- 
2.16.2



[net-next:master 42/70] drivers/net/usb/lan78xx.c:1651:37-38: WARNING: Use ARRAY_SIZE

2018-04-24 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   c749fa181bd5848be78691d23168ec61ce691b95
commit: 496218656f9857d801512efdec1d609ebbe8a83b [42/70] lan78xx: Add support 
to dump lan78xx registers


coccinelle warnings: (new ones prefixed by >>)

>> drivers/net/usb/lan78xx.c:1651:37-38: WARNING: Use ARRAY_SIZE

vim +1651 drivers/net/usb/lan78xx.c

  1641  
  1642  static void
  1643  lan78xx_get_regs(struct net_device *netdev, struct ethtool_regs *regs,
  1644   void *buf)
  1645  {
  1646  u32 *data = buf;
  1647  int i, j;
  1648  struct lan78xx_net *dev = netdev_priv(netdev);
  1649  
  1650  /* Read Device/MAC registers */
> 1651  for (i = 0; i < (sizeof(lan78xx_regs) / sizeof(u32)); i++)
  1652  lan78xx_read_reg(dev, lan78xx_regs[i], &data[i]);
  1653  
  1654  if (!netdev->phydev)
  1655  return;
  1656  
  1657  /* Read PHY registers */
  1658  for (j = 0; j < 32; i++, j++)
  1659  data[i] = phy_read(netdev->phydev, j);
  1660  }
  1661  
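
If applied, the suggested cleanup would presumably look like the following
(assuming lan78xx_regs is a plain u32 array, which the existing sizeof
division implies):

	/* Read Device/MAC registers */
	for (i = 0; i < ARRAY_SIZE(lan78xx_regs); i++)
		lan78xx_read_reg(dev, lan78xx_regs[i], &data[i]);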

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


[PATCH bpf-next 3/4] nfp: bpf: tabularize generations of compare operations

2018-04-24 Thread Jakub Kicinski
There are quite a few compare instructions now, so use a table
to translate BPF instruction code to NFP instruction parameters
instead of parameterizing helpers.  This saves LOC and makes
future extensions easier.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c | 168 ++-
 1 file changed, 61 insertions(+), 107 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index a5590988fc69..5b8da7a67df4 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -1214,45 +1214,79 @@ wrp_test_reg(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta,
return 0;
 }
 
-static int
-wrp_cmp_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
-   enum br_mask br_mask, bool swap)
+static const struct jmp_code_map {
+   enum br_mask br_mask;
+   bool swap;
+} jmp_code_map[] = {
+   [BPF_JGT >> 4]  = { BR_BLO, true },
+   [BPF_JGE >> 4]  = { BR_BHS, false },
+   [BPF_JLT >> 4]  = { BR_BLO, false },
+   [BPF_JLE >> 4]  = { BR_BHS, true },
+   [BPF_JSGT >> 4] = { BR_BLT, true },
+   [BPF_JSGE >> 4] = { BR_BGE, false },
+   [BPF_JSLT >> 4] = { BR_BLT, false },
+   [BPF_JSLE >> 4] = { BR_BGE, true },
+};
+
+static const struct jmp_code_map *nfp_jmp_code_get(struct nfp_insn_meta *meta)
+{
+   unsigned int op;
+
+   op = BPF_OP(meta->insn.code) >> 4;
+   /* br_mask of 0 is BR_BEQ which we don't use in jump code table */
+   if (WARN_ONCE(op >= ARRAY_SIZE(jmp_code_map) ||
+ !jmp_code_map[op].br_mask,
+ "no code found for jump instruction"))
+   return NULL;
+
+   return &jmp_code_map[op];
+}
+
+static int cmp_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
const struct bpf_insn *insn = &meta->insn;
u64 imm = insn->imm; /* sign extend */
+   const struct jmp_code_map *code;
u8 reg = insn->dst_reg * 2;
swreg tmp_reg;
 
+   code = nfp_jmp_code_get(meta);
+   if (!code)
+   return -EINVAL;
+
tmp_reg = ur_load_imm_any(nfp_prog, imm & ~0U, imm_b(nfp_prog));
-   if (!swap)
+   if (!code->swap)
emit_alu(nfp_prog, reg_none(), reg_a(reg), ALU_OP_SUB, tmp_reg);
else
emit_alu(nfp_prog, reg_none(), tmp_reg, ALU_OP_SUB, reg_a(reg));
 
tmp_reg = ur_load_imm_any(nfp_prog, imm >> 32, imm_b(nfp_prog));
-   if (!swap)
+   if (!code->swap)
emit_alu(nfp_prog, reg_none(),
 reg_a(reg + 1), ALU_OP_SUB_C, tmp_reg);
else
emit_alu(nfp_prog, reg_none(),
 tmp_reg, ALU_OP_SUB_C, reg_a(reg + 1));
 
-   emit_br(nfp_prog, br_mask, insn->off, 0);
+   emit_br(nfp_prog, code->br_mask, insn->off, 0);
 
return 0;
 }
 
-static int
-wrp_cmp_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
-   enum br_mask br_mask, bool swap)
+static int cmp_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
const struct bpf_insn *insn = &meta->insn;
+   const struct jmp_code_map *code;
u8 areg, breg;
 
+   code = nfp_jmp_code_get(meta);
+   if (!code)
+   return -EINVAL;
+
areg = insn->dst_reg * 2;
breg = insn->src_reg * 2;
 
-   if (swap) {
+   if (code->swap) {
areg ^= breg;
breg ^= areg;
areg ^= breg;
@@ -1261,7 +1295,7 @@ wrp_cmp_reg(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta,
emit_alu(nfp_prog, reg_none(), reg_a(areg), ALU_OP_SUB, reg_b(breg));
emit_alu(nfp_prog, reg_none(),
 reg_a(areg + 1), ALU_OP_SUB_C, reg_b(breg + 1));
-   emit_br(nfp_prog, br_mask, insn->off, 0);
+   emit_br(nfp_prog, code->br_mask, insn->off, 0);
 
return 0;
 }
@@ -2283,46 +2317,6 @@ static int jeq_imm(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
return 0;
 }
 
-static int jgt_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
-{
-   return wrp_cmp_imm(nfp_prog, meta, BR_BLO, true);
-}
-
-static int jge_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
-{
-   return wrp_cmp_imm(nfp_prog, meta, BR_BHS, false);
-}
-
-static int jlt_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
-{
-   return wrp_cmp_imm(nfp_prog, meta, BR_BLO, false);
-}
-
-static int jle_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
-{
-   return wrp_cmp_imm(nfp_prog, meta, BR_BHS, true);
-}
-
-static int jsgt_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
-{
-   return wrp_cmp_imm(nfp_prog, meta, BR_BLT, true);
-}
-
-static int jsge_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
-{
-   return wrp_cmp_imm(nfp_prog, meta, BR_BGE, false);
-}
-
-static int jslt_imm(struct nfp_prog *nfp_prog, struct

[PATCH bpf-next 0/4] nfp: bpf: optimize negative sums

2018-04-24 Thread Jakub Kicinski
Hi!

This set adds an optimization run to the NFP JIT to turn ADD and SUB
instructions with a negative immediate into the opposite operation with
a positive immediate.  NFP can fit small immediates into the instructions
but it can't ever fit negative immediates.  Addition of small negative
immediates is quite common in BPF programs for stack address calculations,
therefore this optimization gives us non-negligible savings in instruction
count (up to 4%).
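
As a rough illustration of the rewrite (my sketch, not code from the patches),
the pass flips the operation and the sign of the immediate, so that e.g.
"r4 += -4" is emitted as "r4 -= 4":

    #include <stdint.h>

    enum op { OP_ADD, OP_SUB };

    /* Standalone sketch of the sign-flip rule used by the optimization. */
    static void negate_imm(enum op *op, int32_t *imm)
    {
            if (*imm >= 0 || *imm == INT32_MIN)
                    return;                         /* only (negatable) negative immediates */
            *op = (*op == OP_ADD) ? OP_SUB : OP_ADD;
            *imm = -*imm;                           /* -4 becomes 4 */
    }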


Jakub Kicinski (4):
  nfp: bpf: remove double space
  nfp: bpf: optimize add/sub of a negative constant
  nfp: bpf: tabularize generations of compare operations
  nfp: bpf: optimize comparisons to negative constants

 drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 231 +-
 drivers/net/ethernet/netronome/nfp/bpf/main.h |   6 +-
 2 files changed, 124 insertions(+), 113 deletions(-)

-- 
2.16.2



[PATCH bpf-next 1/4] nfp: bpf: remove double space

2018-04-24 Thread Jakub Kicinski
Whitespace cleanup - remove double space.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 29b4e5f8c102..9cc638718272 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -1400,7 +1400,7 @@ map_call_stack_common(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
if (!load_lm_ptr)
return 0;
 
-   emit_csr_wr(nfp_prog, stack_reg(nfp_prog),  NFP_CSR_ACT_LM_ADDR0);
+   emit_csr_wr(nfp_prog, stack_reg(nfp_prog), NFP_CSR_ACT_LM_ADDR0);
wrp_nops(nfp_prog, 3);
 
return 0;
-- 
2.16.2



[PATCH bpf-next 4/4] nfp: bpf: optimize comparisons to negative constants

2018-04-24 Thread Jakub Kicinski
A comparison instruction requires a subtraction.  If the constant
is negative we are more likely to fit it into an NFP instruction
directly if we change the sign and use addition.
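
A quick way to picture it (my example, not from the patch): with imm = -4 the
subtraction needs the full sign-extended constant in a register, while the
equivalent addition keeps the immediate small.

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t r1 = 1000;
            int32_t imm = -4;

            /* the value the subtraction would have to materialize */
            assert((uint32_t)imm == 0xfffffffc);
            /* subtracting a negative constant == adding its absolute value */
            assert(r1 - (uint64_t)(int64_t)imm == r1 + 4);
            return 0;
    }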

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 42 +++
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  6 +++-
 2 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 5b8da7a67df4..65f0791cae0c 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -1247,6 +1247,7 @@ static int cmp_imm(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
const struct bpf_insn *insn = &meta->insn;
u64 imm = insn->imm; /* sign extend */
const struct jmp_code_map *code;
+   enum alu_op alu_op, carry_op;
u8 reg = insn->dst_reg * 2;
swreg tmp_reg;
 
@@ -1254,19 +1255,22 @@ static int cmp_imm(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
if (!code)
return -EINVAL;
 
+   alu_op = meta->jump_neg_op ? ALU_OP_ADD : ALU_OP_SUB;
+   carry_op = meta->jump_neg_op ? ALU_OP_ADD_C : ALU_OP_SUB_C;
+
tmp_reg = ur_load_imm_any(nfp_prog, imm & ~0U, imm_b(nfp_prog));
if (!code->swap)
-   emit_alu(nfp_prog, reg_none(), reg_a(reg), ALU_OP_SUB, tmp_reg);
+   emit_alu(nfp_prog, reg_none(), reg_a(reg), alu_op, tmp_reg);
else
-   emit_alu(nfp_prog, reg_none(), tmp_reg, ALU_OP_SUB, reg_a(reg));
+   emit_alu(nfp_prog, reg_none(), tmp_reg, alu_op, reg_a(reg));
 
tmp_reg = ur_load_imm_any(nfp_prog, imm >> 32, imm_b(nfp_prog));
if (!code->swap)
emit_alu(nfp_prog, reg_none(),
-reg_a(reg + 1), ALU_OP_SUB_C, tmp_reg);
+reg_a(reg + 1), carry_op, tmp_reg);
else
emit_alu(nfp_prog, reg_none(),
-tmp_reg, ALU_OP_SUB_C, reg_a(reg + 1));
+tmp_reg, carry_op, reg_a(reg + 1));
 
emit_br(nfp_prog, code->br_mask, insn->off, 0);
 
@@ -2745,21 +2749,35 @@ static void nfp_bpf_opt_neg_add_sub(struct nfp_prog 
*nfp_prog)
continue;
 
if (BPF_CLASS(insn.code) != BPF_ALU &&
-   BPF_CLASS(insn.code) != BPF_ALU64)
+   BPF_CLASS(insn.code) != BPF_ALU64 &&
+   BPF_CLASS(insn.code) != BPF_JMP)
continue;
if (BPF_SRC(insn.code) != BPF_K)
continue;
if (insn.imm >= 0)
continue;
 
-   if (BPF_OP(insn.code) == BPF_ADD)
-   insn.code = BPF_CLASS(insn.code) | BPF_SUB;
-   else if (BPF_OP(insn.code) == BPF_SUB)
-   insn.code = BPF_CLASS(insn.code) | BPF_ADD;
-   else
-   continue;
+   if (BPF_CLASS(insn.code) == BPF_JMP) {
+   switch (BPF_OP(insn.code)) {
+   case BPF_JGE:
+   case BPF_JSGE:
+   case BPF_JLT:
+   case BPF_JSLT:
+   meta->jump_neg_op = true;
+   break;
+   default:
+   continue;
+   }
+   } else {
+   if (BPF_OP(insn.code) == BPF_ADD)
+   insn.code = BPF_CLASS(insn.code) | BPF_SUB;
+   else if (BPF_OP(insn.code) == BPF_SUB)
+   insn.code = BPF_CLASS(insn.code) | BPF_ADD;
+   else
+   continue;
 
-   meta->insn.code = insn.code | BPF_K;
+   meta->insn.code = insn.code | BPF_K;
+   }
 
meta->insn.imm = -insn.imm;
}
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 4981c8944ca3..68b5d326483d 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -236,6 +236,7 @@ struct nfp_bpf_reg_state {
  * @xadd_over_16bit: 16bit immediate is not guaranteed
  * @xadd_maybe_16bit: 16bit immediate is possible
  * @jmp_dst: destination info for jump instructions
+ * @jump_neg_op: jump instruction has inverted immediate, use ADD instead of 
SUB
  * @func_id: function id for call instructions
  * @arg1: arg1 for call instructions
  * @arg2: arg2 for call instructions
@@ -264,7 +265,10 @@ struct nfp_insn_meta {
bool xadd_maybe_16bit;
};
/* jump */
-   struct nfp_insn_meta *jmp_dst;
+   struct {
+   struct nfp_insn_meta *jmp_dst;
+ 

[PATCH bpf-next 2/4] nfp: bpf: optimize add/sub of a negative constant

2018-04-24 Thread Jakub Kicinski
The NFP instruction set can fit small immediates into the instruction.
Negative integers, however, will never fit because they have the
highest bit set.  If we swap the ALU op between ADD and SUB and
negate the constant we have a better chance of fitting small negative
integers into the instruction itself and saving one or two cycles.

immed[gprB_21, 0xfffffffc]
alu[gprA_4, gprA_4, +, gprB_21], gpr_wrboth
immed[gprB_21, 0xffffffff]
alu[gprA_5, gprA_5, +carry, gprB_21], gpr_wrboth

now becomes:

alu[gprA_4, gprA_4, -, 4], gpr_wrboth
alu[gprA_5, gprA_5, -carry, 0], gpr_wrboth

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c | 35 
 1 file changed, 35 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 9cc638718272..a5590988fc69 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -2777,6 +2777,40 @@ static void nfp_bpf_opt_reg_init(struct nfp_prog 
*nfp_prog)
}
 }
 
+/* abs(insn.imm) will fit better into unrestricted reg immediate -
+ * convert add/sub of a negative number into a sub/add of a positive one.
+ */
+static void nfp_bpf_opt_neg_add_sub(struct nfp_prog *nfp_prog)
+{
+   struct nfp_insn_meta *meta;
+
+   list_for_each_entry(meta, &nfp_prog->insns, l) {
+   struct bpf_insn insn = meta->insn;
+
+   if (meta->skip)
+   continue;
+
+   if (BPF_CLASS(insn.code) != BPF_ALU &&
+   BPF_CLASS(insn.code) != BPF_ALU64)
+   continue;
+   if (BPF_SRC(insn.code) != BPF_K)
+   continue;
+   if (insn.imm >= 0)
+   continue;
+
+   if (BPF_OP(insn.code) == BPF_ADD)
+   insn.code = BPF_CLASS(insn.code) | BPF_SUB;
+   else if (BPF_OP(insn.code) == BPF_SUB)
+   insn.code = BPF_CLASS(insn.code) | BPF_ADD;
+   else
+   continue;
+
+   meta->insn.code = insn.code | BPF_K;
+
+   meta->insn.imm = -insn.imm;
+   }
+}
+
 /* Remove masking after load since our load guarantees this is not needed */
 static void nfp_bpf_opt_ld_mask(struct nfp_prog *nfp_prog)
 {
@@ -3212,6 +3246,7 @@ static int nfp_bpf_optimize(struct nfp_prog *nfp_prog)
 {
nfp_bpf_opt_reg_init(nfp_prog);
 
+   nfp_bpf_opt_neg_add_sub(nfp_prog);
nfp_bpf_opt_ld_mask(nfp_prog);
nfp_bpf_opt_ld_shift(nfp_prog);
nfp_bpf_opt_ldst_gather(nfp_prog);
-- 
2.16.2



[RFC v3 3/5] virtio_ring: add packed ring support

2018-04-24 Thread Tiwei Bie
This commit introduces the basic support (without EVENT_IDX)
for packed ring.

Signed-off-by: Tiwei Bie 
---
 drivers/virtio/virtio_ring.c | 444 ++-
 1 file changed, 434 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index e164822ca66e..0181e93897be 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,7 +58,8 @@
 
 struct vring_desc_state {
void *data; /* Data for callback. */
-   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+   void *indir_desc;   /* Indirect descriptor, if any. */
+   int num;/* Descriptor list length. */
 };
 
 struct vring_virtqueue {
@@ -142,6 +143,16 @@ struct vring_virtqueue {
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+static inline bool virtqueue_use_indirect(struct virtqueue *_vq,
+ unsigned int total_sg)
+{
+   struct vring_virtqueue *vq = to_vvq(_vq);
+
+   /* If the host supports indirect descriptor tables, and we have multiple
+* buffers, then go indirect. FIXME: tune this threshold */
+   return (vq->indirect && total_sg > 1 && vq->vq.num_free);
+}
+
 /*
  * Modern virtio devices have feature bits to specify whether they need a
  * quirk and bypass the IOMMU. If not there, just use the DMA API.
@@ -327,9 +338,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 
head = vq->free_head;
 
-   /* If the host supports indirect descriptor tables, and we have multiple
-* buffers, then go indirect. FIXME: tune this threshold */
-   if (vq->indirect && total_sg > 1 && vq->vq.num_free)
+   if (virtqueue_use_indirect(_vq, total_sg))
desc = alloc_indirect_split(_vq, total_sg, gfp);
else {
desc = NULL;
@@ -741,6 +750,49 @@ static inline unsigned vring_size_packed(unsigned int num, 
unsigned long align)
& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
 }
 
+static void vring_unmap_one_packed(const struct vring_virtqueue *vq,
+  struct vring_packed_desc *desc)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
+{
+   struct vring_packed_desc *desc;
+
+   /*
+* We require lowmem mappings for the descriptors because
+* otherwise virt_to_phys will give us bogus addresses in the
+* virtqueue.
+*/
+   gfp &= ~__GFP_HIGHMEM;
+
+   desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
+
+   return desc;
+}
+
 static inline int virtqueue_add_packed(struct virtqueue *_vq,
   struct scatterlist *sgs[],
   unsigned int total_sg,
@@ -750,47 +802,419 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
   void *ctx,
   gfp_t gfp)
 {
+   struct vring_virtqueue *vq = to_vvq(_vq);
+   struct vring_packed_desc *desc;
+   struct scatterlist *sg;
+   unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
+   __virtio16 uninitialized_var(head_flags), flags;
+   int head, wrap_counter;
+   bool indirect;
+
+   START_USE(vq);
+
+   BUG_ON(data == NULL);
+   BUG_ON(ctx && vq->indirect);
+
+   if (unlikely(vq->broken)) {
+   END_USE(vq);
+   return -EIO;
+   }
+
+#ifdef DEBUG
+   {
+   ktime_t now = ktime_get();
+
+   /* No kick or get, with .1 second between?  Warn. */
+   if (vq->last_add_time_valid)
+   WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
+   > 100);
+   vq->last_add_time = now;
+   vq

[RFC v3 5/5] virtio_ring: enable packed ring

2018-04-24 Thread Tiwei Bie
Signed-off-by: Tiwei Bie 
---
 drivers/virtio/virtio_ring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index b1039c2985b9..9a3d13e1e2ba 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1873,6 +1873,8 @@ void vring_transport_features(struct virtio_device *vdev)
break;
case VIRTIO_F_IOMMU_PLATFORM:
break;
+   case VIRTIO_F_RING_PACKED:
+   break;
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
-- 
2.11.0



[RFC v3 2/5] virtio_ring: support creating packed ring

2018-04-24 Thread Tiwei Bie
This commit introduces the support for creating packed ring.
All split ring specific functions are given a _split suffix.
Some necessary stubs for packed ring are also added.

Signed-off-by: Tiwei Bie 
---
 drivers/virtio/virtio_ring.c | 764 ---
 include/linux/virtio_ring.h  |   8 +-
 2 files changed, 513 insertions(+), 259 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..e164822ca66e 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -64,8 +64,8 @@ struct vring_desc_state {
 struct vring_virtqueue {
struct virtqueue vq;
 
-   /* Actual memory layout for this queue */
-   struct vring vring;
+   /* Is this a packed ring? */
+   bool packed;
 
/* Can we use weak barriers? */
bool weak_barriers;
@@ -79,19 +79,45 @@ struct vring_virtqueue {
/* Host publishes avail event idx */
bool event;
 
-   /* Head of free buffer list. */
-   unsigned int free_head;
/* Number we've added since last sync. */
unsigned int num_added;
 
/* Last used index we've seen. */
u16 last_used_idx;
 
-   /* Last written value to avail->flags */
-   u16 avail_flags_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring vring;
 
-   /* Last written value to avail->idx in guest byte order */
-   u16 avail_idx_shadow;
+   /* Head of free buffer list. */
+   unsigned int free_head;
+
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order. */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring_packed vring_packed;
+
+   /* Driver ring wrap counter. */
+   u8 wrap_counter;
+
+   /* Index of the next avail descriptor. */
+   unsigned int next_avail_idx;
+
+   /* Last written value to driver->flags in
+* guest byte order. */
+   u16 event_flags_shadow;
+   };
+   };
 
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
@@ -201,8 +227,17 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
  cpu_addr, size, direction);
 }
 
-static void vring_unmap_one(const struct vring_virtqueue *vq,
-   struct vring_desc *desc)
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return 0;
+
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
 {
u16 flags;
 
@@ -226,17 +261,9 @@ static void vring_unmap_one(const struct vring_virtqueue 
*vq,
}
 }
 
-static int vring_mapping_error(const struct vring_virtqueue *vq,
-  dma_addr_t addr)
-{
-   if (!vring_use_dma_api(vq->vq.vdev))
-   return 0;
-
-   return dma_mapping_error(vring_dma_dev(vq), addr);
-}
-
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
 {
struct vring_desc *desc;
unsigned int i;
@@ -257,14 +284,14 @@ static struct vring_desc *alloc_indirect(struct virtqueue 
*_vq,
return desc;
 }
 
-static inline int virtqueue_add(struct virtqueue *_vq,
-   struct scatterlist *sgs[],
-   unsigned int total_sg,
-   unsigned int out_sgs,
-   unsigned int in_sgs,
-   void *data,
-   void *ctx,
-   gfp_t gfp)
+static inline int virtqueue_add_split(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ unsigned int in_sgs,
+   

[RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-04-24 Thread Tiwei Bie
This commit introduces the event idx support in packed
ring. This feature is temporarily disabled, because the
implementation in this patch may not work as expected,
and some further discussions on the implementation are
needed, e.g. do we have to check the wrap counter when
checking whether a kick is needed?
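
For reference, the event index written here packs the descriptor offset and
the wrap counter into a single 16-bit off_wrap field (bit 15 is the wrap
counter, as in the patch below); a simplified standalone sketch of the
encoding:

    #include <assert.h>
    #include <stdint.h>

    /* off_wrap layout: bits 0..14 = descriptor offset, bit 15 = wrap counter */
    static uint16_t pack_off_wrap(uint16_t off, unsigned int wrap)
    {
            return (uint16_t)(off | ((wrap & 1u) << 15));
    }

    int main(void)
    {
            uint16_t ow = pack_off_wrap(5, 1);

            assert((ow & 0x7fff) == 5);     /* offset used for vring_need_event() */
            assert((ow >> 15) == 1);        /* wrap counter */
            return 0;
    }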

Signed-off-by: Tiwei Bie 
---
 drivers/virtio/virtio_ring.c | 53 
 1 file changed, 49 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 0181e93897be..b1039c2985b9 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
 static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
-   u16 flags;
+   u16 new, old, off_wrap, flags;
bool needs_kick;
u32 snapshot;
 
@@ -995,7 +995,12 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue 
*_vq)
 * suppressions. */
virtio_mb(vq->weak_barriers);
 
+   old = vq->next_avail_idx - vq->num_added;
+   new = vq->next_avail_idx;
+   vq->num_added = 0;
+
snapshot = *(u32 *)vq->vring_packed.device;
+   off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0xffff);
flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
 
 #ifdef DEBUG
@@ -1006,7 +1011,10 @@ static bool virtqueue_kick_prepare_packed(struct 
virtqueue *_vq)
vq->last_add_time_valid = false;
 #endif
 
-   needs_kick = (flags != VRING_EVENT_F_DISABLE);
+   if (flags == VRING_EVENT_F_DESC)
+   needs_kick = vring_need_event(off_wrap & ~(1<<15), new, old);
+   else
+   needs_kick = (flags != VRING_EVENT_F_DISABLE);
END_USE(vq);
return needs_kick;
 }
@@ -1116,6 +1124,15 @@ static void *virtqueue_get_buf_ctx_packed(struct 
virtqueue *_vq,
if (vq->last_used_idx >= vq->vring_packed.num)
vq->last_used_idx -= vq->vring_packed.num;
 
+   /* If we expect an interrupt for the next entry, tell host
+* by writing event index and flush out the write before
+* the read in the next get_buf call. */
+   if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
+   virtio_store_mb(vq->weak_barriers,
+   &vq->vring_packed.driver->off_wrap,
+   cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
+   (vq->wrap_counter << 15)));
+
 #ifdef DEBUG
vq->last_add_time_valid = false;
 #endif
@@ -1143,10 +1160,17 @@ static unsigned 
virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
 
/* We optimistically turn back on interrupts, then check if there was
 * more to do. */
+   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+* either clear the flags bit or point the event index at the next
+* entry. Always update the event index to keep code simple. */
+
+   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+   vq->last_used_idx | (vq->wrap_counter << 15));
 
if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
virtio_wmb(vq->weak_barriers);
-   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+   vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
+VRING_EVENT_F_ENABLE;
vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
vq->event_flags_shadow);
}
@@ -1172,15 +1196,34 @@ static bool virtqueue_poll_packed(struct virtqueue 
*_vq, unsigned last_used_idx)
 static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
+   u16 bufs, used_idx, wrap_counter;
 
START_USE(vq);
 
/* We optimistically turn back on interrupts, then check if there was
 * more to do. */
+   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+* either clear the flags bit or point the event index at the next
+* entry. Always update the event index to keep code simple. */
+
+   /* TODO: tune this threshold */
+   bufs = (u16)(vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
+
+   used_idx = vq->last_used_idx + bufs;
+   wrap_counter = vq->wrap_counter;
+
+   if (used_idx >= vq->vring_packed.num) {
+   used_idx -= vq->vring_packed.num;
+   wrap_counter ^= 1;
+   }
+
+   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+   used_idx | (wrap_counter << 15));
 
if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
virtio_wmb(vq->weak_barriers);
-   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;

[RFC v3 0/5] virtio: support packed ring

2018-04-24 Thread Tiwei Bie
Hello everyone,

This RFC implements packed ring support in virtio driver.

Some simple functional tests have been done with Jason's
packed ring implementation in vhost:

https://lkml.org/lkml/2018/4/23/12

Both ping and netperf worked as expected (with EVENT_IDX
disabled). But there are the following known issues:

1. Reloading the guest driver will break the Tx/Rx;
2. Zeroing the flags when detaching a used desc will
   break the guest -> host path.

Some simple functional tests have also been done with
Wei's packed ring implementation in QEMU:

http://lists.nongnu.org/archive/html/qemu-devel/2018-04/msg00342.html

Both ping and netperf worked as expected (with EVENT_IDX
disabled). Reloading the guest driver also worked as expected.

TODO:
- Refinements (for code and commit log) and bug fixes;
- Discuss/fix/test EVENT_IDX support;
- Test devices other than net;

RFC v2 -> RFC v3:
- Split into small patches (Jason);
- Add helper virtqueue_use_indirect() (Jason);
- Just set id for the last descriptor of a list (Jason);
- Calculate the prev in virtqueue_add_packed() (Jason);
- Fix/improve desc suppression code (Jason/MST);
- Refine the code layout for XXX_split/packed and wrappers (MST);
- Fix the comments and API in uapi (MST);
- Remove the BUG_ON() for indirect (Jason);
- Some other refinements and bug fixes;

RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression support - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;

Thanks!

Tiwei Bie (5):
  virtio: add packed ring definitions
  virtio_ring: support creating packed ring
  virtio_ring: add packed ring support
  virtio_ring: add event idx support in packed ring
  virtio_ring: enable packed ring

 drivers/virtio/virtio_ring.c   | 1271 
 include/linux/virtio_ring.h|8 +-
 include/uapi/linux/virtio_config.h |   12 +-
 include/uapi/linux/virtio_ring.h   |   36 +
 4 files changed, 1049 insertions(+), 278 deletions(-)

-- 
2.11.0



[RFC v3 1/5] virtio: add packed ring definitions

2018-04-24 Thread Tiwei Bie
Signed-off-by: Tiwei Bie 
---
 include/uapi/linux/virtio_config.h | 12 +++-
 include/uapi/linux/virtio_ring.h   | 36 
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index 308e2096291f..a6e392325e3a 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START   28
-#define VIRTIO_TRANSPORT_F_END 34
+#define VIRTIO_TRANSPORT_F_END 36
 
 #ifndef VIRTIO_CONFIG_NO_LEGACY
 /* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,14 @@
  * this is for compatibility with legacy systems.
  */
 #define VIRTIO_F_IOMMU_PLATFORM33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED   34
+
+/*
+ * This feature indicates that all buffers are used by the device
+ * in the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER  35
+
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..3932cb80c347 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT  4
 
+#define VRING_DESC_F_AVAIL(b)  ((b) << 7)
+#define VRING_DESC_F_USED(b)   ((b) << 15)
+
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
  * will still kick if it's out of buffers. */
@@ -53,6 +56,10 @@
  * optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT 1
 
+#define VRING_EVENT_F_ENABLE   0x0
+#define VRING_EVENT_F_DISABLE  0x1
+#define VRING_EVENT_F_DESC 0x2
+
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC28
 
@@ -171,4 +178,33 @@ static inline int vring_need_event(__u16 event_idx, __u16 
new_idx, __u16 old)
return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
 }
 
+struct vring_packed_desc_event {
+   /* __virtio16 off  : 15; // Descriptor Event Offset
+* __virtio16 wrap : 1;  // Descriptor Event Wrap Counter */
+   __virtio16 off_wrap;
+   /* __virtio16 flags : 2; // Descriptor Event Flags */
+   __virtio16 flags;
+};
+
+struct vring_packed_desc {
+   /* Buffer Address. */
+   __virtio64 addr;
+   /* Buffer Length. */
+   __virtio32 len;
+   /* Buffer ID. */
+   __virtio16 id;
+   /* The flags depending on descriptor type. */
+   __virtio16 flags;
+};
+
+struct vring_packed {
+   unsigned int num;
+
+   struct vring_packed_desc *desc;
+
+   struct vring_packed_desc_event *driver;
+
+   struct vring_packed_desc_event *device;
+};
+
 #endif /* _UAPI_LINUX_VIRTIO_RING_H */
-- 
2.11.0



KASAN: stack-out-of-bounds Write in compat_copy_entries

2018-04-24 Thread syzbot

Hello,

syzbot hit the following crash on upstream commit
24cac7009cb1b211f1c793ecb6a462c03dc35818 (Tue Apr 24 21:16:40 2018 +)
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=4e42a04e0bc33cb6c087


So far this crash happened 3 times on upstream.
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=4827027970457600
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=6212733133389824
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=7043958930931867332

compiler: gcc (GCC) 8.0.1 20180413 (experimental)
user-space arch: i386

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+4e42a04e0bc33cb6c...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.

If you forward the report, please keep this part and the footer.

random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
IPVS: ftp: loaded support on port[0] = 21
==
BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300  
[inline]
BUG: KASAN: stack-out-of-bounds in compat_mtw_from_user  
net/bridge/netfilter/ebtables.c:1957 [inline]
BUG: KASAN: stack-out-of-bounds in ebt_size_mwt  
net/bridge/netfilter/ebtables.c:2059 [inline]
BUG: KASAN: stack-out-of-bounds in size_entry_mwt  
net/bridge/netfilter/ebtables.c:2155 [inline]
BUG: KASAN: stack-out-of-bounds in compat_copy_entries+0x96c/0x14a0  
net/bridge/netfilter/ebtables.c:2194

Write of size 33 at addr 8801b0abf888 by task syz-executor0/4504

CPU: 0 PID: 4504 Comm: syz-executor0 Not tainted 4.17.0-rc2+ #40
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 memcpy+0x37/0x50 mm/kasan/kasan.c:303
 strlcpy include/linux/string.h:300 [inline]
 compat_mtw_from_user net/bridge/netfilter/ebtables.c:1957 [inline]
 ebt_size_mwt net/bridge/netfilter/ebtables.c:2059 [inline]
 size_entry_mwt net/bridge/netfilter/ebtables.c:2155 [inline]
 compat_copy_entries+0x96c/0x14a0 net/bridge/netfilter/ebtables.c:2194
 compat_do_replace+0x483/0x900 net/bridge/netfilter/ebtables.c:2285
 compat_do_ebt_set_ctl+0x2ac/0x324 net/bridge/netfilter/ebtables.c:2367
 compat_nf_sockopt net/netfilter/nf_sockopt.c:144 [inline]
 compat_nf_setsockopt+0x9b/0x140 net/netfilter/nf_sockopt.c:156
 compat_ip_setsockopt+0xff/0x140 net/ipv4/ip_sockglue.c:1279
 inet_csk_compat_setsockopt+0x97/0x120 net/ipv4/inet_connection_sock.c:1041
 compat_tcp_setsockopt+0x49/0x80 net/ipv4/tcp.c:2901
 compat_sock_common_setsockopt+0xb4/0x150 net/core/sock.c:3050
 __compat_sys_setsockopt+0x1ab/0x7c0 net/compat.c:403
 __do_compat_sys_setsockopt net/compat.c:416 [inline]
 __se_compat_sys_setsockopt net/compat.c:413 [inline]
 __ia32_compat_sys_setsockopt+0xbd/0x150 net/compat.c:413
 do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
 do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
 entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fb3cb9
RSP: 002b:fff0c26c EFLAGS: 0282 ORIG_RAX: 016e
RAX: ffda RBX: 0003 RCX: 
RDX: 0080 RSI: 2300 RDI: 05f4
RBP:  R08:  R09: 
R10:  R11:  R12: 
R13:  R14:  R15: 

The buggy address belongs to the page:
page:ea0006c2afc0 count:0 mapcount:0 mapping: index:0x0
flags: 0x2fffc00()
raw: 02fffc00   
raw:  ea0006c20101  
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 8801b0abf780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 8801b0abf800: 00 00 00 00 00 f1 f1 f1 f1 00 00 f2 f2 f2 f2 f2

8801b0abf880: f2 00 00 00 07 f3 f3 f3 f3 00 00 00 00 00 00 00

   ^
 8801b0abf900: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00
 8801b0abf980: 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00
==


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@googlegroups.com.

syzbot will keep tra

RE: [PATCH net-next] lan78xx: Lan7801 Support for Fixed PHY

2018-04-24 Thread RaghuramChary.Jallipalli
Hi Andrew,

> > +#define DRIVER_VERSION "1.0.7"
> 
> Hi Raghuram
> 
> Driver version strings are pretty pointless. You might want to remove it.
>
OK, will remove it.
 
> > +   netdev_info(dev->net, "Registered FIXED PHY\n");
> 
> There are too many netdev_info() messages here. Maybe make them both
> netdev_dbg().
>

OK. Will modify to netdev_dbg()
 
> > +   dev->interface = PHY_INTERFACE_MODE_RGMII;
> > +   dev->fixedphy = phydev;
> 
> You can use
> 
> if (!phy_is_pseudo_fixed_link(phydev))
> 
> to determine if a PHY is a fixed phy. I think you can then do without
> dev->fixedphy.
> 
dev->fixedphy stores the fixed phydev, which will be passed to the
fixed_phy_unregister routine, so I think the phy_is_pseudo_fixed_link check is
not necessary.

> > +   ret = lan78xx_write_reg(dev, HW_CFG, buf);
> > +   goto phyinit;
> 
> Please don't use a goto like this. Maybe turn this into a switch statement?
>
OK. Will modify.
 
> > static int lan78xx_phy_init(struct lan78xx_net *dev)
> > goto error;
> 
> Please take a look at what happens at error: It does not look correct.
> Probably now is a good time to refactor the whole of lan78xx_phy_init()
> 

OK. Will take care.

Thanks,
-Raghu


[PATCH net-next 2/2] selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE

2018-04-24 Thread Eric Dumazet
After the prior kernel change, mmap() on a TCP socket only reserves VMA space.

We have to use setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...)
to perform the transfer of pages from skbs in the TCP receive queue into such a VMA.

struct tcp_zerocopy_receive {
__u64 address;  /* in: address of mapping */
__u32 length;   /* in/out: number of bytes to map/mapped */
__u32 recv_skip_hint;   /* out: amount of bytes to skip */
};

After a successful setsockopt(...TCP_ZEROCOPY_RECEIVE...), @length contains the
number of bytes that were mapped, and @recv_skip_hint contains the number of bytes
that should be read using conventional read()/recv()/recvmsg() system calls,
to skip a sequence of bytes that can not be mapped because they are not properly
page aligned.
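
A trimmed usage sketch of that contract (illustrative only; consume() is a
placeholder for whatever processes the mapped bytes, and error handling is
omitted):

	struct tcp_zerocopy_receive zc = {
		.address = (__u64)(unsigned long)addr,	/* VMA from mmap() on the socket */
		.length  = chunk_size,
	};

	if (setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, sizeof(zc)) == 0) {
		consume(addr, zc.length);		/* bytes now mapped into the VMA */
		if (zc.recv_skip_hint)
			read(fd, buf, zc.recv_skip_hint);	/* unmappable remainder */
	}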

Signed-off-by: Eric Dumazet 
Cc: Andy Lutomirski 
Cc: Soheil Hassas Yeganeh 
---
 tools/testing/selftests/net/tcp_mmap.c | 63 +++---
 1 file changed, 36 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/net/tcp_mmap.c 
b/tools/testing/selftests/net/tcp_mmap.c
index 
dea342fe6f4e88b5709d2ac37b2fc9a2a320bf44..5b381cdbdd6319556ba4e3dad530fae8f13f5a9b
 100644
--- a/tools/testing/selftests/net/tcp_mmap.c
+++ b/tools/testing/selftests/net/tcp_mmap.c
@@ -76,9 +76,10 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
+#include 
+#include 
 
 #ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
@@ -134,11 +135,12 @@ void hash_zone(void *zone, unsigned int length)
 void *child_thread(void *arg)
 {
unsigned long total_mmap = 0, total = 0;
+   struct tcp_zerocopy_receive zc;
unsigned long delta_usec;
int flags = MAP_SHARED;
struct timeval t0, t1;
char *buffer = NULL;
-   void *oaddr = NULL;
+   void *addr = NULL;
double throughput;
struct rusage ru;
int lu, fd;
@@ -153,41 +155,45 @@ void *child_thread(void *arg)
perror("malloc");
goto error;
}
+   if (zflg) {
+   addr = mmap(NULL, chunk_size, PROT_READ, flags, fd, 0);
+   if (addr == (void *)-1)
+   zflg = 0;
+   }
while (1) {
struct pollfd pfd = { .fd = fd, .events = POLLIN, };
int sub;
 
poll(&pfd, 1, 1);
if (zflg) {
-   void *naddr;
+   int res;
 
-   naddr = mmap(oaddr, chunk_size, PROT_READ, flags, fd, 
0);
-   if (naddr == (void *)-1) {
-   if (errno == EAGAIN) {
-   /* That is if SO_RCVLOWAT is buggy */
-   usleep(1000);
-   continue;
-   }
-   if (errno == EINVAL) {
-   flags = MAP_SHARED;
-   oaddr = NULL;
-   goto fallback;
-   }
-   if (errno != EIO)
-   perror("mmap()");
+   zc.address = (__u64)addr;
+   zc.length = chunk_size;
+   zc.recv_skip_hint = 0;
+   res = setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
+&zc, sizeof(zc));
+   if (res == -1)
break;
+
+   if (zc.length) {
+   assert(zc.length <= chunk_size);
+   total_mmap += zc.length;
+   if (xflg)
+   hash_zone(addr, zc.length);
+   total += zc.length;
}
-   total_mmap += chunk_size;
-   if (xflg)
-   hash_zone(naddr, chunk_size);
-   total += chunk_size;
-   if (!keepflag) {
-   flags |= MAP_FIXED;
-   oaddr = naddr;
+   if (zc.recv_skip_hint) {
+   assert(zc.recv_skip_hint <= chunk_size);
+   lu = read(fd, buffer, zc.recv_skip_hint);
+   if (lu > 0) {
+   if (xflg)
+   hash_zone(buffer, lu);
+   total += lu;
+   }
}
continue;
}
-fallback:
sub = 0;
while (sub < chunk_size) {
lu = read(fd, buffer + sub, chunk_size - sub);
@@ -228,6 +234,8 @@ void *child_thread(void *arg)
 error:
free(bu

[PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive

2018-04-24 Thread Eric Dumazet
When adding the tcp mmap() implementation, I forgot that the socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.

Since we can not lock the socket in the tcp mmap() handler we have to
split the operation in two phases.

1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
  This operation does not involve any TCP locking.

2) setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
 the transfer of pages from skbs to one VMA.
  This operation only uses down_read(&current->mm->mmap_sem) after
  holding the TCP lock, thus solving the lockdep issue.
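
From the application's point of view the two phases look roughly like this
(a minimal sketch of the API described above; error handling omitted):

	/* phase 1: reserve the VMA, no TCP locking involved */
	void *addr = mmap(NULL, chunk_size, PROT_READ, MAP_SHARED, fd, 0);

	/* phase 2: map received pages from the receive queue into that VMA */
	struct tcp_zerocopy_receive zc = {
		.address = (__u64)(unsigned long)addr,
		.length  = chunk_size,
	};
	setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, sizeof(zc));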

This new implementation was suggested by Andy Lutomirski in great detail.

Benefits are :

- Better scalability, in case multiple threads reuse VMAs
    (without mmap()/munmap() calls) since mmap_sem won't be write locked.

- Better error recovery.
   The previous mmap() model had to provide the expected size of the
   mapping. If for some reason one part could not be mapped (partial MSS),
   the whole operation had to be aborted.
   With the tcp_zerocopy_receive struct, kernel can report how
    many bytes were successfully mapped, and how many bytes should
   be read to skip the problematic sequence.

- No more memory allocation to hold an array of page pointers.
  16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/

- skbs are freed while mmap_sem has been released

The following patch changes the tcp_mmap tool to demonstrate
one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...)

Note that memcg might require additional changes.

Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Suggested-by: Andy Lutomirski 
Cc: linux...@kvack.org
Cc: Soheil Hassas Yeganeh 
---
 include/uapi/linux/tcp.h |   8 ++
 net/ipv4/tcp.c   | 186 ---
 2 files changed, 103 insertions(+), 91 deletions(-)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 
379b08700a542d49bbce9b4b49b17879d00b69bb..e9e8373b34b9ddc735329341b91f455bf5c0b17c
 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,7 @@ enum {
 #define TCP_MD5SIG_EXT 32  /* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY   33  /* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE 34  /* Enable TFO without a TFO cookie */
+#define TCP_ZEROCOPY_RECEIVE   35
 
 struct tcp_repair_opt {
__u32   opt_code;
@@ -276,4 +277,11 @@ struct tcp_diag_md5sig {
__u8tcpm_key[TCP_MD5SIG_MAXKEYLEN];
 };
 
+/* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */
+
+struct tcp_zerocopy_receive {
+   __u64 address;  /* in: address of mapping */
+   __u32 length;   /* in/out: number of bytes to map/mapped */
+   __u32 recv_skip_hint;   /* out: amount of bytes to skip */
+};
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 
dfd090ea54ad47112fc23c61180b5bf8edd2c736..a28eca97df9465a0aa522a1833a171b53c237b1f
 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1726,118 +1726,108 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(tcp_set_rcvlowat);
 
-/* When user wants to mmap X pages, we first need to perform the mapping
- * before freeing any skbs in receive queue, otherwise user would be unable
- * to fallback to standard recvmsg(). This happens if some data in the
- * requested block is not exactly fitting in a page.
- *
- * We only support order-0 pages for the moment.
- * mmap() on TCP is very strict, there is no point
- * trying to accommodate with pathological layouts.
- */
+static const struct vm_operations_struct tcp_vm_ops = {
+};
+
 int tcp_mmap(struct file *file, struct socket *sock,
 struct vm_area_struct *vma)
 {
-   unsigned long size = vma->vm_end - vma->vm_start;
-   unsigned int nr_pages = size >> PAGE_SHIFT;
-   struct page **pages_array = NULL;
-   u32 seq, len, offset, nr = 0;
-   struct sock *sk = sock->sk;
-   const skb_frag_t *frags;
+   if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+   return -EPERM;
+   vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+
+   /* Instruct vm_insert_page() to not down_read(mmap_sem) */
+   vma->vm_flags |= VM_MIXEDMAP;
+
+   vma->vm_ops = &tcp_vm_ops;
+   return 0;
+}
+EXPORT_SYMBOL(tcp_mmap);
+
+static int tcp_zerocopy_receive(struct sock *sk,
+   struct tcp_zerocopy_receive *zc)
+{
+   unsigned long address = (unsigned long)zc->address;
+   const skb_frag_t *frags = NULL;
+   u32 length = 0, seq, offset;
+   struct vm_area_struct *vma;
+   struct sk_buff *skb = NULL;
struct tcp_sock *tp;
-   struct sk_buff *skb;
int ret;
 
-   if (vma->vm_pgoff || !nr_pages)
+   if (address & (PAGE_SIZE - 1) || address != zc->address)
return -EINVAL;
 
-

[PATCH net-next 0/2] tcp: mmap: rework zerocopy receive

2018-04-24 Thread Eric Dumazet
syzbot reported a lockdep issue caused by tcp mmap() support.

I implemented Andy Lutomirski's nice suggestions to resolve the
issue and increase scalability as well.

First patch is adding a new setsockopt() operation and changes mmap()
behavior.

Second patch changes tcp_mmap reference program.

Eric Dumazet (2):
  tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
  selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE

 include/uapi/linux/tcp.h   |   8 ++
 net/ipv4/tcp.c | 186 +
 tools/testing/selftests/net/tcp_mmap.c |  63 +
 3 files changed, 139 insertions(+), 118 deletions(-)

-- 
2.17.0.484.g0c8726318c-goog



[RFC bpf] bpf, x64: fix JIT emission for dead code

2018-04-24 Thread Gianluca Borello
Commit 2a5418a13fcf ("bpf: improve dead code sanitizing") replaced dead
code with a series of ja-1 instructions, for safety. That made JIT
compilation much more complex for some BPF programs. One instance of such
programs is, for example:

bool flag = false
...
/* A bunch of other code */
...
if (flag)
do_something()

In some cases llvm is not able to remove at compile time the code for
do_something(), so the generated BPF program ends up with a large amount
of dead instructions. In one specific real life example, there are two
series of ~500 and ~1000 dead instructions in the program. When the
verifier replaces them with a series of ja-1 instructions, it causes an
interesting behavior at JIT time.

During the first pass, since all the instructions are estimated at 64
bytes, the ja-1 instructions end up being translated as 5-byte JMP
instructions (0xE9), since the jump offsets become increasingly large (>
127) as each instruction gets discovered to be 5 bytes instead of the
estimated 64.

Starting from the second pass, the first N instructions of the ja-1
sequence get translated into 2-byte JMPs (0xEB) because the jump offsets
become <= 127 this time. In particular, N is defined as roughly 127 / (5
- 2) ~= 42. So, each further pass will make the subsequent N JMP
instructions shrink from 5 to 2 bytes, making the image shrink every time.
This means that in order to have the entire program converge, there need
to be, in the real example above, at least ~1000 / 42 ~= 24 passes just
for translating the dead code. If we add this number to the passes needed
to translate the other non-dead code, it brings such a program to 40+
passes, and the JIT doesn't complete. Ultimately the userspace loader fails
because such a BPF program was supposed to be part of a prog array whose
owner is being JITed.

While it is certainly possible to try to refactor such programs to help
the compiler remove dead code, the behavior is not really intuitive and it
puts further burden on the BPF developer who is not expecting such
behavior. To make things worse, such programs are working just fine in all
the kernel releases prior to the ja-1 fix.

A possible approach to mitigate this behavior consists in noticing that
for ja-1 instructions we don't really need to rely on the estimated size
of the previous and current instructions; we know that a -1 BPF jump
offset can be safely translated into a 0xEB instruction with a jump offset
of -2.

Such fix brings the BPF program in the previous example to complete again
in ~9 passes.

Fixes: 2a5418a13fcf ("bpf: improve dead code sanitizing")
Signed-off-by: Gianluca Borello 
---
Hi

Posting this as RFC since I just wanted to report this potential bug and
propose a possible fix, although I'm not sure if:

1) There might be a better fix on the JIT side
2) We might want to replace the ja-1 instructions with something else
3) We might want to ignore this issue

If we choose option 3, I'd just like to point out that this caused a real
regression on a couple BPF programs that are part of a larger collection
of programs that used to work fine on 4.15 and I recently found broken 
in 4.16, so there would be some value in somehow addressing this.

Thanks

 arch/x86/net/bpf_jit_comp.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index b725154182cc..abce27ceb411 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1027,7 +1027,17 @@ xadd:if (is_imm8(insn->off))
break;
 
case BPF_JMP | BPF_JA:
-   jmp_offset = addrs[i + insn->off] - addrs[i];
+   if (insn->off == -1)
+   /* -1 jmp instructions will always jump
+* backwards two bytes. Explicitly handling
+* this case avoids wasting too many passes
+* when there are long sequences of replaced
+* dead code.
+*/
+   jmp_offset = -2;
+   else
+   jmp_offset = addrs[i + insn->off] - addrs[i];
+
if (!jmp_offset)
/* optimize out nop jumps */
break;
-- 
2.17.0



Re: [PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive

2018-04-24 Thread Christoph Hellwig
On Tue, Apr 24, 2018 at 10:27:21PM -0700, Eric Dumazet wrote:
> When adding the tcp mmap() implementation, I forgot that the socket lock
> had to be taken before current->mm->mmap_sem. syzbot eventually caught
> the bug.
> 
> Since we can not lock the socket in the tcp mmap() handler we have to
> split the operation in two phases.
> 
> 1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
>   This operation does not involve any TCP locking.
> 
> 2) setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
>  the transfer of pages from skbs to one VMA.
>   This operation only uses down_read(&current->mm->mmap_sem) after
>   holding the TCP lock, thus solving the lockdep issue.
> 
> This new implementation was suggested by Andy Lutomirski with great details.

Thanks, this looks much more sensible to me.


Re: [PATCH net-next 3/4] nfp: flower: support offloading multiple rules with same cookie

2018-04-24 Thread Or Gerlitz
On Wed, Apr 25, 2018 at 7:17 AM, Jakub Kicinski
 wrote:
> From: John Hurley 
>
> When multiple netdevs are attached to a tc offload block and register for
> callbacks, a rule added to the block will be propagated to all netdevs.
> Previously these were detected as duplicates (based on cookie) and
> rejected. Modify the nfp rule lookup function to optionally include an
> ingress netdev and a host context along with the cookie value when
> searching for a rule. When a new rule is passed to the driver, the netdev
> the rule is to be attached to is considered when searching for duplicates.

so if the same rule (cookie) is provided to the driver through multiple ingress
devices you will not reject it -- what is the use case for that, is it
block sharing?
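
For context, the kind of extended lookup the quoted changelog describes
might look roughly like this (names and structures here are illustrative
assumptions, not the actual nfp driver code): match primarily on the TC
cookie, and only narrow by ingress netdev or host context when the caller
supplies them.

#include <linux/netdevice.h>
#include <linux/types.h>

/* Hypothetical per-rule bookkeeping entry. */
struct offload_rule_sketch {
	unsigned long tc_cookie;
	struct net_device *ingress_dev;
	u32 host_ctx;
	struct offload_rule_sketch *next;
};

static struct offload_rule_sketch *
rule_lookup_sketch(struct offload_rule_sketch *head, unsigned long cookie,
		   struct net_device *netdev, const u32 *host_ctx)
{
	struct offload_rule_sketch *r;

	for (r = head; r; r = r->next) {
		if (r->tc_cookie != cookie)
			continue;
		/* Optional filters: only applied when provided by the caller. */
		if (netdev && r->ingress_dev != netdev)
			continue;
		if (host_ctx && r->host_ctx != *host_ctx)
			continue;
		return r;
	}
	return NULL;
}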


[PATCH v1 net-next] lan78xx: Lan7801 Support for Fixed PHY

2018-04-24 Thread Raghuram Chary J
Adding Fixed PHY support to the lan78xx driver.

Signed-off-by: Raghuram Chary J 
---
v0->v1:
   * Remove driver version #define
   * Modify netdev_info to netdev_dbg
   * Move lan7801 specific to new routine and add switch case
   * Minor cleanup
---
 drivers/net/usb/Kconfig   |   1 +
 drivers/net/usb/lan78xx.c | 113 --
 2 files changed, 79 insertions(+), 35 deletions(-)

diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
index f28bd74ac275..418b0904cecb 100644
--- a/drivers/net/usb/Kconfig
+++ b/drivers/net/usb/Kconfig
@@ -111,6 +111,7 @@ config USB_LAN78XX
select MII
select PHYLIB
select MICROCHIP_PHY
+   select FIXED_PHY
help
  This option adds support for Microchip LAN78XX based USB 2
  & USB 3 10/100/1000 Ethernet adapters.
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 0867f7275852..47fa34a2d09f 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -36,13 +36,12 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include "lan78xx.h"
 
 #define DRIVER_AUTHOR  "WOOJUNG HUH "
 #define DRIVER_DESC"LAN78XX USB 3.0 Gigabit Ethernet Devices"
 #define DRIVER_NAME"lan78xx"
-#define DRIVER_VERSION "1.0.6"
 
 #define TX_TIMEOUT_JIFFIES (5 * HZ)
 #define THROTTLE_JIFFIES   (HZ / 8)
@@ -402,6 +401,7 @@ struct lan78xx_net {
struct statstagestats;
 
struct irq_domain_data  domain_data;
+   struct phy_device   *fixedphy;
 };
 
 /* define external phy id */
@@ -1477,7 +1477,6 @@ static void lan78xx_get_drvinfo(struct net_device *net,
struct lan78xx_net *dev = netdev_priv(net);
 
strncpy(info->driver, DRIVER_NAME, sizeof(info->driver));
-   strncpy(info->version, DRIVER_VERSION, sizeof(info->version));
usb_make_path(dev->udev, info->bus_info, sizeof(info->bus_info));
 }
 
@@ -2003,52 +2002,91 @@ static int ksz9031rnx_fixup(struct phy_device *phydev)
return 1;
 }
 
-static int lan78xx_phy_init(struct lan78xx_net *dev)
+static int lan7801_phy_init(struct phy_device **phydev,
+   struct lan78xx_net *dev)
 {
+   u32 buf;
int ret;
-   u32 mii_adv;
-   struct phy_device *phydev;
-
-   phydev = phy_find_first(dev->mdiobus);
-   if (!phydev) {
-   netdev_err(dev->net, "no PHY found\n");
-   return -EIO;
-   }
-
-   if ((dev->chipid == ID_REV_CHIP_ID_7800_) ||
-   (dev->chipid == ID_REV_CHIP_ID_7850_)) {
-   phydev->is_internal = true;
-   dev->interface = PHY_INTERFACE_MODE_GMII;
-
-   } else if (dev->chipid == ID_REV_CHIP_ID_7801_) {
-   if (!phydev->drv) {
+   struct fixed_phy_status fphy_status = {
+   .link = 1,
+   .speed = SPEED_1000,
+   .duplex = DUPLEX_FULL,
+   };
+
+   if (!*phydev) {
+   netdev_dbg(dev->net, "PHY Not Found!! Registering Fixed PHY\n");
+   *phydev = fixed_phy_register(PHY_POLL, &fphy_status, -1,
+NULL);
+   if (IS_ERR(*phydev)) {
+   netdev_err(dev->net, "No PHY/fixed_PHY found\n");
+   return -ENODEV;
+   }
+   netdev_dbg(dev->net, "Registered FIXED PHY\n");
+   dev->interface = PHY_INTERFACE_MODE_RGMII;
+   dev->fixedphy = *phydev;
+   ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
+   MAC_RGMII_ID_TXC_DELAY_EN_);
+   ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
+   ret = lan78xx_read_reg(dev, HW_CFG, &buf);
+   buf |= HW_CFG_CLK125_EN_;
+   buf |= HW_CFG_REFCLK25_EN_;
+   ret = lan78xx_write_reg(dev, HW_CFG, buf);
+   } else {
+   if (!(*phydev)->drv) {
netdev_err(dev->net, "no PHY driver found\n");
return -EIO;
}
-
dev->interface = PHY_INTERFACE_MODE_RGMII;
-
/* external PHY fixup for KSZ9031RNX */
ret = phy_register_fixup_for_uid(PHY_KSZ9031RNX, 0xfff0,
 ksz9031rnx_fixup);
if (ret < 0) {
-   netdev_err(dev->net, "fail to register fixup\n");
+   netdev_err(dev->net, "fail to register fixup PHY_KSZ9031RNX\n");
return ret;
}
/* external PHY fixup for LAN8835 */
ret = phy_register_fixup_for_uid(PHY_LAN8835, 0xfff0,
 lan8835_fixup);
if (ret < 0) {
-   netdev_err(dev->net, "fail to register fixup\n");
+   netdev_err(dev->net, "fail to register fixup for PHY_LAN8835\n");
   

[PATCH bpf-next] bpf: clear the ip_tunnel_info.

2018-04-24 Thread William Tu
The percpu metadata_dst might carry stale ip_tunnel_info
and cause incorrect behavior.  When mixing tests using ipv4/ipv6
bpf vxlan and geneve tunnels, the ipv6 tunnel info incorrectly uses
ipv4's src ip addr as its ipv6 src address, because the previous
tunnel info is not cleaned up.  The patch zeros the fields in
ip_tunnel_info.

Signed-off-by: William Tu 
Reported-by: Yifeng Sun 
---
 net/core/filter.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 8e45c6c7ab08..d3781daa26ab 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3281,6 +3281,7 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb,
skb_dst_set(skb, (struct dst_entry *) md);
 
info = &md->u.tun_info;
+   memset(info, 0, sizeof(*info));
info->mode = IP_TUNNEL_INFO_TX;
 
info->key.tun_flags = TUNNEL_KEY | TUNNEL_CSUM | TUNNEL_NOCACHE;
-- 
2.7.4



Re: [PATCH] can: janz-ican3: fix ican3_xmit()'s return type

2018-04-24 Thread Marc Kleine-Budde
On 04/24/2018 03:16 PM, Luc Van Oostenryck wrote:
> The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
> which is a typedef for an enum type, but the implementation in this
> driver returns an 'int'.
> 
> Fix this by returning 'netdev_tx_t' in this driver too.
> 
> Signed-off-by: Luc Van Oostenryck 
> ---

Can you send a series fixing the return type in all CAN drivers?

Marc

-- 
Pengutronix e.K.  | Marc Kleine-Budde   |
Industrial Linux Solutions| Phone: +49-231-2826-924 |
Vertretung West/Dortmund  | Fax:   +49-5121-206917- |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |





e1000e I219 timestamping oops related to TSYNCRXCTL read

2018-04-24 Thread Benjamin Poirier
In the following openSUSE bug report
https://bugzilla.suse.com/show_bug.cgi?id=1075876
Achim reported an oops related to e1000e timestamping:
kernel: RIP: 0010:[] timecounter_read+0xf/0x50
[...]
kernel: Call Trace:
kernel:  [] e1000e_phc_gettime+0x2f/0x60 [e1000e]
kernel:  [] e1000e_systim_overflow_work+0x1d/0x80 [e1000e]
kernel:  [] process_one_work+0x155/0x440
kernel:  [] worker_thread+0x116/0x4b0
kernel:  [] kthread+0xd2/0xf0
kernel:  [] ret_from_fork+0x3f/0x70

It always occurs 4 hours after boot but not on every boot. It is most
likely the same problem reported here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668356
http://lkml.iu.edu/hypermail/linux/kernel/1506.2/index.html#02530
https://bugzilla.redhat.com/show_bug.cgi?id=1463882
https://bugzilla.redhat.com/show_bug.cgi?id=1431863

This occurs with MAC: 12, e1000_pch_spt/I219. The reporter has
reproduced it on a v4.16 derivative.

We've traced it to the fact that e1000e_systim_reset() skips the
timecounter_init() call if e1000e_get_base_timinca() returns -EINVAL,
which leads to a null deref in timecounter_read() (see comment 8 of the
suse bugzilla for more details.)

In commit 83129b37ef35 ("e1000e: fix systim issues", v4.2-rc1) Yanir
reworked e1000e_get_base_timinca() in such a way that it can return
-EINVAL for e1000_pch_spt if the SYSCFI bit is not set in TSYNCRXCTL.
This is also the commit that was identified by bisection in the second
link above (lkml).

What we've observed (in comment 14) is that TSYNCRXCTL reads sometimes
don't have the SYSCFI bit set; retrying the read shortly afterwards finds
the bit set. This was observed at boot (probe) but also at link up and
link down.

I have a few questions:

What's the purpose of the SYSCFI bit in TSYNCRXCTL ("Reserved" in the
datasheet)?

Why does it look like subsequent reads of TSYNCRXCTL sometimes have the
SYSCFI bit set/not set on I219?

Is it right to check the SYSCFI bit in e1000e_get_base_timinca() for
_spt and return -EINVAL if it's not set? Could we just remove that
check?

The patch in comment 13 of the suse bugzilla works around the problem by
retrying TSYNCRXCTL reads; maybe we could instead remove that read
altogether, or move the timecounter_init() call so that we at least avoid
the oops. The best approach seems to depend on the behavior expected of
TSYNCRXCTL reads on I219, so I'm hoping that you could provide more info
on that.
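
For reference, a minimal sketch of the retry idea from comment 13 (the
actual patch in the bugzilla may differ; the register and macro names
below are taken from my reading of the e1000e driver and should be
treated as assumptions):

/* Poll TSYNCRXCTL a few times before concluding that the SYSCFI bit is
 * really clear, to tolerate the transient reads described above.
 */
static bool e1000e_syscfi_settled(struct e1000_hw *hw)
{
	int i;

	for (i = 0; i < 10; i++) {
		if (er32(TSYNCRXCTL) & E1000_TSYNCRXCTL_SYSCFI)
			return true;
		usleep_range(100, 200);
	}
	return false;
}

/* e1000e_get_base_timinca() could then use such a helper for
 * e1000_pch_spt instead of returning -EINVAL after a single failed read.
 */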

Thanks,
-Benjamin


Re: [PATCH 1/1] IB/rxe: avoid double kfree_skb

2018-04-24 Thread Yanjun Zhu

Hi, all

rxe_send [rdma_rxe]
    ip_local_out
        __ip_local_out
        ip_output
            ip_finish_output
                ip_finish_output2
                    dev_queue_xmit
                        __dev_queue_xmit
                            dev_hard_start_xmit
                                e1000_xmit_frame [e1000]

When the skb is sent, it passes through the above functions. I checked
all of them: if an error occurs in any of them after ip_local_out,
kfree_skb is called.
So when ip_local_out returns an error, the skb has already been freed,
and it is not necessary to call kfree_skb again in the soft roce module.
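
To make the ownership rule concrete, here is an illustrative sketch of
the caller pattern being discussed (not the actual rxe code; function
names and error handling details are simplified):

static int rxe_xmit_sketch(struct net *net, struct sk_buff *skb)
{
	int err = ip_local_out(net, skb->sk, skb);

	if (unlikely(net_xmit_eval(err))) {
		pr_debug("error sending packet: %d\n", err);
		/* The stack already owns and has freed the skb on every error
		 * path after ip_local_out(), including nf_hook_slow() dropping
		 * it and returning -EPERM, so do not kfree_skb() it here.
		 */
		return -EAGAIN;
	}
	return 0;
}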


If I am wrong, please correct me.

Zhu Yanjun
On 2018/4/24 16:34, Yanjun Zhu wrote:

Hi, all

rxe_send
    ip_local_out
        __ip_local_out
            nf_hook_slow

In the above call path, nf_hook_slow drops and frees the skb and returns
-EPERM when the iptables rule (iptables -I OUTPUT -p udp --dport 4791
-j DROP) is set.


If skb->users is not changed in softroce, kfree_skb should not be 
called in this module.


I will further investigate the other error handlers after ip_local_out.

If I am wrong, please correct me.

Any reply is appreciated.

Zhu Yanjun
On 2018/4/20 13:46, Yanjun Zhu wrote:



On 2018/4/20 10:19, Doug Ledford wrote:

On Thu, 2018-04-19 at 10:01 -0400, Zhu Yanjun wrote:
When an skb is dropped by an iptables rule, the skb is freed at the same
time that -EPERM is returned. So in softroce, it is not necessary to free
the skb again.

Otherwise, a crash will occur.

The steps to reproduce:

  server      client
 ---------   ---------
 |1.1.1.1|   |1.1.1.2|
 ---------   ---------

On server: rping -s -a 1.1.1.1 -v -C 1 -S 512
On client: rping -c -a 1.1.1.1 -v -C 1 -S 512

The kernel configs CONFIG_DEBUG_KMEMLEAK and
CONFIG_DEBUG_OBJECTS are enabled on both server and client.

When rping runs, run the following command in server:

iptables -I OUTPUT -p udp  --dport 4791 -j DROP

Without this patch, crash will occur.

CC: Srinivas Eeda 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
Reviewed-by: Yuval Shaia 

I have no reason to doubt your analysis, but if there are a bunch of
error paths for net_xmit and they all return with your skb still being
valid and holding a reference, and then one oddball that returns with
your skb already gone, that just sounds like a mistake waiting to happen
(not to mention a bajillion special cases sprinkled everywhere to deal
with this apparent inconsistency).

Can we get a netdev@ confirmation on this being the right solution?

Yes. I agree with you.
After the iptables rule "iptables -I OUTPUT -p udp --dport 4791 -j DROP"
is added, the skb is freed in this function:


/* Returns 1 if okfn() needs to be executed by the caller,
 * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock. */
int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
		 const struct nf_hook_entries *e, unsigned int s)
{
	unsigned int verdict;
	int ret;

	for (; s < e->num_hook_entries; s++) {
		verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);

		switch (verdict & NF_VERDICT_MASK) {
		case NF_ACCEPT:
			break;
		case NF_DROP:
			kfree_skb(skb);
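			/* The quoted excerpt is cut off at this point; for
			 * context, the NF_DROP branch continues roughly as
			 * follows in kernels of this era, which is where the
			 * -EPERM discussed above comes from:
			 */
			ret = NF_DROP_GETERR(verdict);
			if (ret == 0)
				ret = -EPERM;
			return ret;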
