Re: [PATCH net-next v2 1/2] fs/crashdd: add API to collect hardware dump in second kernel

2018-04-03 Thread Jiri Pirko
Mon, Apr 02, 2018 at 02:30:45PM CEST, rahul.lakkire...@chelsio.com wrote:
>On Monday, April 04/02/18, 2018 at 14:41:43 +0530, Jiri Pirko wrote:
>> Fri, Mar 30, 2018 at 08:42:00PM CEST, ebied...@xmission.com wrote:
>> >Rahul Lakkireddy  writes:
>> >
>> >> On Friday, March 03/30/18, 2018 at 16:09:07 +0530, Jiri Pirko wrote:
>> >>> Sat, Mar 24, 2018 at 11:56:33AM CET, rahul.lakkire...@chelsio.com wrote:
>> >>> >Add a new module crashdd that exports the /sys/kernel/crashdd/
>> >>> >directory in second kernel, containing collected hardware/firmware
>> >>> >dumps.
>> >>> >
>> >>> >The sequence of actions done by device drivers to append their
>> >>> >device-specific hardware/firmware logs to the /sys/kernel/crashdd/
>> >>> >directory is as follows:
>> >>> >
>> >>> >1. During probe (before hardware is initialized), device drivers
>> >>> >register to the crashdd module (via crashdd_add_dump()), with
>> >>> >callback function, along with buffer size and log name needed for
>> >>> >firmware/hardware log collection.
>> >>> >
>> >>> >2. Crashdd creates a driver's directory under
>> >>> >/sys/kernel/crashdd/. Then, it allocates the buffer with
>> >>> 
>> >>> This smells. I need to identify the exact ASIC instance that produced
>> >>> the dump. To identify by driver name does not help me if I have multiple
>> >>> instances of the same driver. This looks wrong to me. This looks like
>> >>> a job for devlink where you have 1 devlink instance per 1 ASIC instance.
>> >>> 
>> >>> Please see:
>> >>> http://patchwork.ozlabs.org/project/netdev/list/?series=36524
>> >>> 
>> >>> I believe that the solution in the patchset could be used for
>> >>> your use case too.
>> >>> 
>> >>> 
>> >>
>> >> The sysfs approach proposed here had been dropped in favour of
>> >> exporting the dumps as ELF notes in /proc/vmcore.
>> >>
>> >> Will be posting the new patches soon.
>> >
>> >The concern was actually how you identify which device the dump came
>> >from.  Where you read the identifier changes, but whether it is sysfs
>> >or /proc/vmcore, the concern remains valid.
>> 
>> Yeah. I still don't see how you link the dump and the device.
>
>In our case, the dump and the device are being identified by the
>driver’s name followed by its corresponding pci bus id.  I’ve posted an
>example in my v3 series:
>
>https://www.spinics.net/lists/netdev/msg493781.html
>
>Here’s an extract from the link above:
>
># readelf -n /proc/vmcore
>
>Displaying notes found at file offset 0x1000 with length 0x04003288:
>Owner Data size Description
>VMCOREDD_cxgb4_:02:00.4 0x02000fd8  Unknown note type:(0x0700)
>VMCOREDD_cxgb4_:04:00.4 0x02000fd8  Unknown note type:(0x0700)
>CORE 0x0150 NT_PRSTATUS (prstatus structure)
>CORE 0x0150 NT_PRSTATUS (prstatus structure)
>CORE 0x0150 NT_PRSTATUS (prstatus structure)
>CORE 0x0150 NT_PRSTATUS (prstatus structure)
>CORE 0x0150 NT_PRSTATUS (prstatus structure)
>CORE 0x0150 NT_PRSTATUS (prstatus structure)
>CORE 0x0150 NT_PRSTATUS (prstatus structure)
>CORE 0x0150 NT_PRSTATUS (prstatus structure)
>VMCOREINFO   0x074f Unknown note type: (0x)
>
>Here, for my two devices, the dump’s names are
>VMCOREDD_cxgb4_:02:00.4 and VMCOREDD_cxgb4_:04:00.4.
>
>It’s really up to the callers to write their own unique name for the
>dump.  The name is appended to the “VMCOREDD_” string.
>
>> Rahul, did you look at the patchset I pointed out?
>
>For devlink, I think the dump name would be identified by
>bus_type/device_name; i.e. “pci/:02:00.4” for my example.
>Is my understanding correct?

Yes.


>
>Thanks,
>Rahul


Re: [PATCH 14/15] ARM: pxa: change SSP devices allocation

2018-04-03 Thread Arnd Bergmann
On Mon, Apr 2, 2018 at 4:26 PM, Robert Jarzmik  wrote:

>
> +static struct pxa_ssp_info pxa_ssp_infos[] = {
> +   { .dma_chan_rx_name = "ssp1_rx", .dma_chan_tx_name = "ssp1_tx", },
> +   { .dma_chan_rx_name = "ssp1_rx", .dma_chan_tx_name = "ssp1_tx", },
> +   { .dma_chan_rx_name = "ssp2_rx", .dma_chan_tx_name = "ssp2_tx", },
> +   { .dma_chan_rx_name = "ssp2_rx", .dma_chan_tx_name = "ssp2_tx", },
> +   { .dma_chan_rx_name = "ssp3_rx", .dma_chan_tx_name = "ssp3_tx", },
> +   { .dma_chan_rx_name = "ssp3_rx", .dma_chan_tx_name = "ssp3_tx", },
> +   { .dma_chan_rx_name = "ssp4_rx", .dma_chan_tx_name = "ssp4_tx", },
> +   { .dma_chan_rx_name = "ssp4_rx", .dma_chan_tx_name = "ssp4_tx", },
> +};

This part looks odd to me: you're adding an extra level of indirection to
do two stages of lookups in some form of platform data.

Why can't you just always use "rx" and "tx" as the names here?

(also, I don't see why each line is duplicated, but I'm sure there's
an easy answer for that).

   Arnd


Re: [PATCH 12/15] dmaengine: pxa: make the filter function internal

2018-04-03 Thread Arnd Bergmann
On Mon, Apr 2, 2018 at 6:35 PM, kbuild test robot  wrote:

>
>drivers/mtd/nand/marvell_nand.c:2621:17: sparse: undefined identifier 'pxad_filter_fn'
>>> drivers/mtd/nand/marvell_nand.c:2621:17: sparse: call with no type!
>In file included from drivers/mtd/nand/marvell_nand.c:21:0:
>drivers/mtd/nand/marvell_nand.c: In function 'marvell_nfc_init_dma':
>drivers/mtd/nand/marvell_nand.c:2621:42: error: 'pxad_filter_fn' undeclared (first use in this function); did you mean 'dma_filter_fn'?
>   dma_request_slave_channel_compat(mask, pxad_filter_fn,
>  ^
>include/linux/dmaengine.h:1408:46: note: in definition of macro 'dma_request_slave_channel_compat'
>  __dma_request_slave_channel_compat(&(mask), x, y, dev, name)
>  ^
>drivers/mtd/nand/marvell_nand.c:2621:42: note: each undeclared identifier is reported only once for each function it appears in
>   dma_request_slave_channel_compat(mask, pxad_filter_fn,
>  ^
>include/linux/dmaengine.h:1408:46: note: in definition of macro 'dma_request_slave_channel_compat'
>  __dma_request_slave_channel_compat(&(mask), x, y, dev, name)

The driver is a replacement for the pxa3xx nand driver, so it now has
to get changed as well.

   Arnd


Re: [PATCH 00/15] ARM: pxa: switch to DMA slave maps

2018-04-03 Thread Arnd Bergmann
On Mon, Apr 2, 2018 at 4:26 PM, Robert Jarzmik  wrote:
> Hi,
>
> This series is aimed at removing the dmaengine slave compat use and
> transferring knowledge of the DMA requestors into architecture code.
>
> This was discussed/advised by Arnd a couple of years back; it's almost time.
>
> The series is divided into 3 phases:
>  - phase 1 : patch 1/15 and patch 2/15
>=> this is the preparation work
>  - phase 2 : patches 3/15 .. 10/15
>=> this is the switch of all the drivers
>=> this one will require either an Ack of the maintainers or be taken
>   by them once phase 1 is merged
>  - phase 3 : patches 11/15
>=> this is the last part, cleanup and removal of export of the DMA filter
>   function
>
> As this looks like a patch bomb, each maintainer expressing for his tree
> either an Ack or "I want to take this through my tree" will be spared in
> the next iterations of this series.
>
> Several of these changes have been tested on actual hardware, including :
>  - pxamci
>  - pxa_camera
>  - smc*
>  - ASoC and SSP
>
> Happy review.

This looks really great overall, thanks for cleaning this up!

The SSP part is still a bit weird, as I commented, but I'm sure we can
figure something out there.

Arnd


[PATCH iproute2] rdma: Ignore unknown netlink attributes

2018-04-03 Thread Leon Romanovsky
From: Leon Romanovsky 

The check that rejects netlink attributes above the maximum supported
is too strict and may lead to backward compatibility issues when an old
application runs against a newer kernel that supports new attributes.

CC: Steve Wise 
Fixes: 74bd75c2b68d ("rdma: Add basic infrastructure for RDMA tool")
Signed-off-by: Leon Romanovsky 
---
 rdma/utils.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/rdma/utils.c b/rdma/utils.c
index a2e08e91..5c1e736a 100644
--- a/rdma/utils.c
+++ b/rdma/utils.c
@@ -399,7 +399,8 @@ int rd_attr_cb(const struct nlattr *attr, void *data)
int type;

if (mnl_attr_type_valid(attr, RDMA_NLDEV_ATTR_MAX) < 0)
-   return MNL_CB_ERROR;
+   /* We received unknown attribute */
+   return MNL_CB_OK;

type = mnl_attr_get_type(attr);

--
2.14.3



Re: [PATCH net-next 09/11] devlink: convert occ_get op to separate registration

2018-04-03 Thread Jiri Pirko
Fri, Mar 30, 2018 at 04:45:50PM CEST, dsah...@gmail.com wrote:
>On 3/29/18 2:33 PM, Ido Schimmel wrote:
>> From: Jiri Pirko 
>> 
>> This resolves race during initialization where the resources with
>> ops are registered before driver and the structures used by occ_get
>> op is initialized. So keep occ_get callbacks registered only when
>> all structs are initialized.
>
>Why can't the occ_get handler look at some flag in an mlxsw struct to
>know if the system has initialized?
>
>Separate registration here is awkward. You register a resource and then
>register its op later.

The separation is exactly why this patch was made. Note that the devlink
resource is registered by the core way before initialization is done and
the driver is actually able to perform the op. Also consider the "reload"
case, when the resource is still registered while the driver unloads and
loads again. For that, it makes perfect sense to have the registration
separated. A flag would just make things odd. Also, the priv could not be
used in that case.



>
>Also, this should be a standalone patch rather than embedded in a
>'mlxsw: Various cleanups' set.
>
>
>> 
>> Signed-off-by: Jiri Pirko 
>> Signed-off-by: Ido Schimmel 
>> ---
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 24 ++-
>>  drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  1 -
>>  .../net/ethernet/mellanox/mlxsw/spectrum_kvdl.c| 67 
>>  include/net/devlink.h  | 40 
>>  net/core/devlink.c | 74 
>> +++---
>>  5 files changed, 134 insertions(+), 72 deletions(-)
>> 
>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
>> b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
>> index b831af38e0a1..0d95d2cb73e3 100644
>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
>> @@ -3805,18 +3805,6 @@ static const struct mlxsw_config_profile 
>> mlxsw_sp_config_profile = {
>>  },
>>  };
>>  
>> -static u64 mlxsw_sp_resource_kvd_linear_occ_get(struct devlink *devlink)
>> -{
>> -struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
>> -struct mlxsw_sp *mlxsw_sp = mlxsw_core_driver_priv(mlxsw_core);
>> -
>> -return mlxsw_sp_kvdl_occ_get(mlxsw_sp);
>> -}
>> -
>> -static const struct devlink_resource_ops mlxsw_sp_resource_kvd_linear_ops = 
>> {
>> -.occ_get = mlxsw_sp_resource_kvd_linear_occ_get,
>> -};
>> -
>>  static void
>>  mlxsw_sp_resource_size_params_prepare(struct mlxsw_core *mlxsw_core,
>>struct devlink_resource_size_params 
>> *kvd_size_params,
>> @@ -3877,8 +3865,7 @@ static int mlxsw_sp_resources_register(struct 
>> mlxsw_core *mlxsw_core)
>>  err = devlink_resource_register(devlink, MLXSW_SP_RESOURCE_NAME_KVD,
>>  kvd_size, MLXSW_SP_RESOURCE_KVD,
>>  DEVLINK_RESOURCE_ID_PARENT_TOP,
>> -&kvd_size_params,
>> -NULL);
>> +&kvd_size_params);
>>  if (err)
>>  return err;
>>  
>> @@ -3887,8 +3874,7 @@ static int mlxsw_sp_resources_register(struct 
>> mlxsw_core *mlxsw_core)
>>  linear_size,
>>  MLXSW_SP_RESOURCE_KVD_LINEAR,
>>  MLXSW_SP_RESOURCE_KVD,
>> -&linear_size_params,
>> -&mlxsw_sp_resource_kvd_linear_ops);
>> +&linear_size_params);
>>  if (err)
>>  return err;
>>  
>> @@ -3905,8 +3891,7 @@ static int mlxsw_sp_resources_register(struct 
>> mlxsw_core *mlxsw_core)
>>  double_size,
>>  MLXSW_SP_RESOURCE_KVD_HASH_DOUBLE,
>>  MLXSW_SP_RESOURCE_KVD,
>> -&hash_double_size_params,
>> -NULL);
>> +&hash_double_size_params);
>>  if (err)
>>  return err;
>>  
>> @@ -3915,8 +3900,7 @@ static int mlxsw_sp_resources_register(struct 
>> mlxsw_core *mlxsw_core)
>>  single_size,
>>  MLXSW_SP_RESOURCE_KVD_HASH_SINGLE,
>>  MLXSW_SP_RESOURCE_KVD,
>> -&hash_single_size_params,
>> -NULL);
>> +&hash_single_size_params);
>>  if (err)
>>  return err;
>>  
>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
>> b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
>> index 21bee8f19894..c59a0d7d81d5 100644
>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
>> +++ b/drivers/net/ethernet/mellanox/ml

Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-03 Thread Siwei Liu
On Sun, Apr 1, 2018 at 9:11 AM, David Ahern  wrote:
> On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>> A hidden netdevice is not visible to userspace, such that
>> typical network utilities, e.g. ip, ifconfig et al.,
>> cannot sense its existence or configure it. Internally
>> hidden netdev may associate with an upper level netdev
>> that userspace has access to. Although userspace cannot
>> manipulate the lower netdev directly, user may control
>> or configure the underlying hidden device through the
>> upper-level netdev. For identification purpose, the
>> kobject for hidden netdev still presents in the sysfs
>> hierarchy, however, no uevent message will be generated
>> when the sysfs entry is created, modified or destroyed.
>>
>> To that end, a separate namescope needs to be carved
>> out for IFF_HIDDEN netdevs. As of now, a netdev name that
>> starts with a colon, i.e. ':', is invalid in userspace,
>> since socket ioctls such as SIOCGIFCONF use ':' as the
>> separator for ifname. Names starting with ':' can
>> therefore rightly be used as the namescope for
>> kernel-only IFF_HIDDEN netdevs.
>>
>> Signed-off-by: Si-Wei Liu 
>> ---
>>  include/linux/netdevice.h   |  12 ++
>>  include/net/net_namespace.h |   2 +
>>  net/core/dev.c  | 281 
>> ++--
>>  net/core/net_namespace.c|   1 +
>>  4 files changed, 263 insertions(+), 33 deletions(-)
>>
>
> There are other use cases that want to hide a device from userspace.

Can you elaborate your case in more detail? Looking at the links
below, I realize that the purpose of hiding devices in your case is
quite different from our migration case. Particularly, I don't
like the part that elaborately allows the user to manipulate the link's
visibility - things fall apart easily while live migration is
ongoing. And why do an additional check for invisible links in every
for_each_netdev() and its friends? This effectively changes the
semantics of internal APIs that have existed for decades.

> I would prefer a better solution than playing games with name prefixes and

This part is intentionally left that way and I would anticipate
feedback before going too far. The idea in my mind was that I need a
completely new device namespace underneath (or namescope, which is !=
netns) for all netdevs: both kernel-only IFF_HIDDEN network devices and
those that are not. The current namespace for devname is already exposed
to userspace and visible in the sysfs hierarchy, but for backwards
compatibility reasons it's necessary to keep the old udevd still able
to reference the entry of an IFF_HIDDEN netdev under the /sys/net
directory. Using the ':' prefix has the benefit of minimal
changes without introducing another namespace or the accompanying
complexity of managing two separate namespaces.

Technically, I can create a separate sysfs directory for the new
namescope, say /sys/net-kernel, which includes all netdev entries like
':eth0' and 'ens3', and which can be referenced from /sys/net. It
would make the /sys/net consistent with the view of userspace
utilities, but I am not even sure if that's an overkill, and would
like to gather more feedback before moving to that model.

> one that includes an API for users to list all devices -- even ones

What kind of API would you like to use to query for hidden devices?
rtnetlink? A private socket API? Or something else?

For our case, the sysfs interface is what we need and is sufficient,
since udev is the main target we'd like to support to make the naming
of virtio_bypass consistent and compatible.

> hidden by default.
>
> https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>
> https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>
> Also, why are you suggesting that the device should still be visible via
> /sysfs? That leads to inconsistent views of networking state - /sys
> shows a device but a link dump does not.

See my clarifications above. I don't mind kernel-only netdevs being
visible via sysfs, as that way we get a good trade-off between
backwards compatibility and visibility. There's still a kobject created
there, right? Bottom line is that all kernel devices and their life-cycle
uevents are made invisible to userspace network utilities, and I think
it simply gets to the goal of not breaking existing apps while being
able to add new features.

-Siwei


[PATCH] vhost-net: add limitation of sent packets for tx polling

2018-04-03 Thread 张海斌
handle_tx() will delay rx for a long time when tx busy polls udp packets
with small length (e.g. 1-byte udp payload), because the VHOST_NET_WEIGHT
limit takes into account only sent bytes, not the number of packets.

Tests were done between two Virtual Machines using netperf (UDP_STREAM,
len=1), then another machine pinged the client. Results are as follows:

Packet#   Ping-Latency(ms)
  min avg max
Origin  3.319  18.489  57.503
64  1.643   2.021   2.552
128 1.825   2.600   3.224
256 1.997   2.710   4.295
512*1.860   3.171   4.631
10242.002   4.173   9.056
20482.257   5.650   9.688
40962.093   8.508  15.943

512 is selected, which is a multiple of VRING_SIZE and close to
VHOST_NET_WEIGHT/MTU. To evaluate this change, more tests were done using
netperf (RR, TX) between two machines with an Intel(R) Xeon(R) Gold 6133
CPU @ 2.50GHz. The results below do not show obvious changes:

TCP_RR

size/sessions/+thu%/+normalize%
   1/   1/  -7%/-2%
   1/   4/  +1%/ 0%
   1/   8/  +1%/-2%
  64/   1/  -6%/ 0%
  64/   4/   0%/+2%
  64/   8/   0%/ 0%
 256/   1/  -3%/-4%
 256/   4/  +3%/+4%
 256/   8/  +2%/ 0%

UDP_RR

size/sessions/+thu%/+normalize%
   1/   1/  -5%/+1%
   1/   4/  +4%/+1%
   1/   8/  -1%/-1%
  64/   1/  -2%/-3%
  64/   4/  -5%/-1%
  64/   8/   0%/-1%
 256/   1/  +7%/+1%
 256/   4/  +1%/+1%
 256/   8/  +2%/+2%

TCP_STREAM

size/sessions/+thu%/+normalize%
  64/   1/   0%/-3%
  64/   4/  +3%/-1%
  64/   8/  +9%/-4%
 256/   1/  +1%/-4%
 256/   4/  -1%/-1%
 256/   8/  +7%/+5%
 512/   1/  +1%/ 0%
 512/   4/  +1%/-1%
 512/   8/  +7%/-5%
1024/   1/   0%/-1%
1024/   4/  +3%/ 0%
1024/   8/  +8%/+5%
2048/   1/  +2%/+2%
2048/   4/  +1%/ 0%
2048/   8/  -2%/ 0%
4096/   1/  -2%/ 0%
4096/   4/  +2%/ 0%
4096/   8/  +9%/-2%

Signed-off-by: Haibin Zhang 
Signed-off-by: Yunfang Tai 
Signed-off-by: Lidong Chen 
---
 drivers/vhost/net.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8139bc70ad7d..13a23f3f3ea4 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x8
 
+/* Max number of packets transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving rx. */
+#define VHOST_NET_PKT_WEIGHT 512
+
 /* MAX number of TX used buffers for outstanding zerocopy */
 #define VHOST_MAX_PEND 128
 #define VHOST_GOODCOPY_LEN 256
@@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
bool zcopy, zcopy_used;
+   int sent_pkts = 0;
 
mutex_lock(&vq->mutex);
sock = vq->private_data;
@@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
else
vhost_zerocopy_signal_used(net, vq);
vhost_net_tx_packet(net);
-   if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
+   if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
+   unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
vhost_poll_queue(&vq->poll);
break;
}
-- 
2.12.3



Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-04-03 Thread Pavel Machek
Hi!

> > On Tue, 20 Mar 2018, Ingo Molnar wrote:
> > > * Thomas Gleixner  wrote:
> > > 
>> > > > So I do think we could do more in this area to improve driver
>> > > > performance, if the code is correct and if there's actual
>> > > > benchmarks that are showing real benefits.
>> > > 
>> > > If it's about hotpath performance I'm all for it, but the use case
>> > > here is a debug facility...
>> > > 
>> > > And if we go down that road then we want an AVX based memcpy()
>> > > implementation which is runtime conditional on the feature bit(s)
>> > > and length dependent. Just slapping a readqq() at it and using it
>> > > in a loop does not make any sense.
> > > 
>> > Yeah, so generic memcpy() replacement is only feasible I think if
>> > the most optimistic implementation is actually correct:
>> > 
>> >  - if no preempt disable()/enable() is required
>> > 
>> >  - if direct access to the AVX[2] registers does not disturb legacy
>> >FPU state in any fashion
>> > 
>> >  - if direct access to the AVX[2] registers cannot raise weird
>> >exceptions or have weird behavior if the FPU control word is
>> >modified to non-standard values by untrusted user-space
>> > 
>> > If we have to touch the FPU tag or control words then it's probably
>> > only good for a specialized API.
> > 
> > I did not mean to have a general memcpy replacement. Rather something like
> > magic_memcpy() which falls back to memcpy when AVX is not usable or the
> > length does not justify the AVX stuff at all.
> 
> OK, fair enough.
> 
> Note that a generic version might still be worth trying out, if and
> only if it's safe to access those vector registers directly: modern x86
> CPUs will do their non-constant memcpy()s via the common memcpy_erms()
> function - which could in theory be an easy common point to be
> (cpufeatures-) patched to an AVX2 variant, if size (and alignment,
> perhaps) is a multiple of 32 bytes or so.

How is AVX2 supposed to help the memcpy speed?

If the copy is small, constant overhead will dominate, and I don't
think AVX2 is going to be win there.

If the copy is big, well, the copy loop will likely run out of L1 and
maybe even out of L2, and at that point speed of the loop does not
matter because memory is slow...?

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




linux-next on x60: hangs when I request suspend was Re: linux-next on x60: network manager often complains "network is disabled" after resume

2018-04-03 Thread Pavel Machek
Hi!

I wanted to re-test next (4.16.0-rc7-next-20180329), but that one does
not suspend at all.

I normally suspend by pressing power button in MATE, but that action
currently results in machine hanging.

Pavel


On Mon 2018-03-26 10:33:55, Dan Williams wrote:
> On Sun, 2018-03-25 at 08:19 +0200, Pavel Machek wrote:
> > > > > Ok, what does 'nmcli dev' and 'nmcli radio' show?
> > > > 
> > > > Broken state.
> > > > 
> > > > pavel@amd:~$ nmcli dev
> > > > DEVICE  TYPE  STATECONNECTION
> > > > eth1ethernet  unavailable  --
> > > > lo  loopback  unmanaged--
> > > > wlan0   wifi  unmanaged--
> > > 
> > > If the state is "unmanaged" on resume, that would indicate a
> > > problem
> > > with sleep/wake and likely not a kernel network device issue.
> > > 
> > > We should probably move this discussion to the NM lists to debug
> > > further.  Before you suspend, run "nmcli gen log level trace" to
> > > turn
> > > on full debug logging, then reproduce the issue, and send a pointer
> > > to
> > > those logs (scrubbed for anything you consider sensitive) to the NM
> > > mailing list.
> > 
> > Hmm :-)
> > 
> > root@amd:/data/pavel# nmcli gen log level trace
> > Error: Unknown log level 'trace'
> 
> What NM version?  'trace' is pretty old (since 1.0 from December 2014)
> so unless you're using a really, really old version of Debian I'd
> expect you'd have it.  Anyway, debug would do.
> 
> > root@amd:/data/pavel# nmcli gen log level help
> > Error: Unknown log level 'help'
> 
> nmcli gen help
> 
> > root@amd:/data/pavel# nmcli gen log level
> > Error: value for 'level' argument is required.
> > root@amd:/data/pavel# nmcli gen log level debug
> 
> This should be OK.
> 
> > root@amd:/data/pavel# cat /var/log/sys/log
> 
> It routes it to whatever the syslog 'daemon' facility logs to (however
> that's configured on your system).  Usually /var/log/messages or
> /var/log/daemon.log or sometimes your distro configures it to
> /var/log/NetworkManager.log.
> 
> Or if you're using a systemd-based distro, it would probably be in the
> systemd journal so "journalctl -b -u NetworkManager"
> 
> > Where do I get the logs? I don't see much in the syslog...
> 
> > And.. It seems that it is "every other suspend". One resume results
> > in
> > broken network, one in working one, one in broken one...
> 
> Does your distro use pm-utils, upower, or systemd for suspend/resume
> handling?
> 
> Dan

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: [Patch net] af_unix: remove redundant lockdep class

2018-04-03 Thread Paolo Abeni
On Mon, 2018-04-02 at 11:01 -0700, Cong Wang wrote:
> After commit 581319c58600 ("net/socket: use per af lockdep classes for
> sk queues"), sock queue locks now have per-af lockdep classes, including
> unix sockets. It is no longer necessary to work around it.
> 
> I noticed this while looking at a syzbot deadlock report; this patch
> itself doesn't fix it (which is why I don't add Reported-by).
> 
> Fixes: 581319c58600 ("net/socket: use per af lockdep classes for sk queues")
> Cc: Paolo Abeni 
> Signed-off-by: Cong Wang 
> ---
>  net/unix/af_unix.c | 10 --
>  1 file changed, 10 deletions(-)
> 
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 2d465bdeccbc..45971e173924 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -745,14 +745,6 @@ static struct proto unix_proto = {
>   .obj_size   = sizeof(struct unix_sock),
>  };
>  
> -/*
> - * AF_UNIX sockets do not interact with hardware, hence they
> - * dont trigger interrupts - so it's safe for them to have
> - * bh-unsafe locking for their sk_receive_queue.lock. Split off
> - * this special lock-class by reinitializing the spinlock key:
> - */
> -static struct lock_class_key af_unix_sk_receive_queue_lock_key;
> -
>  static struct sock *unix_create1(struct net *net, struct socket *sock, int 
> kern)
>  {
>   struct sock *sk = NULL;
> @@ -767,8 +759,6 @@ static struct sock *unix_create1(struct net *net, struct 
> socket *sock, int kern)
>   goto out;
>  
>   sock_init_data(sock, sk);
> - lockdep_set_class(&sk->sk_receive_queue.lock,
> - &af_unix_sk_receive_queue_lock_key);
>  
>   sk->sk_allocation   = GFP_KERNEL_ACCOUNT;
>   sk->sk_write_space  = unix_write_space;

LGTM

Acked-by: Paolo Abeni 



Re: linux-next: build failure after merge of the tip tree

2018-04-03 Thread Peter Zijlstra
On Tue, Apr 03, 2018 at 03:41:22PM +1000, Stephen Rothwell wrote:

> Caused by commit
> 
>   9b8cce52c4b5 ("sched/wait: Remove the wait_on_atomic_t() API")
> 
> interacting with commits
> 
>   d3be4d244330 ("xrpc: Fix potential call vs socket/net destruction race")
>   31f5f9a1691e ("rxrpc: Fix apparent leak of rxrpc_local objects")
> 
> from the net-next tree.
> 
> Haven't we figured out how to remove/change APIs yet? :-(

I figured that since there were only a handful of users it wasn't a
popular API; also, David very much knew of those patches changing it, so
he could easily have pulled in the special tip/sched/wait branch :/



[PATCH v2] net: phy: marvell10g: add thermal hwmon device

2018-04-03 Thread Russell King
Add a thermal monitoring device for the Marvell 88x3310, which updates
once a second.  We also need to hook into the suspend/resume mechanism
to ensure that the thermal monitoring is reconfigured when we resume.

Suggested-by: Andrew Lunn 
Signed-off-by: Russell King 
---
v2: update to apply to net-next

 drivers/net/phy/marvell10g.c | 184 ++-
 1 file changed, 182 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/marvell10g.c b/drivers/net/phy/marvell10g.c
index 8a0bd98fdec7..db9d66781da6 100644
--- a/drivers/net/phy/marvell10g.c
+++ b/drivers/net/phy/marvell10g.c
@@ -21,8 +21,10 @@
  * If both the fiber and copper ports are connected, the first to gain
  * link takes priority and the other port is completely locked out.
  */
-#include 
+#include 
+#include 
 #include 
+#include 
 
 enum {
MV_PCS_BASE_T   = 0x,
@@ -40,6 +42,19 @@ enum {
 */
MV_AN_CTRL1000  = 0x8000, /* 1000base-T control register */
MV_AN_STAT1000  = 0x8001, /* 1000base-T status register */
+
+   /* Vendor2 MMD registers */
+   MV_V2_TEMP_CTRL = 0xf08a,
+   MV_V2_TEMP_CTRL_MASK= 0xc000,
+   MV_V2_TEMP_CTRL_SAMPLE  = 0x,
+   MV_V2_TEMP_CTRL_DISABLE = 0xc000,
+   MV_V2_TEMP  = 0xf08c,
+   MV_V2_TEMP_UNKNOWN  = 0x9600, /* unknown function */
+};
+
+struct mv3310_priv {
+   struct device *hwmon_dev;
+   char *hwmon_name;
 };
 
 static int mv3310_modify(struct phy_device *phydev, int devad, u16 reg,
@@ -60,17 +75,180 @@ static int mv3310_modify(struct phy_device *phydev, int 
devad, u16 reg,
return ret < 0 ? ret : 1;
 }
 
+#ifdef CONFIG_HWMON
+static umode_t mv3310_hwmon_is_visible(const void *data,
+  enum hwmon_sensor_types type,
+  u32 attr, int channel)
+{
+   if (type == hwmon_chip && attr == hwmon_chip_update_interval)
+   return 0444;
+   if (type == hwmon_temp && attr == hwmon_temp_input)
+   return 0444;
+   return 0;
+}
+
+static int mv3310_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
+u32 attr, int channel, long *value)
+{
+   struct phy_device *phydev = dev_get_drvdata(dev);
+   int temp;
+
+   if (type == hwmon_chip && attr == hwmon_chip_update_interval) {
+   *value = MSEC_PER_SEC;
+   return 0;
+   }
+
+   if (type == hwmon_temp && attr == hwmon_temp_input) {
+   temp = phy_read_mmd(phydev, MDIO_MMD_VEND2, MV_V2_TEMP);
+   if (temp < 0)
+   return temp;
+
+   *value = ((temp & 0xff) - 75) * 1000;
+
+   return 0;
+   }
+
+   return -EOPNOTSUPP;
+}
+
+static const struct hwmon_ops mv3310_hwmon_ops = {
+   .is_visible = mv3310_hwmon_is_visible,
+   .read = mv3310_hwmon_read,
+};
+
+static u32 mv3310_hwmon_chip_config[] = {
+   HWMON_C_REGISTER_TZ | HWMON_C_UPDATE_INTERVAL,
+   0,
+};
+
+static const struct hwmon_channel_info mv3310_hwmon_chip = {
+   .type = hwmon_chip,
+   .config = mv3310_hwmon_chip_config,
+};
+
+static u32 mv3310_hwmon_temp_config[] = {
+   HWMON_T_INPUT,
+   0,
+};
+
+static const struct hwmon_channel_info mv3310_hwmon_temp = {
+   .type = hwmon_temp,
+   .config = mv3310_hwmon_temp_config,
+};
+
+static const struct hwmon_channel_info *mv3310_hwmon_info[] = {
+   &mv3310_hwmon_chip,
+   &mv3310_hwmon_temp,
+   NULL,
+};
+
+static const struct hwmon_chip_info mv3310_hwmon_chip_info = {
+   .ops = &mv3310_hwmon_ops,
+   .info = mv3310_hwmon_info,
+};
+
+static int mv3310_hwmon_config(struct phy_device *phydev, bool enable)
+{
+   u16 val;
+   int ret;
+
+   ret = phy_write_mmd(phydev, MDIO_MMD_VEND2, MV_V2_TEMP,
+   MV_V2_TEMP_UNKNOWN);
+   if (ret < 0)
+   return ret;
+
+   val = enable ? MV_V2_TEMP_CTRL_SAMPLE : MV_V2_TEMP_CTRL_DISABLE;
+   ret = mv3310_modify(phydev, MDIO_MMD_VEND2, MV_V2_TEMP_CTRL,
+   MV_V2_TEMP_CTRL_MASK, val);
+
+   return ret < 0 ? ret : 0;
+}
+
+static void mv3310_hwmon_disable(void *data)
+{
+   struct phy_device *phydev = data;
+
+   mv3310_hwmon_config(phydev, false);
+}
+
+static int mv3310_hwmon_probe(struct phy_device *phydev)
+{
+   struct device *dev = &phydev->mdio.dev;
+   struct mv3310_priv *priv = dev_get_drvdata(&phydev->mdio.dev);
+   int i, j, ret;
+
+   priv->hwmon_name = devm_kstrdup(dev, dev_name(dev), GFP_KERNEL);
+   if (!priv->hwmon_name)
+   return -ENODEV;
+
+   for (i = j = 0; priv->hwmon_name[i]; i++) {
+   if (isalnum(priv->hwmon_name[i])) {
+   if (i != j)
+   priv->hwmon_name[j] = priv->hwmon_name[i];
+   j++;
+   }
+   }
+

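A standalone, userspace restatement of the two computations in the (truncated) hwmon hunks above: the temperature conversion from the raw VEND2 register value, and the in-place hwmon-name sanitization loop. The terminating '\0' write in sanitize_name() is an assumption, since the fragment above cuts off before the loop's follow-up.

```c
#include <ctype.h>
#include <string.h>

/* Low byte of the register holds the temperature with a +75 degC offset;
 * hwmon reports millidegrees, hence the * 1000. Mirrors the read path above. */
static long mv3310_temp_millicelsius(int raw)
{
	return ((raw & 0xff) - 75) * 1000L;
}

/* Compact the string to its alphanumeric characters only, in place,
 * as the probe loop above does for the hwmon device name. */
static void sanitize_name(char *s)
{
	int i, j;

	for (i = j = 0; s[i]; i++) {
		if (isalnum((unsigned char)s[i])) {
			if (i != j)
				s[j] = s[i];
			j++;
		}
	}
	s[j] = '\0';	/* assumed terminator; the hunk above is truncated */
}
```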
Re: [PATCH net-next 02/12] clk: sunxi-ng: r40: export a regmap to access the GMAC register

2018-04-03 Thread Maxime Ripard
On Tue, Mar 20, 2018 at 03:15:02PM +0800, Chen-Yu Tsai wrote:
> On Mon, Mar 19, 2018 at 5:31 AM, Maxime Ripard
>  wrote:
> > On Sat, Mar 17, 2018 at 05:28:47PM +0800, Chen-Yu Tsai wrote:
> >> From: Icenowy Zheng 
> >>
> >> There's a GMAC configuration register, which exists on A64/A83T/H3/H5 in
> >> the syscon part, in the CCU of R40 SoC.
> >>
> >> Export a regmap of the CCU.
> >>
> >> Read access is not restricted to all registers, but only the GMAC
> >> register is allowed to be written.
> >>
> >> Signed-off-by: Icenowy Zheng 
> >> Signed-off-by: Chen-Yu Tsai 
> >
> > Gah, this is crazy. I'm really starting to regret letting that syscon
> > in in the first place...
> 
> IMHO syscon is really a better fit. It's part of the glue layer and
> most other dwmac user platforms treat it as such and use a syscon.
> Plus the controls encompass delays (phase), inverters (polarity),
> and even signal routing. It's not really just a group of clock controls,
> like what we poorly modeled for A20/A31. I think that was really a
> mistake.
> 
> As I mentioned in the cover letter, a slightly saner approach would
> be to let drivers add custom syscon entries, which would then require
> less custom plumbing.

A syscon is convenient, sure, but it also bypasses any abstraction
layer we have everywhere else, which means that we'll have to maintain
the register layout in each and every driver that uses it.

So far, it's only been the GMAC, but it could also be others (the SRAM
controller comes to mind), and then, if there's any difference in
the design of a future SoC, we'll have to maintain that in the GMAC
driver as well.

> > And I'm not really looking forward the time where SCPI et al. will be
> > mature and we'll have the clock controller completely outside of our
> > control.
> 
> I don't think it's going to happen for any of the older SoCs. The R40
> only stands out because the GMAC controls are in the clock controller
> address space, presumably to be like the A20.

SCPI (or equivalent) is a really nice feature to have when it comes to
virtualization, so even if it's less likely, it doesn't make it less
relevant on other SoCs.

Maxime

-- 
Maxime Ripard, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com




Re: [PATCH net-next 02/12] clk: sunxi-ng: r40: export a regmap to access the GMAC register

2018-04-03 Thread Maxime Ripard
On Tue, Apr 03, 2018 at 11:48:45AM +0200, Maxime Ripard wrote:
> On Tue, Mar 20, 2018 at 03:15:02PM +0800, Chen-Yu Tsai wrote:
> > On Mon, Mar 19, 2018 at 5:31 AM, Maxime Ripard
> >  wrote:
> > > On Sat, Mar 17, 2018 at 05:28:47PM +0800, Chen-Yu Tsai wrote:
> > >> From: Icenowy Zheng 
> > >>
> > >> There's a GMAC configuration register, which exists on A64/A83T/H3/H5 in
> > >> the syscon part, in the CCU of R40 SoC.
> > >>
> > >> Export a regmap of the CCU.
> > >>
> > >> Read access is not restricted to all registers, but only the GMAC
> > >> register is allowed to be written.
> > >>
> > >> Signed-off-by: Icenowy Zheng 
> > >> Signed-off-by: Chen-Yu Tsai 
> > >
> > > Gah, this is crazy. I'm really starting to regret letting that syscon
> > > in in the first place...
> > 
> > IMHO syscon is really a better fit. It's part of the glue layer and
> > most other dwmac user platforms treat it as such and use a syscon.
> > Plus the controls encompass delays (phase), inverters (polarity),
> > and even signal routing. It's not really just a group of clock controls,
> > like what we poorly modeled for A20/A31. I think that was really a
> > mistake.
> > 
> > As I mentioned in the cover letter, a slightly saner approach would
> > be to let drivers add custom syscon entries, which would then require
> > less custom plumbing.
> 
> A syscon is convenient, sure, but it also bypasses any abstraction
> layer we have everywhere else, which means that we'll have to maintain
> the register layout in each and every driver that uses it.
> 
> So far, it's only be the GMAC, but it can also be others (the SRAM
> controller comes to my mind), and then, if there's any difference in
> the design in a future SoC, we'll have to maintain that in the GMAC
> driver as well.

I guess I forgot to say something: I'm fine with using a syscon we
already have.

I'm just questioning whether merging any other driver using one is the
right move.

Maxime

-- 
Maxime Ripard, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com




Re: possible deadlock in skb_queue_tail

2018-04-03 Thread Kirill Tkhai
On 02.04.2018 12:20, syzbot wrote:
> Hello,
> 
> syzbot hit the following crash on net-next commit
> 06b19fe9a6df7aaa423cd8404ebe5ac9ec4b2960 (Sun Apr 1 03:37:33 2018 +)
> Merge branch 'chelsio-inline-tls'
> syzbot dashboard link: 
> https://syzkaller.appspot.com/bug?extid=6b495100f17ca8554ab9
> 
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output: 
> https://syzkaller.appspot.com/x/log.txt?id=6218830443446272
> Kernel config: https://syzkaller.appspot.com/x/.config?id=3327544840960562528
> compiler: gcc (GCC) 7.1.1 20170620
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+6b495100f17ca8554...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for details.
> If you forward the report, please keep this part and the footer.
> 
> 
> ======================================================
> WARNING: possible circular locking dependency detected
> 4.16.0-rc6+ #290 Not tainted
> ------------------------------------------------------
> syz-executor7/20971 is trying to acquire lock:
>  (&af_unix_sk_receive_queue_lock_key){+.+.}, at: [<271ef0d8>] 
> skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
> 
> but task is already holding lock:
>  (&(&u->lock)->rlock/1){+.+.}, at: [<4e725e14>] 
> unix_state_double_lock+0x7b/0xb0 net/unix/af_unix.c:1088
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&(&u->lock)->rlock/1){+.+.}:
>    _raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
>    sk_diag_dump_icons net/unix/diag.c:82 [inline]
>    sk_diag_fill.isra.4+0xa52/0xfe0 net/unix/diag.c:144
>    sk_diag_dump net/unix/diag.c:178 [inline]
>    unix_diag_dump+0x400/0x4f0 net/unix/diag.c:206
>    netlink_dump+0x492/0xcf0 net/netlink/af_netlink.c:2221
>    __netlink_dump_start+0x4ec/0x710 net/netlink/af_netlink.c:2318
>    netlink_dump_start include/linux/netlink.h:214 [inline]
>    unix_diag_handler_dump+0x3e7/0x750 net/unix/diag.c:307
>    __sock_diag_cmd net/core/sock_diag.c:230 [inline]
>    sock_diag_rcv_msg+0x204/0x360 net/core/sock_diag.c:261
>    netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2443
>    sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
>    netlink_unicast_kernel net/netlink/af_netlink.c:1307 [inline]
>    netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1333
>    netlink_sendmsg+0xa4a/0xe80 net/netlink/af_netlink.c:1896
>    sock_sendmsg_nosec net/socket.c:629 [inline]
>    sock_sendmsg+0xca/0x110 net/socket.c:639
>    sock_write_iter+0x31a/0x5d0 net/socket.c:908
>    call_write_iter include/linux/fs.h:1782 [inline]
>    new_sync_write fs/read_write.c:469 [inline]
>    __vfs_write+0x684/0x970 fs/read_write.c:482
>    vfs_write+0x189/0x510 fs/read_write.c:544
>    SYSC_write fs/read_write.c:589 [inline]
>    SyS_write+0xef/0x220 fs/read_write.c:581
>    do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>    entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> -> #0 (&af_unix_sk_receive_queue_lock_key){+.+.}:
>    lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
>    __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>    _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
>    skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>    unix_dgram_sendmsg+0xa30/0x1610 net/unix/af_unix.c:1807
>    sock_sendmsg_nosec net/socket.c:629 [inline]
>    sock_sendmsg+0xca/0x110 net/socket.c:639
>    ___sys_sendmsg+0x320/0x8b0 net/socket.c:2047
>    __sys_sendmmsg+0x1ee/0x620 net/socket.c:2137
>    SYSC_sendmmsg net/socket.c:2168 [inline]
>    SyS_sendmmsg+0x35/0x60 net/socket.c:2163
>    do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>    entry_SYSCALL_64_after_hwframe+0x42/0xb7

sk_diag_dump_icons() dumps only sockets in TCP_LISTEN state.
TCP_LISTEN state is assigned in only one place in net/unix/af_unix.c:
unix_listen(). The function applies to the stream and seqpacket
socket types.

It can't be stream because of the second stack, and seqpacket also can't,
as I don't think it's possible for gcc to inline unix_seqpacket_sendmsg()
in such a way that we wouldn't see it in the stack.

So, this looks like a false positive to me.

Kirill

> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&(&u->lock)->rlock/1);
>                                lock(&af_unix_sk_receive_queue_lock_key);
>                                lock(&(&u->lock)->rlock/1);
>   lock(&af_unix_sk_receive_queue_lock_key);
> 
>  *** DEADLOCK ***
> 
> 1 lock held by syz-executor7/20971:
>  #0:  (&(&u->lock)->rlock/1){+.+.}, at: [<4e725e14>] 
> unix_state_double_lock+0x7b/0xb0 net/unix/af_unix.c

FW: gretap tunnel redirecting 2 different networks on destination host

2018-04-03 Thread Marc Roos
 
I see you are quite busy discussing the patches etc. If this is the
incorrect place to ask for a little help, please let me know. I just got
this from someone on Stack Overflow who got some answers here.




-Original Message-
Subject: gretap tunnel redirecting 2 different networks on destination 
host


How can I get the 10.11.12.x traffic received on tun1 at server B to 
eth2, and 172.16.1.x to eth1? 


I have a server A that sends 172.16.1.x and 10.11.12.x traffic via a
gretap tunnel (192.168.1.x) to server B.

            +-----+                      +----+
 172.16.1.x |  B  |                      |  A |
 -----------|eth1 | 192.168.1.x GRETAP   |    |
            | tun1|----------------------|tun1|
 10.11.12.x |     |                      |    |
 -----------|eth2 |                      |    |
            +-----+                      +----+

When I put the tun1 interface of server B in a bridge with eth1, I am
able to ping several 172.16.1.x IPs from server A, and communication on
this network seems to be OK.


- I cannot put eth2 on the same bridge.
- I thought of creating a 2nd gretap tunnel and using each tunnel for one
network, but I think there is probably a better solution.
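For reference, the working half of the setup described above (bridging tun1 with eth1 on server B) might look like this; the tunnel endpoint addresses are assumptions, the interface names are as in the diagram:

```sh
# Server B: gretap endpoint towards server A (addresses assumed)
ip link add tun1 type gretap local 192.168.1.2 remote 192.168.1.1

# Bridge tun1 with eth1 -- the part reported to work for 172.16.1.x
ip link add br0 type bridge
ip link set eth1 master br0
ip link set tun1 master br0
ip link set br0 up
ip link set tun1 up
```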


Re: [PATCH net-next 02/12] clk: sunxi-ng: r40: export a regmap to access the GMAC register

2018-04-03 Thread Chen-Yu Tsai
On Tue, Apr 3, 2018 at 5:50 PM, Maxime Ripard  wrote:
> On Tue, Apr 03, 2018 at 11:48:45AM +0200, Maxime Ripard wrote:
>> On Tue, Mar 20, 2018 at 03:15:02PM +0800, Chen-Yu Tsai wrote:
>> > On Mon, Mar 19, 2018 at 5:31 AM, Maxime Ripard
>> >  wrote:
>> > > On Sat, Mar 17, 2018 at 05:28:47PM +0800, Chen-Yu Tsai wrote:
>> > >> From: Icenowy Zheng 
>> > >>
>> > >> There's a GMAC configuration register, which exists on A64/A83T/H3/H5 in
>> > >> the syscon part, in the CCU of R40 SoC.
>> > >>
>> > >> Export a regmap of the CCU.
>> > >>
>> > >> Read access is not restricted to all registers, but only the GMAC
>> > >> register is allowed to be written.
>> > >>
>> > >> Signed-off-by: Icenowy Zheng 
>> > >> Signed-off-by: Chen-Yu Tsai 
>> > >
>> > > Gah, this is crazy. I'm really starting to regret letting that syscon
>> > > in in the first place...
>> >
>> > IMHO syscon is really a better fit. It's part of the glue layer and
>> > most other dwmac user platforms treat it as such and use a syscon.
>> > Plus the controls encompass delays (phase), inverters (polarity),
>> > and even signal routing. It's not really just a group of clock controls,
>> > like what we poorly modeled for A20/A31. I think that was really a
>> > mistake.
>> >
>> > As I mentioned in the cover letter, a slightly saner approach would
>> > be to let drivers add custom syscon entries, which would then require
>> > less custom plumbing.
>>
>> A syscon is convenient, sure, but it also bypasses any abstraction
>> layer we have everywhere else, which means that we'll have to maintain
>> the register layout in each and every driver that uses it.
>>
>> So far, it's only be the GMAC, but it can also be others (the SRAM
>> controller comes to my mind), and then, if there's any difference in
>> the design in a future SoC, we'll have to maintain that in the GMAC
>> driver as well.
>
> I guess I forgot to say something, I'm fine with using a syscon we
> already have.
>
> I'm just questionning if merging any other driver using one is the
> right move.

Right. So in this case, we are not actually going through the syscon
API; rather, we are exporting a regmap whose properties we define
ourselves. If it makes it more acceptable to you, we could map just
the GMAC register in the new regmap, and also have it named. This is
all plumbing within the kernel, so the device tree stays the same.

ChenYu


Re: [PATCH net-next 02/12] clk: sunxi-ng: r40: export a regmap to access the GMAC register

2018-04-03 Thread Icenowy Zheng


于 2018年4月3日 GMT+08:00 下午5:50:05, Maxime Ripard  写到:
>On Tue, Apr 03, 2018 at 11:48:45AM +0200, Maxime Ripard wrote:
>> On Tue, Mar 20, 2018 at 03:15:02PM +0800, Chen-Yu Tsai wrote:
>> > On Mon, Mar 19, 2018 at 5:31 AM, Maxime Ripard
>> >  wrote:
>> > > On Sat, Mar 17, 2018 at 05:28:47PM +0800, Chen-Yu Tsai wrote:
>> > >> From: Icenowy Zheng 
>> > >>
>> > >> There's a GMAC configuration register, which exists on
>A64/A83T/H3/H5 in
>> > >> the syscon part, in the CCU of R40 SoC.
>> > >>
>> > >> Export a regmap of the CCU.
>> > >>
>> > >> Read access is not restricted to all registers, but only the
>GMAC
>> > >> register is allowed to be written.
>> > >>
>> > >> Signed-off-by: Icenowy Zheng 
>> > >> Signed-off-by: Chen-Yu Tsai 
>> > >
>> > > Gah, this is crazy. I'm really starting to regret letting that
>syscon
>> > > in in the first place...
>> > 
>> > IMHO syscon is really a better fit. It's part of the glue layer and
>> > most other dwmac user platforms treat it as such and use a syscon.
>> > Plus the controls encompass delays (phase), inverters (polarity),
>> > and even signal routing. It's not really just a group of clock
>controls,
>> > like what we poorly modeled for A20/A31. I think that was really a
>> > mistake.
>> > 
>> > As I mentioned in the cover letter, a slightly saner approach would
>> > be to let drivers add custom syscon entries, which would then
>require
>> > less custom plumbing.
>> 
>> A syscon is convenient, sure, but it also bypasses any abstraction
>> layer we have everywhere else, which means that we'll have to
>maintain
>> the register layout in each and every driver that uses it.
>> 
>> So far, it's only be the GMAC, but it can also be others (the SRAM
>> controller comes to my mind), and then, if there's any difference in
>> the design in a future SoC, we'll have to maintain that in the GMAC
>> driver as well.
>
>I guess I forgot to say something, I'm fine with using a syscon we
>already have.
>
>I'm just questionning if merging any other driver using one is the
>right move.

Even for the current SoCs supported by dwmac-sun8i, there
is a syscon/SRAM-controller problem: they're both at 0x1c0.

The first example of the need for the SRAM controller is the
A64, where we need to claim SRAM C for DE2 access.

>
>Maxime


Re: [PATCH net-next 02/12] clk: sunxi-ng: r40: export a regmap to access the GMAC register

2018-04-03 Thread Icenowy Zheng


于 2018年4月3日 GMT+08:00 下午5:53:08, Chen-Yu Tsai  写到:
>On Tue, Apr 3, 2018 at 5:50 PM, Maxime Ripard
> wrote:
>> On Tue, Apr 03, 2018 at 11:48:45AM +0200, Maxime Ripard wrote:
>>> On Tue, Mar 20, 2018 at 03:15:02PM +0800, Chen-Yu Tsai wrote:
>>> > On Mon, Mar 19, 2018 at 5:31 AM, Maxime Ripard
>>> >  wrote:
>>> > > On Sat, Mar 17, 2018 at 05:28:47PM +0800, Chen-Yu Tsai wrote:
>>> > >> From: Icenowy Zheng 
>>> > >>
>>> > >> There's a GMAC configuration register, which exists on
>A64/A83T/H3/H5 in
>>> > >> the syscon part, in the CCU of R40 SoC.
>>> > >>
>>> > >> Export a regmap of the CCU.
>>> > >>
>>> > >> Read access is not restricted to all registers, but only the
>GMAC
>>> > >> register is allowed to be written.
>>> > >>
>>> > >> Signed-off-by: Icenowy Zheng 
>>> > >> Signed-off-by: Chen-Yu Tsai 
>>> > >
>>> > > Gah, this is crazy. I'm really starting to regret letting that
>syscon
>>> > > in in the first place...
>>> >
>>> > IMHO syscon is really a better fit. It's part of the glue layer
>and
>>> > most other dwmac user platforms treat it as such and use a syscon.
>>> > Plus the controls encompass delays (phase), inverters (polarity),
>>> > and even signal routing. It's not really just a group of clock
>controls,
>>> > like what we poorly modeled for A20/A31. I think that was really a
>>> > mistake.
>>> >
>>> > As I mentioned in the cover letter, a slightly saner approach
>would
>>> > be to let drivers add custom syscon entries, which would then
>require
>>> > less custom plumbing.
>>>
>>> A syscon is convenient, sure, but it also bypasses any abstraction
>>> layer we have everywhere else, which means that we'll have to
>maintain
>>> the register layout in each and every driver that uses it.
>>>
>>> So far, it's only be the GMAC, but it can also be others (the SRAM
>>> controller comes to my mind), and then, if there's any difference in
>>> the design in a future SoC, we'll have to maintain that in the GMAC
>>> driver as well.
>>
>> I guess I forgot to say something, I'm fine with using a syscon we
>> already have.
>>
>> I'm just questionning if merging any other driver using one is the
>> right move.
>
>Right. So in this case, we are not actually going through the syscon
>API. Rather we are exporting a regmap whose properties we actually
>define. If it makes you more acceptable to it, we could map just
>the GMAC register in the new regmap, and also have it named. This
>is all plumbing within the kernel so the device tree stays the same.

I think my driver has already restricted write permission
to the GMAC register only.

>
>ChenYu


Re: [PATCH net-next 02/12] clk: sunxi-ng: r40: export a regmap to access the GMAC register

2018-04-03 Thread Chen-Yu Tsai
On Tue, Apr 3, 2018 at 5:54 PM, Icenowy Zheng  wrote:
>
>
> 于 2018年4月3日 GMT+08:00 下午5:53:08, Chen-Yu Tsai  写到:
>>On Tue, Apr 3, 2018 at 5:50 PM, Maxime Ripard
>> wrote:
>>> On Tue, Apr 03, 2018 at 11:48:45AM +0200, Maxime Ripard wrote:
 On Tue, Mar 20, 2018 at 03:15:02PM +0800, Chen-Yu Tsai wrote:
 > On Mon, Mar 19, 2018 at 5:31 AM, Maxime Ripard
 >  wrote:
 > > On Sat, Mar 17, 2018 at 05:28:47PM +0800, Chen-Yu Tsai wrote:
 > >> From: Icenowy Zheng 
 > >>
 > >> There's a GMAC configuration register, which exists on
>>A64/A83T/H3/H5 in
 > >> the syscon part, in the CCU of R40 SoC.
 > >>
 > >> Export a regmap of the CCU.
 > >>
 > >> Read access is not restricted to all registers, but only the
>>GMAC
 > >> register is allowed to be written.
 > >>
 > >> Signed-off-by: Icenowy Zheng 
 > >> Signed-off-by: Chen-Yu Tsai 
 > >
 > > Gah, this is crazy. I'm really starting to regret letting that
>>syscon
 > > in in the first place...
 >
 > IMHO syscon is really a better fit. It's part of the glue layer
>>and
 > most other dwmac user platforms treat it as such and use a syscon.
 > Plus the controls encompass delays (phase), inverters (polarity),
 > and even signal routing. It's not really just a group of clock
>>controls,
 > like what we poorly modeled for A20/A31. I think that was really a
 > mistake.
 >
 > As I mentioned in the cover letter, a slightly saner approach
>>would
 > be to let drivers add custom syscon entries, which would then
>>require
 > less custom plumbing.

 A syscon is convenient, sure, but it also bypasses any abstraction
 layer we have everywhere else, which means that we'll have to
>>maintain
 the register layout in each and every driver that uses it.

 So far, it's only be the GMAC, but it can also be others (the SRAM
 controller comes to my mind), and then, if there's any difference in
 the design in a future SoC, we'll have to maintain that in the GMAC
 driver as well.
>>>
>>> I guess I forgot to say something, I'm fine with using a syscon we
>>> already have.
>>>
>>> I'm just questionning if merging any other driver using one is the
>>> right move.
>>
>>Right. So in this case, we are not actually going through the syscon
>>API. Rather we are exporting a regmap whose properties we actually
>>define. If it makes you more acceptable to it, we could map just
>>the GMAC register in the new regmap, and also have it named. This
>>is all plumbing within the kernel so the device tree stays the same.
>
> I think my driver has already restricted the write permission
> only to GMAC register.

Correct, but it still maps the entire region out, which means the
consumer needs to know which offset to use. Maxime is saying this
is something that is troublesome to maintain. So my proposal was
to create a regmap with a base at the GMAC register offset. That
way, the consumer doesn't need to use an offset to access it.
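A rough sketch of that proposal (hypothetical names and register offset; not a posted patch): a one-register, named regmap based at the GMAC configuration register, so the consumer always uses offset 0 and never sees the rest of the CCU address space.

```c
/* Hypothetical sketch: narrow, named regmap over just the GMAC register. */
static const struct regmap_config sun8i_r40_gmac_regmap_cfg = {
	.name		= "gmac-syscon",
	.reg_bits	= 32,
	.val_bits	= 32,
	.reg_stride	= 4,
	.max_register	= 0,	/* a single 32-bit register */
};

/* In the CCU probe, with "base" the ioremapped CCU registers and
 * SUN8I_R40_GMAC_CFG_REG an assumed offset macro: */
regmap = devm_regmap_init_mmio(dev, base + SUN8I_R40_GMAC_CFG_REG,
			       &sun8i_r40_gmac_regmap_cfg);
```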

ChenYu


Re: [PATCH net-next] bridge: Allow max MTU when multiple VLANs present

2018-04-03 Thread Chas Williams
On Tue, Apr 3, 2018 at 2:13 AM, Roopa Prabhu  wrote:
> On Mon, Apr 2, 2018 at 8:26 AM, Chas Williams <3ch...@gmail.com> wrote:
>> On Mon, Apr 2, 2018 at 11:08 AM, Roopa Prabhu  
>> wrote:
>>>
>
> [snip]
>
>>> they are popular...in-fact they are the default bridge mode on our
>>> network switches.
>>> And they have been around for some time now to ignore its users.
>>> Plus it is not right to change default mtu behavior for one mode of the 
>>> bridge
>>> and not the others (bridge mtu handling from user-space is complex enough 
>>> today
>>> due to dynamic mtu changes on port enslave/deslave).
>>
>> I don't see the issue with one mode of bridge behaving differently
>> from another mode.
>> The VLAN behavior between the two bridge modes is completely different so 
>> having
>> a different MTU behavior doesn't seem that surprising.
>>
>> You are potentially mixing different sized VLAN on a same bridge.  The only 
>> sane
>> choice is to pick the largest MTU for the bridge.  This lets you have
>> whatever MTU
>> is appropriate on the child VLAN interfaces of the bridge.  If you
>> attempt to forward
>> from a port with a larger MTU to a smaller MTU, you get the expected 
>> behavior.
>
>
> you mean larger MTU on the vlan device on the bridge to a smaller MTU
> on the bridge port ?.
> this will result in dropping the packet. how is this supposed to be
> expected default behavior ?.

If a user configures the VLAN device with a larger MTU than the port,
then yes, I expect the packet to be dropped.  That's a misconfiguration of
either the VLAN's or the port's MTU.  We can't protect the user from that by
simply making sure they can't mismatch the MTUs, because you can still get
packets dropped during ingress from the large-MTU VLAN.

>> Forcing the end user to configure all the ports to the maximum MTU of
>> all the VLANs
>> on the bridge is wrong IMHO.
>> You then risk attempting to forward
>> oversize packets
>> on a network that can't support that.
>
> I am a bit confused: Are you trying to solve the config problem by
> implicitly making it the default and there by creating the oversize
> packet drop issue by default ?

I am attempting to allow a configuration that lets me choose the appropriate
MTU size for each port.  With the previous code to configure a VLAN device
with an MTU of 9000, I would need to configure all the ports
on the bridge with an MTU of 9000 regardless of whether those ports should
be passing large MTU traffic.  I am creating a potential packet drop issue
by forwarding traffic between ports that have artificially inflated MTUs.


 I don't think those drops are unexpected.  If a user has misconfigured
 the bridge
 we can't be expected to fix that for them.  It is the user's
responsibility to ensure
 that the ports on the VLAN have a size consistent with the traffic
 they expect to
 pass.

>>>
>>> By default they are not expected today. The problem is changing the bridge
>>> to max mtu changes 'all' the vlan devices on top of the vlan aware bridge to
>>> max mtu by default which makes drops at the bridge driver more common if the
>>> user had mixed mtu on its ports.
>>
>> That's not been my experience.  The MTU on the vlan devices is only
>> limited by the
>> bridges's MTU.  Setting the bridge MTU doesn't change the children
>> VLAN devices MTUs.
>
> It does not, but it now allows vlan devices on the bridge to have a
> larger MTU if they need to (some or all of them).
> This is consistent with vxlan driver as well: picks default mtu to be
> lower or equal to the default dst dev mtu and allows user to override
> it with a larger MTU.

The VLAN device MTU can't be larger than the parent MTU.  The end user
is just going to set the parent bridge MTU to be larger anyway, so why not
just make that the default?
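As a sketch, the configuration being argued for (interface names assumed, not from the thread):

```sh
# The bridge itself carries the max MTU, so a jumbo VLAN upper device
# is possible without inflating every port's MTU:
ip link set dev br0 mtu 9000                        # bridge device
ip link add link br0 name br0.100 type vlan id 100  # VLAN device on the bridge
ip link set dev br0.100 mtu 9000                    # jumbo only where needed
# Ports keep their own MTUs; forwarding a jumbo frame to a 1500-byte
# port drops it, which is the behavior discussed above.
```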


Re: [PATCH 00/47] Netfilter/IPVS updates for net-next

2018-04-03 Thread Pablo Neira Ayuso
Hi Rafal,

On Tue, Apr 03, 2018 at 08:13:49AM +0200, Rafał Miłecki wrote:
> Hi Pablo,
> 
[...]
> I see you mentioned changes from Felix in the pull request but:
> 1) I don't see any commits from Felix listed below
> 2) I don't think you sent any of these patches
> 
> Can you take a look at what has happened to them, please?

I will include them in my next pull request once net-next opens up
again.

Thanks.


Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-04-03 Thread Ingo Molnar

* Pavel Machek  wrote:

> > > > Yeah, so generic memcpy() replacement is only feasible I think if the 
> > > > most 
> > > > optimistic implementation is actually correct:
> > > > 
> > > >  - if no preempt disable()/enable() is required
> > > > 
> > > >  - if direct access to the AVX[2] registers does not disturb legacy FPU 
> > > > state in 
> > > >any fashion
> > > > 
> > > >  - if direct access to the AVX[2] registers cannot raise weird 
> > > > exceptions or have
> > > >weird behavior if the FPU control word is modified to non-standard 
> > > > values by 
> > > >untrusted user-space
> > > > 
> > > > If we have to touch the FPU tag or control words then it's probably 
> > > > only good for 
> > > > a specialized API.
> > > 
> > > I did not mean to have a general memcpy replacement. Rather something like
> > > magic_memcpy() which falls back to memcpy when AVX is not usable or the
> > > length does not justify the AVX stuff at all.
> > 
> > OK, fair enough.
> > 
> > Note that a generic version might still be worth trying out, if and only if 
> > it's 
> > safe to access those vector registers directly: modern x86 CPUs will do 
> > their 
> > non-constant memcpy()s via the common memcpy_erms() function - which could 
> > in 
> > theory be an easy common point to be (cpufeatures-) patched to an AVX2 
> > variant, if 
> > size (and alignment, perhaps) is a multiple of 32 bytes or so.
> 
> How is AVX2 supposed to help the memcpy speed?
> 
> If the copy is small, constant overhead will dominate, and I don't
> think AVX2 is going to be win there.

There are several advantages:

1)

"REP; MOVS" (also called ERMS) has a significant constant "setup cost".

In the scheme I suggested (and if it's possible), single-register AVX2 access
on the other hand has a setup cost on the order of a few cycles.

2)

AVX2 has various non-temporal load and store behavioral variants - while
"REP; MOVS" doesn't (or rather, any such caching optimizations, to the extent
they exist, are hidden in the microcode).

> If the copy is big, well, the copy loop will likely run out of L1 and maybe 
> even 
> out of L2, and at that point speed of the loop does not matter because memory 
> is 
> slow...?

In many cases "memory" will be something very fast, such as another level of 
cache. Also, on NUMA "memory" can also be something locally wired to the CPU - 
again accessible at ridiculous bandwidths.

Nevertheless ERMS probably wins for the regular bulk memcpy by a few
percentage points, so I don't think AVX2 is a win in the generic
large-memcpy case, as long as continued caching of both the loads and
the stores is beneficial.
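As a concrete (userspace, hypothetical) illustration of the single-register copy being discussed: runtime-dispatched so it falls back to memcpy() when AVX2 is absent or the size isn't a 32-byte multiple. This deliberately ignores the kernel FPU-context questions raised earlier in the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* One 256-bit register per iteration, unaligned load + store. */
__attribute__((target("avx2")))
static void copy_avx2(uint8_t *dst, const uint8_t *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i += 32) {
		__m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
		_mm256_storeu_si256((__m256i *)(dst + i), v);
	}
}
#endif

/* Runtime dispatch, loosely analogous to a cpufeatures-patched memcpy_erms(). */
static void copy32(void *dst, const void *src, size_t len)
{
#if defined(__x86_64__) || defined(__i386__)
	if (__builtin_cpu_supports("avx2") && (len % 32) == 0) {
		copy_avx2(dst, src, len);
		return;
	}
#endif
	memcpy(dst, src, len);
}
```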

Thanks,

Ingo


Re: [GIT PULL] remove in-kernel calls to syscalls

2018-04-03 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Mon, Apr 2, 2018 at 12:04 PM, Dominik Brodowski
>  wrote:
> >
> > This patchset removes all in-kernel calls to syscall functions in the
> > kernel with the exception of arch/.
> 
> Ok, this finished off my arch updates for today, I'll probably move on
> to driver pulls tomorrow.
> 
> Anyway, it's in my tree, will push out once my test build finishes.

Thanks!

Dominik, if you submit the x86 ptregs conversion patches in the next 1-2 days 
on 
top of Linus's tree (642e7fd23353), then I can apply them and if they are 
problem-free I can perhaps tempt Linus with a pull request early next week or 
so.

The Spectre angle does make me want those changes as well.

Thanks,

Ingo


[net-next V9 PATCH 02/16] xdp: introduce xdp_return_frame API and use in cpumap

2018-04-03 Thread Jesper Dangaard Brouer
Introduce an xdp_return_frame API, and convert over cpumap as
the first user, given it has a queued XDP frame structure to leverage.

V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.
V6: Remove comment that id will be added later (Req by Alex Duyck)
V8: Rename enum mem_type to xdp_mem_type (found by kbuild test robot)

Signed-off-by: Jesper Dangaard Brouer 
---
 include/net/xdp.h   |   27 +++
 kernel/bpf/cpumap.c |   60 +++
 net/core/xdp.c  |   18 +++
 3 files changed, 81 insertions(+), 24 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index b2362ddfa694..e4207699c410 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -33,16 +33,43 @@
  * also mandatory during RX-ring setup.
  */
 
+enum xdp_mem_type {
+   MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
+   MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */
+   MEM_TYPE_MAX,
+};
+
+struct xdp_mem_info {
+   u32 type; /* enum xdp_mem_type, but known size type */
+};
+
 struct xdp_rxq_info {
struct net_device *dev;
u32 queue_index;
u32 reg_state;
+   struct xdp_mem_info mem;
 } cacheline_aligned; /* perf critical, avoid false-sharing */
 
+
+static inline
+void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+{
+   if (mem->type == MEM_TYPE_PAGE_SHARED)
+   page_frag_free(data);
+
+   if (mem->type == MEM_TYPE_PAGE_ORDER0) {
+   struct page *page = virt_to_page(data); /* Assumes order0 page*/
+
+   put_page(page);
+   }
+}
+
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 struct net_device *dev, u32 queue_index);
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
 void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
 bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
+int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
+  enum xdp_mem_type type, void *allocator);
 
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index a4bb0b34375a..3e4bbcbe3e86 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -137,27 +138,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
return ERR_PTR(err);
 }
 
-static void __cpu_map_queue_destructor(void *ptr)
-{
-   /* The tear-down procedure should have made sure that queue is
-* empty.  See __cpu_map_entry_replace() and work-queue
-* invoked cpu_map_kthread_stop(). Catch any broken behaviour
-* gracefully and warn once.
-*/
-   if (WARN_ON_ONCE(ptr))
-   page_frag_free(ptr);
-}
-
-static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
-{
-   if (atomic_dec_and_test(&rcpu->refcnt)) {
-   /* The queue should be empty at this point */
-   ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor);
-   kfree(rcpu->queue);
-   kfree(rcpu);
-   }
-}
-
 static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
 {
atomic_inc(&rcpu->refcnt);
@@ -188,6 +168,10 @@ struct xdp_pkt {
u16 len;
u16 headroom;
u16 metasize;
+   /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+* while mem info is valid on remote CPU.
+*/
+   struct xdp_mem_info mem;
struct net_device *dev_rx;
 };
 
@@ -213,6 +197,9 @@ static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff 
*xdp)
xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
xdp_pkt->metasize = metasize;
 
+   /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
+   xdp_pkt->mem = xdp->rxq->mem;
+
return xdp_pkt;
 }
 
@@ -265,6 +252,31 @@ static struct sk_buff *cpu_map_build_skb(struct 
bpf_cpu_map_entry *rcpu,
return skb;
 }
 
+static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
+{
+   /* The tear-down procedure should have made sure that queue is
+* empty.  See __cpu_map_entry_replace() and work-queue
+* invoked cpu_map_kthread_stop(). Catch any broken behaviour
+* gracefully and warn once.
+*/
+   struct xdp_pkt *xdp_pkt;
+
+   while ((xdp_pkt = ptr_ring_consume(ring)))
+   if (WARN_ON_ONCE(xdp_pkt))
+   xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+}
+
+static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
+{
+   if (atomic_dec_and_test(&rcpu->refcnt)) {
+   /* The queue should be empty at this point */
+   __cpu_map_ring_cleanup(rcpu->queue);
+   ptr_ring_cleanup(rcpu->queue, NULL);
+   kfree(rcpu->queue);
+   kfree(rcpu);
+   }
+}
+
 static int cpu_map_kthread_run(void *data)
 {
struct bpf_cpu_map_entry *rcpu = data;
@@ -3

[net-next V9 PATCH 07/16] virtio_net: convert to use generic xdp_frame and xdp_return_frame API

2018-04-03 Thread Jesper Dangaard Brouer
The virtio_net driver assumes XDP frames are always released based on
page refcnt (via put_page).  Thus, it only queues the XDP data pointer
address and uses virt_to_head_page() to retrieve struct page.

Use the XDP return API to get away from such assumptions. Instead
queue an xdp_frame, which allows us to use the xdp_return_frame API
when releasing the frame.

V8: Avoid endianness issues (found by kbuild test robot)
V9: Change __virtnet_xdp_xmit from bool to int return value (found by Dan 
Carpenter)

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/virtio_net.c |   54 +-
 1 file changed, 29 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec7411e..f50e1ad81ad4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -415,38 +415,48 @@ static void virtnet_xdp_flush(struct net_device *dev)
virtqueue_kick(sq->vq);
 }
 
-static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
-  struct xdp_buff *xdp)
+static int __virtnet_xdp_xmit(struct virtnet_info *vi,
+ struct xdp_buff *xdp)
 {
struct virtio_net_hdr_mrg_rxbuf *hdr;
-   unsigned int len;
+   struct xdp_frame *xdpf, *xdpf_sent;
struct send_queue *sq;
+   unsigned int len;
unsigned int qp;
-   void *xdp_sent;
int err;
 
qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id();
sq = &vi->sq[qp];
 
/* Free up any pending old buffers before queueing new ones. */
-   while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-   struct page *sent_page = virt_to_head_page(xdp_sent);
+   while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
+   xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
 
-   put_page(sent_page);
-   }
+   xdpf = convert_to_xdp_frame(xdp);
+   if (unlikely(!xdpf))
+   return -EOVERFLOW;
+
+   /* virtqueue wants to use data area in front of packet */
+   if (unlikely(xdpf->metasize > 0))
+   return -EOPNOTSUPP;
 
-   xdp->data -= vi->hdr_len;
+   if (unlikely(xdpf->headroom < vi->hdr_len))
+   return -EOVERFLOW;
+
+   /* Make room for virtqueue hdr (also change xdpf->headroom?) */
+   xdpf->data -= vi->hdr_len;
/* Zero header and leave csum up to XDP layers */
-   hdr = xdp->data;
+   hdr = xdpf->data;
memset(hdr, 0, vi->hdr_len);
+   xdpf->len   += vi->hdr_len;
 
-   sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+   sg_init_one(sq->sg, xdpf->data, xdpf->len);
 
-   err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp->data, GFP_ATOMIC);
+   err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC);
if (unlikely(err))
-   return false; /* Caller handle free/refcnt */
+   return -ENOSPC; /* Caller handle free/refcnt */
 
-   return true;
+   return 0;
 }
 
 static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
@@ -454,7 +464,6 @@ static int virtnet_xdp_xmit(struct net_device *dev, struct 
xdp_buff *xdp)
struct virtnet_info *vi = netdev_priv(dev);
struct receive_queue *rq = vi->rq;
struct bpf_prog *xdp_prog;
-   bool sent;
 
/* Only allow ndo_xdp_xmit if XDP is loaded on dev, as this
 * indicate XDP resources have been successfully allocated.
@@ -463,10 +472,7 @@ static int virtnet_xdp_xmit(struct net_device *dev, struct 
xdp_buff *xdp)
if (!xdp_prog)
return -ENXIO;
 
-   sent = __virtnet_xdp_xmit(vi, xdp);
-   if (!sent)
-   return -ENOSPC;
-   return 0;
+   return __virtnet_xdp_xmit(vi, xdp);
 }
 
 static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
@@ -555,7 +561,6 @@ static struct sk_buff *receive_small(struct net_device *dev,
struct page *page = virt_to_head_page(buf);
unsigned int delta = 0;
struct page *xdp_page;
-   bool sent;
int err;
 
len -= vi->hdr_len;
@@ -606,8 +611,8 @@ static struct sk_buff *receive_small(struct net_device *dev,
delta = orig_data - xdp.data;
break;
case XDP_TX:
-   sent = __virtnet_xdp_xmit(vi, &xdp);
-   if (unlikely(!sent)) {
+   err = __virtnet_xdp_xmit(vi, &xdp);
+   if (unlikely(err)) {
trace_xdp_exception(vi->dev, xdp_prog, act);
goto err_xdp;
}
@@ -690,7 +695,6 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
struct bpf_prog *xdp_prog;
unsigned int truesize;
unsigned int headroom = mergeable_ctx_to_headroom(ctx);
-   bool sent;
int err;
 
head_skb =

[net-next V9 PATCH 03/16] ixgbe: use xdp_return_frame API

2018-04-03 Thread Jesper Dangaard Brouer
Extend struct ixgbe_tx_buffer to store the xdp_mem_info.

Notice that this could be optimized further by putting the field into
a union in struct ixgbe_tx_buffer, but as this patchset
works towards removing the field again, that is not done here.

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |6 --
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 4f08c712e58e..abb5248e917e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -250,6 +250,7 @@ struct ixgbe_tx_buffer {
DEFINE_DMA_UNMAP_ADDR(dma);
DEFINE_DMA_UNMAP_LEN(len);
u32 tx_flags;
+   struct xdp_mem_info xdp_mem;
 };
 
 struct ixgbe_rx_buffer {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index afadba99f7b8..0bfe6cf2bf8b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1216,7 +1216,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector 
*q_vector,
 
/* free the skb */
if (ring_is_xdp(tx_ring))
-   page_frag_free(tx_buffer->data);
+   xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
else
napi_consume_skb(tx_buffer->skb, napi_budget);
 
@@ -5797,7 +5797,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring 
*tx_ring)
 
/* Free all the Tx ring sk_buffs */
if (ring_is_xdp(tx_ring))
-   page_frag_free(tx_buffer->data);
+   xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
else
dev_kfree_skb_any(tx_buffer->skb);
 
@@ -8366,6 +8366,8 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter 
*adapter,
dma_unmap_len_set(tx_buffer, len, len);
dma_unmap_addr_set(tx_buffer, dma, dma);
tx_buffer->data = xdp->data;
+   tx_buffer->xdp_mem = xdp->rxq->mem;
+
tx_desc->read.buffer_addr = cpu_to_le64(dma);
 
/* put descriptor type bits */



[net-next V9 PATCH 06/16] tun: convert to use generic xdp_frame and xdp_return_frame API

2018-04-03 Thread Jesper Dangaard Brouer
From: Jesper Dangaard Brouer 

The tuntap driver invented its own driver-specific way of queuing
XDP packets, by storing the xdp_buff information in the top of
the XDP frame data.

Convert it over to use the more generic xdp_frame structure.  The
main problem with the in-driver method is that the xdp_rxq_info pointer
cannot be trusted/used when dequeuing the frame.

V3: Remove check based on feedback from Jason

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/tun.c  |   43 ---
 drivers/vhost/net.c|7 ---
 include/linux/if_tun.h |4 ++--
 3 files changed, 26 insertions(+), 28 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a1ba262f40ad..714735c6d3ff 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -248,11 +248,11 @@ struct veth {
__be16 h_vlan_TCI;
 };
 
-bool tun_is_xdp_buff(void *ptr)
+bool tun_is_xdp_frame(void *ptr)
 {
return (unsigned long)ptr & TUN_XDP_FLAG;
 }
-EXPORT_SYMBOL(tun_is_xdp_buff);
+EXPORT_SYMBOL(tun_is_xdp_frame);
 
 void *tun_xdp_to_ptr(void *ptr)
 {
@@ -660,10 +660,10 @@ void tun_ptr_free(void *ptr)
 {
if (!ptr)
return;
-   if (tun_is_xdp_buff(ptr)) {
-   struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+   if (tun_is_xdp_frame(ptr)) {
+   struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-   put_page(virt_to_head_page(xdp->data));
+   xdp_return_frame(xdpf->data, &xdpf->mem);
} else {
__skb_array_destroy_skb(ptr);
}
@@ -1291,17 +1291,14 @@ static const struct net_device_ops tun_netdev_ops = {
 static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 {
struct tun_struct *tun = netdev_priv(dev);
-   struct xdp_buff *buff = xdp->data_hard_start;
-   int headroom = xdp->data - xdp->data_hard_start;
+   struct xdp_frame *frame;
struct tun_file *tfile;
u32 numqueues;
int ret = 0;
 
-   /* Assure headroom is available and buff is properly aligned */
-   if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp)))
-   return -ENOSPC;
-
-   *buff = *xdp;
+   frame = convert_to_xdp_frame(xdp);
+   if (unlikely(!frame))
+   return -EOVERFLOW;
 
rcu_read_lock();
 
@@ -1316,7 +1313,7 @@ static int tun_xdp_xmit(struct net_device *dev, struct 
xdp_buff *xdp)
/* Encode the XDP flag into lowest bit for consumer to differ
 * XDP buffer from sk_buff.
 */
-   if (ptr_ring_produce(&tfile->tx_ring, tun_xdp_to_ptr(buff))) {
+   if (ptr_ring_produce(&tfile->tx_ring, tun_xdp_to_ptr(frame))) {
this_cpu_inc(tun->pcpu_stats->tx_dropped);
ret = -ENOSPC;
}
@@ -1994,11 +1991,11 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
 
 static ssize_t tun_put_user_xdp(struct tun_struct *tun,
struct tun_file *tfile,
-   struct xdp_buff *xdp,
+   struct xdp_frame *xdp_frame,
struct iov_iter *iter)
 {
int vnet_hdr_sz = 0;
-   size_t size = xdp->data_end - xdp->data;
+   size_t size = xdp_frame->len;
struct tun_pcpu_stats *stats;
size_t ret;
 
@@ -2014,7 +2011,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso));
}
 
-   ret = copy_to_iter(xdp->data, size, iter) + vnet_hdr_sz;
+   ret = copy_to_iter(xdp_frame->data, size, iter) + vnet_hdr_sz;
 
stats = get_cpu_ptr(tun->pcpu_stats);
u64_stats_update_begin(&stats->syncp);
@@ -2182,11 +2179,11 @@ static ssize_t tun_do_read(struct tun_struct *tun, 
struct tun_file *tfile,
return err;
}
 
-   if (tun_is_xdp_buff(ptr)) {
-   struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+   if (tun_is_xdp_frame(ptr)) {
+   struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-   ret = tun_put_user_xdp(tun, tfile, xdp, to);
-   put_page(virt_to_head_page(xdp->data));
+   ret = tun_put_user_xdp(tun, tfile, xdpf, to);
+   xdp_return_frame(xdpf->data, &xdpf->mem);
} else {
struct sk_buff *skb = ptr;
 
@@ -2425,10 +2422,10 @@ static int tun_recvmsg(struct socket *sock, struct 
msghdr *m, size_t total_len,
 static int tun_ptr_peek_len(void *ptr)
 {
if (likely(ptr)) {
-   if (tun_is_xdp_buff(ptr)) {
-   struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+   if (tun_is_xdp_frame(ptr)) {
+   struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-   return xdp->data_end - xdp->data;
+   return xdpf->len;
}
return __skb_array_len_with_tag(ptr);
} else {
diff --git a/d

[net-next V9 PATCH 00/16] XDP redirect memory return API

2018-04-03 Thread Jesper Dangaard Brouer
This is V9, but it's worth mentioning that V8 was sent against
net-next, because i40e got XDP_REDIRECT support in-between V6, and it
doesn't exist in bpf-next yet.  Most significant change in V8 was that
page_pool only gets compiled into the kernel when a driver's Kconfig
'select's the feature.

This patchset works towards supporting different XDP RX-ring memory
allocators.  As this will be needed by the AF_XDP zero-copy mode.

The patchset uses mlx5 as the sample driver, which gets XDP_REDIRECT
RX-mode implemented, but not ndo_xdp_xmit (as this API is subject to
change throughout the patchset).

A new struct xdp_frame is introduced (modeled after cpumap xdp_pkt).
Both ndo_xdp_xmit and the new xdp_return_frame end up using this.

Support for a driver-supplied allocator is implemented, and a
refurbished version of page_pool is the first return allocator type
introduced.  This will be an integration point for AF_XDP zero-copy.

The mlx5 driver evolves into using the page_pool, and sees a performance
increase (with ndo_xdp_xmit out the ixgbe driver) from 6 Mpps to 12 Mpps.


The patchset stops at 16 patches (one over the limit), but more API
changes are planned, specifically extending the ndo_xdp_xmit and
xdp_return_frame APIs to support bulking, as this will address some
known limits.

V2: Updated according to Tariq's feedback
V3: Updated based on feedback from Jason Wang and Alex Duyck
V4: Updated based on feedback from Tariq and Jason
V5: Fix SPDX license, add Tariq's reviews, improve patch desc for perf test
V6: Updated based on feedback from Eric Dumazet and Alex Duyck
V7: Adapt to i40e that got XDP_REDIRECT support in-between
V8: Updated based on feedback kbuild test robot, and adjust for mlx5 changes
V9:
 Remove some inline statements, let compiler decide what to inline
 Fix return value in virtio_net driver
 Adjust for mlx5 changes in-between submissions

---

Jesper Dangaard Brouer (16):
  mlx5: basic XDP_REDIRECT forward support
  xdp: introduce xdp_return_frame API and use in cpumap
  ixgbe: use xdp_return_frame API
  xdp: move struct xdp_buff from filter.h to xdp.h
  xdp: introduce a new xdp_frame type
  tun: convert to use generic xdp_frame and xdp_return_frame API
  virtio_net: convert to use generic xdp_frame and xdp_return_frame API
  bpf: cpumap convert to use generic xdp_frame
  i40e: convert to use generic xdp_frame and xdp_return_frame API
  mlx5: register a memory model when XDP is enabled
  xdp: rhashtable with allocator ID to pointer mapping
  page_pool: refurbish version of page_pool code
  xdp: allow page_pool as an allocator type in xdp_return_frame
  mlx5: use page_pool for xdp_return_frame call
  xdp: transition into using xdp_frame for return API
  xdp: transition into using xdp_frame for ndo_xdp_xmit


 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   33 ++
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |3 
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |3 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   37 ++
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig   |1 
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |4 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   37 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   42 ++-
 drivers/net/tun.c |   60 ++--
 drivers/net/virtio_net.c  |   67 +++-
 drivers/vhost/net.c   |7 
 include/linux/filter.h|   24 --
 include/linux/if_tun.h|4 
 include/linux/netdevice.h |4 
 include/net/page_pool.h   |  143 +
 include/net/xdp.h |   83 +
 kernel/bpf/cpumap.c   |  132 +++--
 net/Kconfig   |3 
 net/core/Makefile |1 
 net/core/filter.c |   17 +
 net/core/page_pool.c  |  317 +
 net/core/xdp.c|  269 ++
 22 files changed, 1093 insertions(+), 198 deletions(-)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c

--


[net-next V9 PATCH 01/16] mlx5: basic XDP_REDIRECT forward support

2018-04-03 Thread Jesper Dangaard Brouer
This implements basic XDP redirect support in mlx5 driver.

Notice that the ndo_xdp_xmit() is NOT implemented, because that API
needs some changes that this patchset is working towards.

The main purpose of this patch is to have different drivers doing
XDP_REDIRECT, to show how different memory models behave in a
cross-driver world.

Update(pre-RFCv2 Tariq): Need to DMA unmap page before xdp_do_redirect,
as the return API does not exist yet to keep this mapped.

Update(pre-RFCv3 Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing,
introduce xdpsq.db.redirect_flush boolean.

V9: Adjust for commit 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication")

Signed-off-by: Jesper Dangaard Brouer 
Reviewed-by: Tariq Toukan 
Acked-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h|1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |   27 ---
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 30cad07be2b5..1a05d1072c5e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -392,6 +392,7 @@ struct mlx5e_xdpsq {
struct {
struct mlx5e_dma_info *di;
bool   doorbell;
+   bool   redirect_flush;
} db;
 
/* read only */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 176645762e49..0e24be05907f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -236,14 +236,20 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq 
*rq,
return 0;
 }
 
+static void mlx5e_page_dma_unmap(struct mlx5e_rq *rq,
+   struct mlx5e_dma_info *dma_info)
+{
+   dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
+  rq->buff.map_dir);
+}
+
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
bool recycle)
 {
if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
return;
 
-   dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
-  rq->buff.map_dir);
+   mlx5e_page_dma_unmap(rq, dma_info);
put_page(dma_info->page);
 }
 
@@ -800,9 +806,10 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
   struct mlx5e_dma_info *di,
   void *va, u16 *rx_headroom, u32 *len)
 {
-   const struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
+   struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
struct xdp_buff xdp;
u32 act;
+   int err;
 
if (!prog)
return false;
@@ -823,6 +830,15 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
if (unlikely(!mlx5e_xmit_xdp_frame(rq, di, &xdp)))
trace_xdp_exception(rq->netdev, prog, act);
return true;
+   case XDP_REDIRECT:
+   /* When XDP enabled then page-refcnt==1 here */
+   err = xdp_do_redirect(rq->netdev, &xdp, prog);
+   if (!err) {
+   __set_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags);
+   rq->xdpsq.db.redirect_flush = true;
+   mlx5e_page_dma_unmap(rq, di);
+   }
+   return true;
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:
@@ -1140,6 +1156,11 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
xdpsq->db.doorbell = false;
}
 
+   if (xdpsq->db.redirect_flush) {
+   xdp_do_flush_map();
+   xdpsq->db.redirect_flush = false;
+   }
+
mlx5_cqwq_update_db_record(&cq->wq);
 
/* ensure cq space is freed before enabling more cqes */



[net-next V9 PATCH 04/16] xdp: move struct xdp_buff from filter.h to xdp.h

2018-04-03 Thread Jesper Dangaard Brouer
This is done to prepare for the next patch, and it is also
nice to move this XDP-related struct out of filter.h.

Signed-off-by: Jesper Dangaard Brouer 
---
 include/linux/filter.h |   24 +---
 include/net/xdp.h  |   22 ++
 2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index fc4e8f91b03d..4da8b2308174 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -30,6 +30,7 @@ struct sock;
 struct seccomp_data;
 struct bpf_prog_aux;
 struct xdp_rxq_info;
+struct xdp_buff;
 
 /* ArgX, context and stack frame pointer register positions. Note,
  * Arg1, Arg2, Arg3, etc are used as argument mappings of function
@@ -500,14 +501,6 @@ struct bpf_skb_data_end {
void *data_end;
 };
 
-struct xdp_buff {
-   void *data;
-   void *data_end;
-   void *data_meta;
-   void *data_hard_start;
-   struct xdp_rxq_info *rxq;
-};
-
 struct sk_msg_buff {
void *data;
void *data_end;
@@ -772,21 +765,6 @@ int xdp_do_redirect(struct net_device *dev,
struct bpf_prog *prog);
 void xdp_do_flush_map(void);
 
-/* Drivers not supporting XDP metadata can use this helper, which
- * rejects any room expansion for metadata as a result.
- */
-static __always_inline void
-xdp_set_data_meta_invalid(struct xdp_buff *xdp)
-{
-   xdp->data_meta = xdp->data + 1;
-}
-
-static __always_inline bool
-xdp_data_meta_unsupported(const struct xdp_buff *xdp)
-{
-   return unlikely(xdp->data_meta > xdp->data);
-}
-
 void bpf_warn_invalid_xdp_action(u32 act);
 
 struct sock *do_sk_redirect_map(struct sk_buff *skb);
diff --git a/include/net/xdp.h b/include/net/xdp.h
index e4207699c410..15f8ade008b5 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -50,6 +50,13 @@ struct xdp_rxq_info {
struct xdp_mem_info mem;
 } cacheline_aligned; /* perf critical, avoid false-sharing */
 
+struct xdp_buff {
+   void *data;
+   void *data_end;
+   void *data_meta;
+   void *data_hard_start;
+   struct xdp_rxq_info *rxq;
+};
 
 static inline
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
@@ -72,4 +79,19 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
   enum xdp_mem_type type, void *allocator);
 
+/* Drivers not supporting XDP metadata can use this helper, which
+ * rejects any room expansion for metadata as a result.
+ */
+static __always_inline void
+xdp_set_data_meta_invalid(struct xdp_buff *xdp)
+{
+   xdp->data_meta = xdp->data + 1;
+}
+
+static __always_inline bool
+xdp_data_meta_unsupported(const struct xdp_buff *xdp)
+{
+   return unlikely(xdp->data_meta > xdp->data);
+}
+
 #endif /* __LINUX_NET_XDP_H__ */



[net-next V9 PATCH 08/16] bpf: cpumap convert to use generic xdp_frame

2018-04-03 Thread Jesper Dangaard Brouer
The generic xdp_frame format was inspired by cpumap's own internal
xdp_pkt format.  It is now time to convert cpumap over to the generic
xdp_frame format.  The cpumap needs one extra field, dev_rx.

Signed-off-by: Jesper Dangaard Brouer 
---
 include/net/xdp.h   |1 +
 kernel/bpf/cpumap.c |  100 ++-
 2 files changed, 29 insertions(+), 72 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 756c42811e78..ea3773f94f65 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -67,6 +67,7 @@ struct xdp_frame {
 * while mem info is valid on remote CPU.
 */
struct xdp_mem_info mem;
+   struct net_device *dev_rx; /* used by cpumap */
 };
 
 /* Convert xdp_buff to xdp_frame */
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 3e4bbcbe3e86..bcdc4dea5ce7 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -159,52 +159,8 @@ static void cpu_map_kthread_stop(struct work_struct *work)
kthread_stop(rcpu->kthread);
 }
 
-/* For now, xdp_pkt is a cpumap internal data structure, with info
- * carried between enqueue to dequeue. It is mapped into the top
- * headroom of the packet, to avoid allocating separate mem.
- */
-struct xdp_pkt {
-   void *data;
-   u16 len;
-   u16 headroom;
-   u16 metasize;
-   /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
-* while mem info is valid on remote CPU.
-*/
-   struct xdp_mem_info mem;
-   struct net_device *dev_rx;
-};
-
-/* Convert xdp_buff to xdp_pkt */
-static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
-{
-   struct xdp_pkt *xdp_pkt;
-   int metasize;
-   int headroom;
-
-   /* Assure headroom is available for storing info */
-   headroom = xdp->data - xdp->data_hard_start;
-   metasize = xdp->data - xdp->data_meta;
-   metasize = metasize > 0 ? metasize : 0;
-   if (unlikely((headroom - metasize) < sizeof(*xdp_pkt)))
-   return NULL;
-
-   /* Store info in top of packet */
-   xdp_pkt = xdp->data_hard_start;
-
-   xdp_pkt->data = xdp->data;
-   xdp_pkt->len  = xdp->data_end - xdp->data;
-   xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
-   xdp_pkt->metasize = metasize;
-
-   /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
-   xdp_pkt->mem = xdp->rxq->mem;
-
-   return xdp_pkt;
-}
-
 static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
-struct xdp_pkt *xdp_pkt)
+struct xdp_frame *xdpf)
 {
unsigned int frame_size;
void *pkt_data_start;
@@ -219,7 +175,7 @@ static struct sk_buff *cpu_map_build_skb(struct 
bpf_cpu_map_entry *rcpu,
 * would be preferred to set frame_size to 2048 or 4096
 * depending on the driver.
 *   frame_size = 2048;
-*   frame_len  = frame_size - sizeof(*xdp_pkt);
+*   frame_len  = frame_size - sizeof(*xdp_frame);
 *
 * Instead, with info avail, skb_shared_info in placed after
 * packet len.  This, unfortunately fakes the truesize.
@@ -227,21 +183,21 @@ static struct sk_buff *cpu_map_build_skb(struct 
bpf_cpu_map_entry *rcpu,
 * is not at a fixed memory location, with mixed length
 * packets, which is bad for cache-line hotness.
 */
-   frame_size = SKB_DATA_ALIGN(xdp_pkt->len) + xdp_pkt->headroom +
+   frame_size = SKB_DATA_ALIGN(xdpf->len) + xdpf->headroom +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-   pkt_data_start = xdp_pkt->data - xdp_pkt->headroom;
+   pkt_data_start = xdpf->data - xdpf->headroom;
skb = build_skb(pkt_data_start, frame_size);
if (!skb)
return NULL;
 
-   skb_reserve(skb, xdp_pkt->headroom);
-   __skb_put(skb, xdp_pkt->len);
-   if (xdp_pkt->metasize)
-   skb_metadata_set(skb, xdp_pkt->metasize);
+   skb_reserve(skb, xdpf->headroom);
+   __skb_put(skb, xdpf->len);
+   if (xdpf->metasize)
+   skb_metadata_set(skb, xdpf->metasize);
 
/* Essential SKB info: protocol and skb->dev */
-   skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx);
+   skb->protocol = eth_type_trans(skb, xdpf->dev_rx);
 
/* Optional SKB info, currently missing:
 * - HW checksum info   (skb->ip_summed)
@@ -259,11 +215,11 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 * invoked cpu_map_kthread_stop(). Catch any broken behaviour
 * gracefully and warn once.
 */
-   struct xdp_pkt *xdp_pkt;
+   struct xdp_frame *xdpf;
 
-   while ((xdp_pkt = ptr_ring_consume(ring)))
-   if (WARN_ON_ONCE(xdp_pkt))
-   xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+   while ((xdpf = ptr_ring_consume(ring)))
+   if (WARN_ON_ONCE(xdpf))
+

[net-next V9 PATCH 05/16] xdp: introduce a new xdp_frame type

2018-04-03 Thread Jesper Dangaard Brouer
This is needed to convert drivers tuntap and virtio_net.

This is a generalization of what is done inside cpumap, which will be
converted later.

Signed-off-by: Jesper Dangaard Brouer 
---
 include/net/xdp.h |   40 
 1 file changed, 40 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 15f8ade008b5..756c42811e78 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -58,6 +58,46 @@ struct xdp_buff {
struct xdp_rxq_info *rxq;
 };
 
+struct xdp_frame {
+   void *data;
+   u16 len;
+   u16 headroom;
+   u16 metasize;
+   /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+* while mem info is valid on remote CPU.
+*/
+   struct xdp_mem_info mem;
+};
+
+/* Convert xdp_buff to xdp_frame */
+static inline
+struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
+{
+   struct xdp_frame *xdp_frame;
+   int metasize;
+   int headroom;
+
+   /* Assure headroom is available for storing info */
+   headroom = xdp->data - xdp->data_hard_start;
+   metasize = xdp->data - xdp->data_meta;
+   metasize = metasize > 0 ? metasize : 0;
+   if (unlikely((headroom - metasize) < sizeof(*xdp_frame)))
+   return NULL;
+
+   /* Store info in top of packet */
+   xdp_frame = xdp->data_hard_start;
+
+   xdp_frame->data = xdp->data;
+   xdp_frame->len  = xdp->data_end - xdp->data;
+   xdp_frame->headroom = headroom - sizeof(*xdp_frame);
+   xdp_frame->metasize = metasize;
+
+   /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
+   xdp_frame->mem = xdp->rxq->mem;
+
+   return xdp_frame;
+}
+
 static inline
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {



[net-next V9 PATCH 12/16] page_pool: refurbish version of page_pool code

2018-04-03 Thread Jesper Dangaard Brouer
The ndo_xdp_xmit API needs a fast page recycle mechanism for returning
pages at DMA-TX completion time, one with good cross-CPU
performance, given DMA-TX completion can happen on a remote CPU.

Refurbish my page_pool code, which was presented[1] at MM-summit 2016.
The page_pool code is adapted to not depend on the page allocator or on
integration into struct page.  The DMA mapping feature is kept,
even though it will not be activated/used in this patchset.

[1] 
http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes, don't return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.

V4: many small improvements and cleanups
- Add DOC comment section, that can be used by kernel-doc
- Improve fallback mode, to work better with refcnt based recycling
  e.g. remove a WARN as pointed out by Tariq
  e.g. quicker fallback if ptr_ring is empty.

V5: Fixed SPDX license as pointed out by Alexei

V6: Adjustments requested by Eric Dumazet
 - Adjust cacheline_aligned_in_smp usage/placement
 - Move rcu_head in struct page_pool
 - Free pages quicker on destroy, minimize resources delayed by an RCU period
 - Remove code for forward/backward compat ABI interface

V8: Issues found by kbuild test robot
 - Address sparse should be static warnings
 - Only compile+link when a driver use/select page_pool,
   mlx5 selects CONFIG_PAGE_POOL, although its first used in two patches

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig |1 
 include/net/page_pool.h |  129 +
 net/Kconfig |3 
 net/core/Makefile   |1 
 net/core/page_pool.c|  317 +++
 5 files changed, 451 insertions(+)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig 
b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index c032319f1cb9..12257034131e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -30,6 +30,7 @@ config MLX5_CORE_EN
bool "Mellanox Technologies ConnectX-4 Ethernet support"
depends on NETDEVICES && ETHERNET && INET && PCI && MLX5_CORE
depends on IPV6=y || IPV6=n || MLX5_CORE=m
+   select PAGE_POOL
default n
---help---
  Ethernet support in Mellanox Technologies ConnectX-4 NIC.
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
new file mode 100644
index ..1fe77db59518
--- /dev/null
+++ b/include/net/page_pool.h
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * page_pool.h
+ * Author: Jesper Dangaard Brouer 
+ * Copyright (C) 2016 Red Hat, Inc.
+ */
+
+/**
+ * DOC: page_pool allocator
+ *
+ * This page_pool allocator is optimized for the XDP mode that
+ * uses one-frame-per-page, but has fallbacks that act like the
+ * regular page allocator APIs.
+ *
+ * Basic use involves replacing alloc_pages() calls with the
+ * page_pool_alloc_pages() call.  Drivers should likely use
+ * page_pool_dev_alloc_pages() replacing dev_alloc_pages().
+ *
+ * If page_pool handles DMA mapping (use page->private), then API user
+ * is responsible for invoking page_pool_put_page() once.  In case of
+ * elevated refcnt, the DMA state is released, assuming other users of
+ * the page will eventually call put_page().
+ *
+ * If no DMA mapping is done, then it can act as shim-layer that
+ * fall-through to alloc_page.  As no state is kept on the page, the
+ * regular put_page() call is sufficient.
+ */
+#ifndef _NET_PAGE_POOL_H
+#define _NET_PAGE_POOL_H
+
+#include  /* Needed by ptr_ring */
+#include 
+#include 
+
+#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */
+#define PP_FLAG_ALL    PP_FLAG_DMA_MAP
+
+/*
+ * Fast allocation side cache array/stack
+ *
+ * The cache size and refill watermark is related to the network
+ * use-case.  The NAPI budget is 64 packets.  After a NAPI poll the RX
+ * ring is usually refilled and the max consumed elements will be 64,
+ * thus a natural max size of objects needed in the cache.
+ *
+ * Keeping room for more objects is due to the XDP_DROP use-case.
+ * XDP_DROP allows the opportunity to recycle objects directly into
+ * this array, as it shares the same softirq/NAPI protection.  If the
+ * cache is already full (or partly full) then the XDP_DROP recycles
+ * would have to take a slower code path.
+ */
+#define PP_ALLOC_CACHE_SIZE    128
+#define PP_ALLOC_CACHE_REFILL  64
+struct pp_alloc_cache {
+   u32 count;
+   void *cache[PP_ALLOC_CACHE_SIZE];
+};
+
+struct page_pool_params {
+   unsigned intflags;
+   unsigned intorder;
+   unsigned intpool_size;
+   int nid;  /* Numa node id to allocate from pages from 

[net-next V9 PATCH 09/16] i40e: convert to use generic xdp_frame and xdp_return_frame API

2018-04-03 Thread Jesper Dangaard Brouer
Also convert driver i40e, which very recently got XDP_REDIRECT support
in commit d9314c474d4f ("i40e: add support for XDP_REDIRECT").

V7: This patch got added in V7 of this patchset.

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |   20 +++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |1 +
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f174c72480ab..96c54cbfb1f9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -638,7 +638,8 @@ static void i40e_unmap_and_free_tx_resource(struct 
i40e_ring *ring,
if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB)
kfree(tx_buffer->raw_buf);
else if (ring_is_xdp(ring))
-   page_frag_free(tx_buffer->raw_buf);
+   xdp_return_frame(tx_buffer->xdpf->data,
+&tx_buffer->xdpf->mem);
else
dev_kfree_skb_any(tx_buffer->skb);
if (dma_unmap_len(tx_buffer, len))
@@ -841,7 +842,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
/* free the skb/XDP data */
if (ring_is_xdp(tx_ring))
-   page_frag_free(tx_buf->raw_buf);
+   xdp_return_frame(tx_buf->xdpf->data, 
&tx_buf->xdpf->mem);
else
napi_consume_skb(tx_buf->skb, napi_budget);
 
@@ -2225,6 +2226,8 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring 
*rx_ring,
if (!xdp_prog)
goto xdp_out;
 
+   prefetchw(xdp->data_hard_start); /* xdp_frame write */
+
act = bpf_prog_run_xdp(xdp_prog, xdp);
switch (act) {
case XDP_PASS:
@@ -3481,25 +3484,32 @@ static inline int i40e_tx_map(struct i40e_ring 
*tx_ring, struct sk_buff *skb,
 static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
  struct i40e_ring *xdp_ring)
 {
-   u32 size = xdp->data_end - xdp->data;
u16 i = xdp_ring->next_to_use;
struct i40e_tx_buffer *tx_bi;
struct i40e_tx_desc *tx_desc;
+   struct xdp_frame *xdpf;
dma_addr_t dma;
+   u32 size;
+
+   xdpf = convert_to_xdp_frame(xdp);
+   if (unlikely(!xdpf))
+   return I40E_XDP_CONSUMED;
+
+   size = xdpf->len;
 
if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
xdp_ring->tx_stats.tx_busy++;
return I40E_XDP_CONSUMED;
}
 
-   dma = dma_map_single(xdp_ring->dev, xdp->data, size, DMA_TO_DEVICE);
+   dma = dma_map_single(xdp_ring->dev, xdpf->data, size, DMA_TO_DEVICE);
if (dma_mapping_error(xdp_ring->dev, dma))
return I40E_XDP_CONSUMED;
 
tx_bi = &xdp_ring->tx_bi[i];
tx_bi->bytecount = size;
tx_bi->gso_segs = 1;
-   tx_bi->raw_buf = xdp->data;
+   tx_bi->xdpf = xdpf;
 
/* record length, and DMA address */
dma_unmap_len_set(tx_bi, len, size);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 3043483ec426..857b1d743c8d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -306,6 +306,7 @@ static inline unsigned int i40e_txd_use_count(unsigned int 
size)
 struct i40e_tx_buffer {
struct i40e_tx_desc *next_to_watch;
union {
+   struct xdp_frame *xdpf;
struct sk_buff *skb;
void *raw_buf;
};



[net-next V9 PATCH 10/16] mlx5: register a memory model when XDP is enabled

2018-04-03 Thread Jesper Dangaard Brouer
Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame.
This enables a different memory model, thus activating another code path
in the xdp_return_frame API.

V2: Fixed issues pointed out by Tariq.

Signed-off-by: Jesper Dangaard Brouer 
Reviewed-by: Tariq Toukan 
Acked-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 0aab3afc6885..13c1e61258a7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -512,6 +512,14 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
rq->mkey_be = c->mkey_be;
}
 
+   /* This must only be activated for order-0 pages */
+   if (rq->xdp_prog) {
+   err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+MEM_TYPE_PAGE_ORDER0, NULL);
+   if (err)
+   goto err_rq_wq_destroy;
+   }
+
for (i = 0; i < wq_sz; i++) {
struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
 



[net-next V9 PATCH 13/16] xdp: allow page_pool as an allocator type in xdp_return_frame

2018-04-03 Thread Jesper Dangaard Brouer
New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.

The registered allocator page_pool pointer is not available directly
from xdp_rxq_info, but it could be (if needed).  For now, the driver
should keep separate track of the page_pool pointer, which it should
use for RX-ring page allocation.

As suggested by Saeed, to maintain a symmetric API it is the driver's
responsibility to allocate/create and free/destroy the page_pool.
Thus, after the driver has called xdp_rxq_info_unreg(), it is the
driver's responsibility to free the page_pool, but with an RCU free
call.  This is done easily via the page_pool helper
page_pool_destroy() (which avoids touching any driver code during the
RCU callback, which could happen after the driver has been unloaded).

V8: address issues found by kbuild test robot
 - Address sparse should be static warnings
 - Allow xdp.o to be compiled without page_pool.o

V9: Remove inline from .c file, compiler knows best

Signed-off-by: Jesper Dangaard Brouer 
---
 include/net/page_pool.h |   14 +++
 include/net/xdp.h   |3 ++
 net/core/xdp.c  |   60 ++-
 3 files changed, 65 insertions(+), 12 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 1fe77db59518..c79087153148 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -117,7 +117,12 @@ void __page_pool_put_page(struct page_pool *pool,
 
 static inline void page_pool_put_page(struct page_pool *pool, struct page 
*page)
 {
+   /* When page_pool isn't compiled in, net/core/xdp.c doesn't allow
+    * registering MEM_TYPE_PAGE_POOL, but this #ifdef shields the linker.
+*/
+#ifdef CONFIG_PAGE_POOL
__page_pool_put_page(pool, page, false);
+#endif
 }
 /* Very limited use-cases allow recycle direct */
 static inline void page_pool_recycle_direct(struct page_pool *pool,
@@ -126,4 +131,13 @@ static inline void page_pool_recycle_direct(struct 
page_pool *pool,
__page_pool_put_page(pool, page, true);
 }
 
+static inline bool is_page_pool_compiled_in(void)
+{
+#ifdef CONFIG_PAGE_POOL
+   return true;
+#else
+   return false;
+#endif
+}
+
 #endif /* _NET_PAGE_POOL_H */
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 5f67c62540aa..d0ee437753dc 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -36,6 +36,7 @@
 enum xdp_mem_type {
MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */
+   MEM_TYPE_PAGE_POOL,
MEM_TYPE_MAX,
 };
 
@@ -44,6 +45,8 @@ struct xdp_mem_info {
u32 id;
 };
 
+struct page_pool;
+
 struct xdp_rxq_info {
struct net_device *dev;
u32 queue_index;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 8b2cb79b5de0..33e382afbd95 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -27,7 +28,10 @@ static struct rhashtable *mem_id_ht;
 
 struct xdp_mem_allocator {
struct xdp_mem_info mem;
-   void *allocator;
+   union {
+   void *allocator;
+   struct page_pool *page_pool;
+   };
struct rhash_head node;
struct rcu_head rcu;
 };
@@ -74,7 +78,9 @@ static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
/* Allow this ID to be reused */
ida_simple_remove(&mem_id_pool, xa->mem.id);
 
-   /* TODO: Depending on allocator type/pointer free resources */
+   /* Notice, the driver is expected to free the *allocator,
+    * e.g. page_pool, and MUST also use RCU free.
+*/
 
/* Poison memory */
xa->mem.id = 0x;
@@ -225,6 +231,17 @@ static int __mem_id_cyclic_get(gfp_t gfp)
return id;
 }
 
+static bool __is_supported_mem_type(enum xdp_mem_type type)
+{
+   if (type == MEM_TYPE_PAGE_POOL)
+   return is_page_pool_compiled_in();
+
+   if (type >= MEM_TYPE_MAX)
+   return false;
+
+   return true;
+}
+
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
   enum xdp_mem_type type, void *allocator)
 {
@@ -238,13 +255,16 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info 
*xdp_rxq,
return -EFAULT;
}
 
-   if (type >= MEM_TYPE_MAX)
-   return -EINVAL;
+   if (!__is_supported_mem_type(type))
+   return -EOPNOTSUPP;
 
xdp_rxq->mem.type = type;
 
-   if (!allocator)
+   if (!allocator) {
+   if (type == MEM_TYPE_PAGE_POOL)
+   return -EINVAL; /* Setup time check page_pool req */
return 0;
+   }
 
/* Delay init of rhashtable to save memory if feature isn't used */
if (!mem_id_init) {
@@ -290,15 +310,31 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {
-   if (mem->type == MEM_TYPE_PAGE_SHARED) {
+   struct xdp_mem_allocat

[net-next V9 PATCH 11/16] xdp: rhashtable with allocator ID to pointer mapping

2018-04-03 Thread Jesper Dangaard Brouer
Use the IDA infrastructure for getting a cyclically increasing ID
number, which is used for keeping track of each registered allocator
per RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure,
which uses a radix tree, use a dynamic rhashtable for the ID-to-pointer
lookup table, because this is faster.

The problem being solved here is that the xdp_rxq_info pointer
(stored in xdp_buff) cannot be used directly, as its guaranteed
lifetime is too short.  The info is needed on a (potentially) remote
CPU at DMA-TX completion time.  The xdp_mem_info is stored in the
xdp_frame when it gets converted from an xdp_buff, which is sufficient
for the simple page-refcnt based recycle schemes.

For more advanced allocators there is a need to store a pointer to the
registered allocator.  Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID-to-pointer map.  The removal and validity of the
allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().

It is up for debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function.  In any case,
this must happen via RCU freeing.

V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
  XDP_REDIRECT, even-though it's not strictly necessary when
  allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).

V6: Per req of Alex Duyck
- Introduce rhashtable_lookup() call in later patch

V8: Address sparse should be static warnings (from kbuild test robot)

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |9 +
 drivers/net/tun.c |6 +
 drivers/net/virtio_net.c  |7 +
 include/net/xdp.h |   14 --
 net/core/xdp.c|  223 -
 5 files changed, 241 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0bfe6cf2bf8b..f10904ec2172 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6370,7 +6370,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter 
*adapter,
struct device *dev = rx_ring->dev;
int orig_node = dev_to_node(dev);
int ring_node = -1;
-   int size;
+   int size, err;
 
size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count;
 
@@ -6407,6 +6407,13 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter 
*adapter,
 rx_ring->queue_index) < 0)
goto err;
 
+   err = xdp_rxq_info_reg_mem_model(&rx_ring->xdp_rxq,
+MEM_TYPE_PAGE_SHARED, NULL);
+   if (err) {
+   xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+   goto err;
+   }
+
rx_ring->xdp_prog = adapter->xdp_prog;
 
return 0;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 714735c6d3ff..b52d69801b2d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -847,6 +847,12 @@ static int tun_attach(struct tun_struct *tun, struct file 
*file,
   tun->dev, tfile->queue_index);
if (err < 0)
goto out;
+   err = xdp_rxq_info_reg_mem_model(&tfile->xdp_rxq,
+MEM_TYPE_PAGE_SHARED, NULL);
+   if (err < 0) {
+   xdp_rxq_info_unreg(&tfile->xdp_rxq);
+   goto out;
+   }
err = 0;
}
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f50e1ad81ad4..42d338fe9a8d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1305,6 +1305,13 @@ static int virtnet_open(struct net_device *dev)
if (err < 0)
return err;
 
+   err = xdp_rxq_info_reg_mem_model(&vi->rq[i].xdp_rxq,
+MEM_TYPE_PAGE_SHARED, NULL);
+   if (err < 0) {
+   xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
+   return err;
+   }
+
virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
}
diff --git a/include/net/xdp.h b/include/net/xdp.h
index ea3773f94f65..5f67c62540aa 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -41,6 +41,7 @@ enum xdp_mem_type {
 
 struct xdp_mem_info {
u32 type; /* enum xdp_mem_type, but known size type */
+   u32 id;
 };
 

[net-next V9 PATCH 15/16] xdp: transition into using xdp_frame for return API

2018-04-03 Thread Jesper Dangaard Brouer
Changing the API xdp_return_frame() to take struct xdp_frame as its
argument seems like a natural choice. But there are some subtle
performance details here that need extra care, which is a deliberate
choice.

De-referencing the xdp_frame on a remote CPU during DMA-TX completion
results in the cache-line changing to "Shared" state. Later, when the
page is reused for RX, this xdp_frame cache-line is written, which
changes the state to "Modified".

This situation already happens (naturally) for virtio_net, tun and
cpumap, as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-reference the xdp_frame.

Only the ixgbe driver had an optimization by which it could avoid
de-referencing the xdp_frame.  The driver already has a TX-ring queue,
which (in case of remote DMA-TX completion) has to be transferred
between CPUs anyhow.  In this data area, we stored a struct
xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing the xdp_frame.

To compensate for this, a prefetchw is used to tell the cache
coherency protocol about our access pattern.  My benchmarks show that
this prefetchw is enough to compensate for the ixgbe driver.

V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |5 ++---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h|4 +---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   17 +++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |1 +
 drivers/net/tun.c   |4 ++--
 drivers/net/virtio_net.c|2 +-
 include/net/xdp.h   |2 +-
 kernel/bpf/cpumap.c |6 +++---
 net/core/xdp.c  |4 +++-
 9 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 96c54cbfb1f9..c8bf4d35fdea 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -638,8 +638,7 @@ static void i40e_unmap_and_free_tx_resource(struct 
i40e_ring *ring,
if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB)
kfree(tx_buffer->raw_buf);
else if (ring_is_xdp(ring))
-   xdp_return_frame(tx_buffer->xdpf->data,
-&tx_buffer->xdpf->mem);
+   xdp_return_frame(tx_buffer->xdpf);
else
dev_kfree_skb_any(tx_buffer->skb);
if (dma_unmap_len(tx_buffer, len))
@@ -842,7 +841,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
/* free the skb/XDP data */
if (ring_is_xdp(tx_ring))
-   xdp_return_frame(tx_buf->xdpf->data, 
&tx_buf->xdpf->mem);
+   xdp_return_frame(tx_buf->xdpf);
else
napi_consume_skb(tx_buf->skb, napi_budget);
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index abb5248e917e..7dd5038cfcc4 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -241,8 +241,7 @@ struct ixgbe_tx_buffer {
unsigned long time_stamp;
union {
struct sk_buff *skb;
-   /* XDP uses address ptr on irq_clean */
-   void *data;
+   struct xdp_frame *xdpf;
};
unsigned int bytecount;
unsigned short gso_segs;
@@ -250,7 +249,6 @@ struct ixgbe_tx_buffer {
DEFINE_DMA_UNMAP_ADDR(dma);
DEFINE_DMA_UNMAP_LEN(len);
u32 tx_flags;
-   struct xdp_mem_info xdp_mem;
 };
 
 struct ixgbe_rx_buffer {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f10904ec2172..4f2864165723 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1216,7 +1216,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector 
*q_vector,
 
/* free the skb */
if (ring_is_xdp(tx_ring))
-   xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+   xdp_return_frame(tx_buffer->xdpf);
else
napi_consume_skb(tx_buffer->skb, napi_budget);
 
@@ -2386,6 +2386,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
xdp.data_hard_start = xdp.data -
  ixgbe_rx_offset(rx_ring);
xdp.da

[net-next V9 PATCH 14/16] mlx5: use page_pool for xdp_return_frame call

2018-04-03 Thread Jesper Dangaard Brouer
This patch shows how it is possible to have both the driver-local page
cache, which uses an elevated refcnt for "catching"/avoiding SKB
put_page from returning the page through the page allocator, and, at
the same time, have pages getting returned to the page_pool from
ndo_xdp_xmit DMA completion.

The performance improvement for XDP_REDIRECT in this patch is really
good.  Especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() do per-frame operations of both
rhashtable ID-lookup and locked return into the (page_pool) ptr_ring.
(The plan is to remove these per-frame operations in a followup
patchset.)

The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap).  The target/maximum capability
of ixgbe is 13Mpps (on this HW setup).

Before this patch for mlx5, XDP-redirected frames were returned via
the page allocator.  The single-flow performance was 6Mpps, and if I
started two flows the collective performance dropped to 4Mpps, because
we hit the page allocator lock (further negative scaling occurs).

Two test scenarios need to be covered for the xdp_return_frame API:
DMA-TX completion running on the same CPU, or cross-CPU free/return.
Results were same-CPU=10Mpps and cross-CPU=12Mpps.  This is very
close to our 13Mpps max target.

The reason the max target isn't reached in the cross-CPU test is
likely due to RX-ring DMA unmap/map overhead (which doesn't occur in
ixgbe-to-ixgbe testing).  It is also planned to remove this
unnecessary DMA unmap in a later patchset.

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes not return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.
 - Save a branch in mlx5e_page_release
 - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

V5: Updated patch desc

V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params")
V9:
 - Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication")
 - Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU")
 - Correct handling if page_pool_create fail for 
MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

Signed-off-by: Jesper Dangaard Brouer 
Reviewed-by: Tariq Toukan 
Acked-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |3 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   41 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   16 ++--
 3 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 1a05d1072c5e..3317a4da87cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,8 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+struct page_pool;
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
@@ -534,6 +536,7 @@ struct mlx5e_rq {
unsigned int   hw_mtu;
struct mlx5e_xdpsq xdpsq;
DECLARE_BITMAP(flags, 8);
+   struct page_pool  *page_pool;
 
/* control */
struct mlx5_wq_ctrlwq_ctrl;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 13c1e61258a7..d0f2cd86ef32 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "eswitch.h"
 #include "en.h"
 #include "en_tc.h"
@@ -389,10 +390,11 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
  struct mlx5e_rq_param *rqp,
  struct mlx5e_rq *rq)
 {
+   struct page_pool_params pp_params = { 0 };
struct mlx5_core_dev *mdev = c->mdev;
void *rqc = rqp->rqc;
void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
-   u32 byte_count;
+   u32 byte_count, pool_size;
int npages;
int wq_sz;
int err;
@@ -432,9 +434,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 
rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params);
+   pool_size = 1 << params->log_rq_mtu_frames;
 
switch (rq->wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+
+   pool_size = pool_size * MLX5_MPWRQ_PAGES_PER_WQE;
rq->post_wqes = mlx5e_post_rx_mpwqes;
rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
@@ -512,13 +517,31 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
rq->mkey_be = c->mkey_be;
}
 
-   /* This must only be activated for order-0 pages */
-   if (rq->xdp_prog) {
-   err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
-MEM_TYPE_PAGE_ORDER0, NULL);
-

[net-next V9 PATCH 16/16] xdp: transition into using xdp_frame for ndo_xdp_xmit

2018-04-03 Thread Jesper Dangaard Brouer
Changing the API ndo_xdp_xmit to take a struct xdp_frame instead of a
struct xdp_buff.  This brings xdp_return_frame and ndo_xdp_xmit in sync.

This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.

V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")

Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   30 ++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   21 +-
 drivers/net/tun.c |   19 ++--
 drivers/net/virtio_net.c  |   24 
 include/linux/netdevice.h |4 ++-
 net/core/filter.c |   17 +-
 7 files changed, 72 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index c8bf4d35fdea..87fb27ab9c24 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2203,9 +2203,20 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
 #define I40E_XDP_CONSUMED 1
 #define I40E_XDP_TX 2
 
-static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
+static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
  struct i40e_ring *xdp_ring);
 
+static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
+struct i40e_ring *xdp_ring)
+{
+   struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
+
+   if (unlikely(!xdpf))
+   return I40E_XDP_CONSUMED;
+
+   return i40e_xmit_xdp_ring(xdpf, xdp_ring);
+}
+
 /**
  * i40e_run_xdp - run an XDP program
  * @rx_ring: Rx ring being processed
@@ -2233,7 +2244,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring 
*rx_ring,
break;
case XDP_TX:
xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
-   result = i40e_xmit_xdp_ring(xdp, xdp_ring);
+   result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
break;
case XDP_REDIRECT:
err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
@@ -3480,21 +3491,14 @@ static inline int i40e_tx_map(struct i40e_ring 
*tx_ring, struct sk_buff *skb,
  * @xdp: data to transmit
  * @xdp_ring: XDP Tx ring
  **/
-static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
+static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
  struct i40e_ring *xdp_ring)
 {
u16 i = xdp_ring->next_to_use;
struct i40e_tx_buffer *tx_bi;
struct i40e_tx_desc *tx_desc;
-   struct xdp_frame *xdpf;
+   u32 size = xdpf->len;
dma_addr_t dma;
-   u32 size;
-
-   xdpf = convert_to_xdp_frame(xdp);
-   if (unlikely(!xdpf))
-   return I40E_XDP_CONSUMED;
-
-   size = xdpf->len;
 
if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
xdp_ring->tx_stats.tx_busy++;
@@ -3684,7 +3688,7 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, 
struct net_device *netdev)
  *
  * Returns Zero if sent, else an error code
  **/
-int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
struct i40e_netdev_priv *np = netdev_priv(dev);
unsigned int queue_index = smp_processor_id();
@@ -3697,7 +3701,7 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff 
*xdp)
if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
return -ENXIO;
 
-   err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]);
+   err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
if (err != I40E_XDP_TX)
return -ENOSPC;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 857b1d743c8d..4bf318b8be85 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -511,7 +511,7 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw);
 void i40e_detect_recover_hung(struct i40e_vsi *vsi);
 int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
-int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp);
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf);
 void i40e_xdp_flush(struct net_device *dev);
 
 /**
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 4f2864165723..0daccaf72a30 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2262,7 +2262,7 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring 
*rx_r

RE: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support

2018-04-03 Thread Razvan Stefanescu
Hello Andrew,

> -Original Message-
> From: Andrew Lunn [mailto:and...@lunn.ch]
> Sent: Monday, April 2, 2018 4:45 PM
> To: Ioana Ciornei 
> Cc: Arnd Bergmann ; gregkh
> ; Laurentiu Tudor
> ; Linux Kernel Mailing List  ker...@vger.kernel.org>; Stuart Yoder ; Ruxandra
> Ioana Ciocoi Radulescu ; Razvan Stefanescu
> ; Roy Pledge ;
> Networking 
> Subject: Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
> 
> Hi Ioana
> 
> > The commands listed above are for creating/destroying DPAA2 objects
> > in Management Complex and not for runtime configuration where
> > standard userspace tools are used.
> 
> Please can you explain why this is not just plumbing inside a
> switchdev driver?
> 
> The hardware has a number of physical ports. So on probe, i would
> expect it to create a DPMAC, DPNI, and DPIO for each port, and a linux
> netdev. From then on, standard tools are all that are needed. The
> switchdev driver can create a l2 switch object when the user uses the
> ip link add name br0 type bridge. It can then connect the switch
> object to the DPNI when the user adds an interface to the switch, etc.
> 

I'll chime in, as you mentioned the switchdev driver.

DPAA2 offers several object-based abstractions for modeling
network-related devices (interfaces, L2 Ethernet switch) or accelerators
(DPSECI - crypto and DPDCEI - compression), the latter two not upstreamed
yet. They are modeled using various low-level resources (e.g. queues,
classification tables, physical ports) and have multiple configuration
and interconnectivity options, managed by the Management Complex.
Resources are limited and only used when needed by the objects,
to accommodate more configurations and usage scenarios.
 
Some of the objects have a 1-to-1 correspondence to physical resources
(e.g. DPMACs to physical ports), while others (like DPNIs and DPSW)
can be seen as a collection of the mentioned resources. The types and 
number of such objects are not predetermined.

When the board boots up, none of them exist yet. Restool allows a user to
define the system topology, by providing a way to dynamically create, destroy
and interconnect these objects.

After an object is created, it will be presented on the fsl-mc bus. A driver
is loaded to implement the required kernel interfaces specific to that object
type. The kernel can boot first, and the DPAA2 objects are added
afterwards, as the user requires.

As you mentioned DPMACs: objects of this type can be connected only to a DPNI
(a network interface like object) or to a DPSW (L2 ethernet switch) port.
Likewise, a DPNI can have only one connection (to a DPMAC, a DPSW port or
another DPNI object).

Here are several examples of valid connection types:
  * DPMAC <> DPNI (standard network i/f corresponding to a physical port)
  * DPMAC <> DPSW (physical port in a switch)
  * DPNI <> DPSW (virtual network interface connected to a switch port)
  * DPNI <> DPNI

In the latter case, the two DPNIs will not be connected to any physical
port, but can be used as a point-to-point connection between two virtual
machines for instance.

So, it is not possible to connect a DPNI to a DPSW after it has been
connected to a DPMAC. The DPNI-DPMAC pair would have to be
disconnected and the DPMAC reconnected to the switch. A DPNI interface
that is no longer connected to a DPMAC will be destroyed, and any new
addition/deletion of a DPNI/DPMAC interface to the switch port will
trigger a re-configuration of the entire switch.

Best regards,
Razvan Stefanescu


Re: [PATCH net-next RFC 0/5] ipv6: sr: introduce seg6local End.BPF action

2018-04-03 Thread Mathieu Xhonneux
2018-03-31 1:03 GMT+02:00 Alexei Starovoitov :
>
> On Fri, Mar 23, 2018 at 10:15:59AM +, Mathieu Xhonneux wrote:
> > As of Linux 4.14, it is possible to define advanced local processing for
> > IPv6 packets with a Segment Routing Header through the seg6local LWT
> > infrastructure. This LWT implements the network programming principles
> > defined in the IETF “SRv6 Network Programming” draft.
> >
> > The implemented operations are generic, and it would be very interesting to
> > be able to implement user-specific seg6local actions, without having to
> > modify the kernel directly. To do so, this patchset adds an End.BPF action
> > to seg6local, powered by some specific Segment Routing-related helpers,
> > which provide SR functionalities that can be applied on the packet. This
> > BPF hook would then make it possible to implement specific actions at
> > native kernel speed, such as OAM features, advanced SR SDN policies, or
> > SRv6 actions like Segment Routing Header (SRH) encapsulation depending
> > on the content of the packet.
> >
> > This patchset is divided into 5 patches, whose main features are:
> >
> > - A new seg6local action End.BPF with the corresponding new BPF program
> >   type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such attached BPF program can be
> >   passed to the LWT seg6local through netlink, the same way as the LWT
> >   BPF hook operates.
> > - 3 new BPF helpers for the seg6local BPF hook, allowing a BPF program
> >   to edit/grow/shrink an SRH and apply some of the generic SRv6 actions
> >   to a packet.
> > - 1 new BPF helper for the LWT BPF IN hook, allowing a program to add an
> >   SRH through encapsulation (via IPv6 encapsulation, or inlining if the
> >   packet already contains an IPv6 header).
> >
> > As this patchset adds a new LWT BPF hook, I took into account the result of
> > the discussions when the LWT BPF infrastructure got merged. Hence, the
> > seg6local BPF hook doesn’t allow write access to skb->data directly, only
> > the SRH can be modified through specific helpers, which ensures that the
> > integrity of the packet is maintained.
> > More details are available in the related patches messages.
> >
> > The performance of this BPF hook has been assessed with the BPF JIT
> > enabled on an Intel Xeon X3440 processor with 4 cores and 8 threads
> > clocked at 2.53 GHz. No throughput losses are noted with the seg6local
> > BPF hook when the BPF program does nothing (440kpps). Adding an 8-byte
> > TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
> > drops the throughput to 410kpps, and inlining a SRH via
> > bpf_lwt_seg6_action drops the throughput to 420kpps.
> > All throughputs are stable.
> >
> > Any comments on the patchset are welcome.
>
> I've looked through the patches and everything looks very good.
> Feel free to resubmit without RFC tag.
Thanks, I will do this as soon as net-next opens.

>
> In patch 2 I was a bit concerned that:
> +   struct seg6_bpf_srh_state *srh_state = (struct seg6_bpf_srh_state *)
> +  &skb->cb;
> would not collide with other users of skb->cb, but it seems the way
> the hook is placed such usage should always be valid.
> Would be good to add a comment describing the situation.
Yes, it's indeed a little hack, but this should be OK since the IPv6 layer does
not use the cb field. Another solution would be to create a new field in
__sk_buff, but that would be more cumbersome.
I will add a comment.

>
> Looks like somewhat odd 'End.BPF' name comes from similar names in SRv6 draft.
> Do you plan to disclose such End.BPF action in the draft as well?
This is something I've discussed with David Lebrun (the author of the Segment
Routing implementation). There's no plan to disclose an End.BPF action as-is
in the draft, since eBPF is really specific to Linux, and David doesn't mind not
having a 1:1 mapping between the actions of the draft and the implemented
ones. Writing "End.BPF" instead of just "bpf" is important to indicate that the
action will advance to the next segment by itself, like all other End actions.
One could imagine later adding a T.BPF action (a transit action), whose SID
wouldn't have to be a segment, but which could still e.g. add/edit/delete TLVs.

>
> Thanks
>


Re: possible deadlock in skb_queue_tail

2018-04-03 Thread Dmitry Vyukov
On Tue, Apr 3, 2018 at 11:50 AM, Kirill Tkhai  wrote:
> On 02.04.2018 12:20, syzbot wrote:
>> Hello,
>>
>> syzbot hit the following crash on net-next commit
>> 06b19fe9a6df7aaa423cd8404ebe5ac9ec4b2960 (Sun Apr 1 03:37:33 2018 +)
>> Merge branch 'chelsio-inline-tls'
>> syzbot dashboard link: 
>> https://syzkaller.appspot.com/bug?extid=6b495100f17ca8554ab9
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
>> Raw console output: 
>> https://syzkaller.appspot.com/x/log.txt?id=6218830443446272
>> Kernel config: https://syzkaller.appspot.com/x/.config?id=3327544840960562528
>> compiler: gcc (GCC) 7.1.1 20170620
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+6b495100f17ca8554...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for details.
>> If you forward the report, please keep this part and the footer.
>>
>>
>> ==
>> WARNING: possible circular locking dependency detected
>> 4.16.0-rc6+ #290 Not tainted
>> --
>> syz-executor7/20971 is trying to acquire lock:
>>  (&af_unix_sk_receive_queue_lock_key){+.+.}, at: [<271ef0d8>] 
>> skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>>
>> but task is already holding lock:
>>  (&(&u->lock)->rlock/1){+.+.}, at: [<4e725e14>] 
>> unix_state_double_lock+0x7b/0xb0 net/unix/af_unix.c:1088
>>
>> which lock already depends on the new lock.
>>
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #1 (&(&u->lock)->rlock/1){+.+.}:
>>_raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
>>sk_diag_dump_icons net/unix/diag.c:82 [inline]
>>sk_diag_fill.isra.4+0xa52/0xfe0 net/unix/diag.c:144
>>sk_diag_dump net/unix/diag.c:178 [inline]
>>unix_diag_dump+0x400/0x4f0 net/unix/diag.c:206
>>netlink_dump+0x492/0xcf0 net/netlink/af_netlink.c:2221
>>__netlink_dump_start+0x4ec/0x710 net/netlink/af_netlink.c:2318
>>netlink_dump_start include/linux/netlink.h:214 [inline]
>>unix_diag_handler_dump+0x3e7/0x750 net/unix/diag.c:307
>>__sock_diag_cmd net/core/sock_diag.c:230 [inline]
>>sock_diag_rcv_msg+0x204/0x360 net/core/sock_diag.c:261
>>netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2443
>>sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
>>netlink_unicast_kernel net/netlink/af_netlink.c:1307 [inline]
>>netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1333
>>netlink_sendmsg+0xa4a/0xe80 net/netlink/af_netlink.c:1896
>>sock_sendmsg_nosec net/socket.c:629 [inline]
>>sock_sendmsg+0xca/0x110 net/socket.c:639
>>sock_write_iter+0x31a/0x5d0 net/socket.c:908
>>call_write_iter include/linux/fs.h:1782 [inline]
>>new_sync_write fs/read_write.c:469 [inline]
>>__vfs_write+0x684/0x970 fs/read_write.c:482
>>vfs_write+0x189/0x510 fs/read_write.c:544
>>SYSC_write fs/read_write.c:589 [inline]
>>SyS_write+0xef/0x220 fs/read_write.c:581
>>do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>
>> -> #0 (&af_unix_sk_receive_queue_lock_key){+.+.}:
>>lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
>>__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>_raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
>>skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>>unix_dgram_sendmsg+0xa30/0x1610 net/unix/af_unix.c:1807
>>sock_sendmsg_nosec net/socket.c:629 [inline]
>>sock_sendmsg+0xca/0x110 net/socket.c:639
>>___sys_sendmsg+0x320/0x8b0 net/socket.c:2047
>>__sys_sendmmsg+0x1ee/0x620 net/socket.c:2137
>>SYSC_sendmmsg net/socket.c:2168 [inline]
>>SyS_sendmmsg+0x35/0x60 net/socket.c:2163
>>do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> sk_diag_dump_icons() dumps only sockets in TCP_LISTEN state.
> TCP_LISTEN state may be assigned in only one place in net/unix/af_unix.c:
> it's unix_listen(). The function is applied to stream and seqpacket
> socket types.
>
> It can't be stream because of the second stack, and seqpacket also can't,
> as I don't think it's possible for gcc to inline unix_seqpacket_sendmsg()
> in such a way that we don't see it in the stack.
>
> So, this looks like a false positive result to me.
>
> Kirill

Do you mean that these &(&u->lock)->rlock/1 referenced in 2 stacks are
always different?

+Ingo for lockdep false positive
Do we need some kind of annotation here?


>> other info that might help us debug this:
>>
>>  Possible unsafe locking scenario:
>>
>>CPU0CPU1
>>
>>   lock(&(&u->lock)->rlock/1);
>>lock(&af_unix_

Re: WARNING: refcount bug in should_fail

2018-04-03 Thread Dmitry Vyukov
On Tue, Apr 3, 2018 at 7:20 AM, Al Viro  wrote:
> On Mon, Apr 02, 2018 at 10:59:34PM +0100, Al Viro wrote:
>
>> FWIW, I'm going through the ->kill_sb() instances, fixing that sort
>> of bugs (most of them preexisting, but I should've checked instead
>> of assuming that everything's fine).  Will push out later tonight.
>
> OK, see vfs.git#for-linus.  Caught: 4 old bugs (allocation failure
> in fill_super oopses ->kill_sb() in hypfs, jffs2 and orangefs resp.
> and double-dput in late failure exit in rpc_fill_super())
> and 5 regressions from register_shrinker() failure recovery.

Nice!


Re: [PATCH net-next 02/12] clk: sunxi-ng: r40: export a regmap to access the GMAC register

2018-04-03 Thread Maxime Ripard
On Tue, Apr 03, 2018 at 05:58:05PM +0800, Chen-Yu Tsai wrote:
> On Tue, Apr 3, 2018 at 5:54 PM, Icenowy Zheng  wrote:
> >
> >
> > On April 3, 2018 at 5:53:08 PM GMT+08:00, Chen-Yu Tsai wrote:
> >>On Tue, Apr 3, 2018 at 5:50 PM, Maxime Ripard
> >> wrote:
> >>> On Tue, Apr 03, 2018 at 11:48:45AM +0200, Maxime Ripard wrote:
>  On Tue, Mar 20, 2018 at 03:15:02PM +0800, Chen-Yu Tsai wrote:
>  > On Mon, Mar 19, 2018 at 5:31 AM, Maxime Ripard
>  >  wrote:
>  > > On Sat, Mar 17, 2018 at 05:28:47PM +0800, Chen-Yu Tsai wrote:
>  > >> From: Icenowy Zheng 
>  > >>
>  > >> There's a GMAC configuration register, which exists on A64/A83T/H3/H5 in
>  > >> the syscon part, in the CCU of R40 SoC.
>  > >>
>  > >> Export a regmap of the CCU.
>  > >>
>  > >> Read access is not restricted to all registers, but only the GMAC
>  > >> register is allowed to be written.
>  > >>
>  > >> Signed-off-by: Icenowy Zheng 
>  > >> Signed-off-by: Chen-Yu Tsai 
>  > >
>  > > Gah, this is crazy. I'm really starting to regret letting that syscon
>  > > in in the first place...
>  >
>  > IMHO syscon is really a better fit. It's part of the glue layer and
>  > most other dwmac user platforms treat it as such and use a syscon.
>  > Plus the controls encompass delays (phase), inverters (polarity),
>  > and even signal routing. It's not really just a group of clock controls,
>  > like what we poorly modeled for A20/A31. I think that was really a
>  > mistake.
>  >
>  > As I mentioned in the cover letter, a slightly saner approach would
>  > be to let drivers add custom syscon entries, which would then require
>  > less custom plumbing.
> 
>  A syscon is convenient, sure, but it also bypasses any abstraction
>  layer we have everywhere else, which means that we'll have to maintain
>  the register layout in each and every driver that uses it.
> 
>  So far, it's only been the GMAC, but it can also be others (the SRAM
>  controller comes to my mind), and then, if there's any difference in
>  the design in a future SoC, we'll have to maintain that in the GMAC
>  driver as well.
> >>>
> >>> I guess I forgot to say something, I'm fine with using a syscon we
> >>> already have.
> >>>
> >>> I'm just questionning if merging any other driver using one is the
> >>> right move.
> >>
> >>Right. So in this case, we are not actually going through the syscon
> >>API. Rather we are exporting a regmap whose properties we actually
> >>define. If it makes you more acceptable to it, we could map just
> >>the GMAC register in the new regmap, and also have it named. This
> >>is all plumbing within the kernel so the device tree stays the same.
> >
> > I think my driver has already restricted the write permission
> > only to GMAC register.
> 
> Correct, but it still maps the entire region out, which means the
> consumer needs to know which offset to use. Maxime is saying this
> is something that is troublesome to maintain. So my proposal was
> to create a regmap with a base at the GMAC register offset. That
> way, the consumer doesn't need to use an offset to access it.

I guess this is something we can keep in mind if it gets out of
control, yes.

Maxime

-- 
Maxime Ripard, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com


signature.asc
Description: PGP signature


Re: possible deadlock in skb_queue_tail

2018-04-03 Thread Kirill Tkhai
On 03.04.2018 14:25, Dmitry Vyukov wrote:
> On Tue, Apr 3, 2018 at 11:50 AM, Kirill Tkhai  wrote:
>> On 02.04.2018 12:20, syzbot wrote:
>>> Hello,
>>>
>>> syzbot hit the following crash on net-next commit
>>> 06b19fe9a6df7aaa423cd8404ebe5ac9ec4b2960 (Sun Apr 1 03:37:33 2018 +)
>>> Merge branch 'chelsio-inline-tls'
>>> syzbot dashboard link: 
>>> https://syzkaller.appspot.com/bug?extid=6b495100f17ca8554ab9
>>>
>>> Unfortunately, I don't have any reproducer for this crash yet.
>>> Raw console output: 
>>> https://syzkaller.appspot.com/x/log.txt?id=6218830443446272
>>> Kernel config: 
>>> https://syzkaller.appspot.com/x/.config?id=3327544840960562528
>>> compiler: gcc (GCC) 7.1.1 20170620
>>>
>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>> Reported-by: syzbot+6b495100f17ca8554...@syzkaller.appspotmail.com
>>> It will help syzbot understand when the bug is fixed. See footer for 
>>> details.
>>> If you forward the report, please keep this part and the footer.
>>>
>>>
>>> ==
>>> WARNING: possible circular locking dependency detected
>>> 4.16.0-rc6+ #290 Not tainted
>>> --
>>> syz-executor7/20971 is trying to acquire lock:
>>>  (&af_unix_sk_receive_queue_lock_key){+.+.}, at: [<271ef0d8>] 
>>> skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>>>
>>> but task is already holding lock:
>>>  (&(&u->lock)->rlock/1){+.+.}, at: [<4e725e14>] 
>>> unix_state_double_lock+0x7b/0xb0 net/unix/af_unix.c:1088
>>>
>>> which lock already depends on the new lock.
>>>
>>>
>>> the existing dependency chain (in reverse order) is:
>>>
>>> -> #1 (&(&u->lock)->rlock/1){+.+.}:
>>>_raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
>>>sk_diag_dump_icons net/unix/diag.c:82 [inline]
>>>sk_diag_fill.isra.4+0xa52/0xfe0 net/unix/diag.c:144
>>>sk_diag_dump net/unix/diag.c:178 [inline]
>>>unix_diag_dump+0x400/0x4f0 net/unix/diag.c:206
>>>netlink_dump+0x492/0xcf0 net/netlink/af_netlink.c:2221
>>>__netlink_dump_start+0x4ec/0x710 net/netlink/af_netlink.c:2318
>>>netlink_dump_start include/linux/netlink.h:214 [inline]
>>>unix_diag_handler_dump+0x3e7/0x750 net/unix/diag.c:307
>>>__sock_diag_cmd net/core/sock_diag.c:230 [inline]
>>>sock_diag_rcv_msg+0x204/0x360 net/core/sock_diag.c:261
>>>netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2443
>>>sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
>>>netlink_unicast_kernel net/netlink/af_netlink.c:1307 [inline]
>>>netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1333
>>>netlink_sendmsg+0xa4a/0xe80 net/netlink/af_netlink.c:1896
>>>sock_sendmsg_nosec net/socket.c:629 [inline]
>>>sock_sendmsg+0xca/0x110 net/socket.c:639
>>>sock_write_iter+0x31a/0x5d0 net/socket.c:908
>>>call_write_iter include/linux/fs.h:1782 [inline]
>>>new_sync_write fs/read_write.c:469 [inline]
>>>__vfs_write+0x684/0x970 fs/read_write.c:482
>>>vfs_write+0x189/0x510 fs/read_write.c:544
>>>SYSC_write fs/read_write.c:589 [inline]
>>>SyS_write+0xef/0x220 fs/read_write.c:581
>>>do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>>entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>
>>> -> #0 (&af_unix_sk_receive_queue_lock_key){+.+.}:
>>>lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
>>>__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>>_raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
>>>skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>>>unix_dgram_sendmsg+0xa30/0x1610 net/unix/af_unix.c:1807
>>>sock_sendmsg_nosec net/socket.c:629 [inline]
>>>sock_sendmsg+0xca/0x110 net/socket.c:639
>>>___sys_sendmsg+0x320/0x8b0 net/socket.c:2047
>>>__sys_sendmmsg+0x1ee/0x620 net/socket.c:2137
>>>SYSC_sendmmsg net/socket.c:2168 [inline]
>>>SyS_sendmmsg+0x35/0x60 net/socket.c:2163
>>>do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>>entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>
>> sk_diag_dump_icons() dumps only sockets in TCP_LISTEN state.
>> TCP_LISTEN state may be assigned in only one place in net/unix/af_unix.c:
>> it's unix_listen(). The function is applied to stream and seqpacket
>> socket types.
>>
>> It can't be stream because of the second stack, and seqpacket also can't,
>> as I don't think it's possible for gcc to inline unix_seqpacket_sendmsg()
>> in such a way that we don't see it in the stack.
>>
>> So, this looks like a false positive result to me.
>>
>> Kirill
> 
> Do you mean that these &(&u->lock)->rlock/1 referenced in 2 stacks are
> always different?

In these 2 particular stacks they have to be different.

But we may meet other stacks, where stream or seqpacket
functions are used instead of unix_dgram_sendmsg(), and

[PATCH net v6 1/4] ipv6: add a wrapper for ip6_dst_store() with flowi6 checks

2018-04-03 Thread Alexey Kodanev
Move the commonly used ip6_dst_store() call pattern to a separate
function - ip6_sk_dst_store_flow(), which will check the addresses
for equality using the flow information, before saving them.

There are no functional changes in this patch. In addition, it will
be used in the next patch, in ip6_sk_dst_lookup_flow().

Signed-off-by: Alexey Kodanev 
---
 include/net/ip6_route.h |  3 +++
 net/ipv6/datagram.c |  9 +
 net/ipv6/route.c| 17 +
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index ac0866b..abec280 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -210,6 +210,9 @@ static inline void ip6_dst_store(struct sock *sk, struct dst_entry *dst,
 #endif
 }
 
+void ip6_sk_dst_store_flow(struct sock *sk, struct dst_entry *dst,
+  const struct flowi6 *fl6);
+
 static inline bool ipv6_unicast_destination(const struct sk_buff *skb)
 {
struct rt6_info *rt = (struct rt6_info *) skb_dst(skb);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index a9f7eca..8f6a391 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -106,14 +106,7 @@ int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr)
}
}
 
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ?
- &sk->sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(&fl6.saddr, &np->saddr) ?
- &np->saddr :
-#endif
- NULL);
+   ip6_sk_dst_store_flow(sk, dst, &fl6);
 
 out:
fl6_sock_release(flowlabel);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b0d5c64..b14008e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2153,6 +2153,23 @@ void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
 }
 EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);
 
+void ip6_sk_dst_store_flow(struct sock *sk, struct dst_entry *dst,
+  const struct flowi6 *fl6)
+{
+#ifdef CONFIG_IPV6_SUBTREES
+   struct ipv6_pinfo *np = inet6_sk(sk);
+#endif
+
+   ip6_dst_store(sk, dst,
+ ipv6_addr_equal(&fl6->daddr, &sk->sk_v6_daddr) ?
+ &sk->sk_v6_daddr : NULL,
+#ifdef CONFIG_IPV6_SUBTREES
+ ipv6_addr_equal(&fl6->saddr, &np->saddr) ?
+ &np->saddr :
+#endif
+ NULL);
+}
+
 /* Handle redirects */
 struct ip6rd_flowi {
struct flowi6 fl6;
-- 
1.8.3.1



[PATCH net v6 4/4] ipv6: udp: set dst cache for a connected sk if current not valid

2018-04-03 Thread Alexey Kodanev
A new RTF_CACHE route can be created between the ip6_sk_dst_lookup_flow()
and ip6_dst_store() calls in udpv6_sendmsg(), when datagram sending
results in an ICMPV6_PKT_TOOBIG error:

udp_v6_send_skb(), for example with vti6 tunnel:
  vti6_xmit(), gets ICMPV6_PKT_TOOBIG error
    skb_dst_update_pmtu(), can create a RTF_CACHE clone
    icmpv6_send()
...
udpv6_err()
  ip6_sk_update_pmtu()
    ip6_update_pmtu(), can create a RTF_CACHE clone
    ...
    ip6_datagram_dst_update()
      ip6_dst_store()

And after commit 33c162a980fe ("ipv6: datagram: Update dst cache of
a connected datagram sk during pmtu update"), the UDPv6 error handler
can update the socket's dst cache, but this can happen before the update
at the end of udpv6_sendmsg(), preventing the new dst cache from being
used on subsequent udpv6_sendmsg() calls.

In order to fix it, save dst in a connected socket only if the current
socket's dst cache is invalid.

The previous patch prepared ip6_sk_dst_lookup_flow() to do that with
the new argument, and this patch enables it in udpv6_sendmsg().

Fixes: 33c162a980fe ("ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update")
Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Alexey Kodanev 
---
 net/ipv6/udp.c | 21 ++---
 1 file changed, 2 insertions(+), 19 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 4aa50ea..9b74092 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1289,7 +1289,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, &fl6, final_p, false);
+   dst = ip6_sk_dst_lookup_flow(sk, &fl6, final_p, connected);
if (IS_ERR(dst)) {
err = PTR_ERR(dst);
dst = NULL;
@@ -1314,7 +1314,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb))
err = udp_v6_send_skb(skb, &fl6);
-   goto release_dst;
+   goto out;
}
 
lock_sock(sk);
@@ -1348,23 +1348,6 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
err = np->recverr ? net_xmit_errno(err) : 0;
release_sock(sk);
 
-release_dst:
-   if (dst) {
-   if (connected) {
-   ip6_dst_store(sk, dst,
- ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ?
- &sk->sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
- ipv6_addr_equal(&fl6.saddr, &np->saddr) ?
- &np->saddr :
-#endif
- NULL);
-   } else {
-   dst_release(dst);
-   }
-   dst = NULL;
-   }
-
 out:
dst_release(dst);
fl6_sock_release(flowlabel);
-- 
1.8.3.1



[PATCH net v6 0/4] ipv6: udp: set dst cache for a connected sk if current not valid

2018-04-03 Thread Alexey Kodanev
A new RTF_CACHE route can be created, with the socket's dst cache
updated, between the calls below in udpv6_sendmsg(), when datagram
sending results in an ICMPV6_PKT_TOOBIG error:

    dst = ip6_sk_dst_lookup_flow(...)
    ...
release_dst:
    if (dst) {
        if (connected) {
            ip6_dst_store(sk, dst)

Therefore, the socket's new dst cache is reset to the old one at
"release_dst:".

The first three patches prepare the code to store dst cache
with ip6_sk_dst_lookup_flow():

  * the first patch adds the ip6_sk_dst_store_flow() function with
    the commonly used source and destination address checks using
    the flow information.

  * the second patch adds a new argument to ip6_sk_dst_lookup_flow()
and ability to store dst in the socket's cache. Also, the two
users of the function are updated without enabling the new
behavior: pingv6_sendmsg() and udpv6_sendmsg().

  * the third patch makes the 'connected' variable in udpv6_sendmsg()
    consistent with ip6_sk_dst_store_flow() by changing its type
    from int to bool.

The last patch contains the actual fix: it removes the sk dst cache
update at the end of udpv6_sendmsg() and allows ip6_sk_dst_lookup_flow()
to do it instead.

v6: * use bool type for the new parameter in ip6_sk_dst_lookup_flow()
* add one more patch to convert the 'connected' variable in
  udpv6_sendmsg() from int to bool type. If it shouldn't be
  here, I will resend it when net-next opens.

v5: * relocate ip6_sk_dst_store_flow() to net/ipv6/route.c and
  rename ip6_dst_store_flow() to ip6_sk_dst_store_flow() as
  suggested by Martin

v4: * fix the error in the build of ip_dst_store_flow() reported by
  kbuild test robot due to missing checks for CONFIG_IPV6: add
  new function to ip6_output.c instead of ip6_route.h
* add 'const' to struct flowi6 in ip6_dst_store_flow()
* minor commit messages fixes

v3: * instead of moving ip6_dst_store() above udp_v6_send_skb(),
  update socket's dst cache inside ip6_sk_dst_lookup_flow()
  if the current one is invalid
* the issue not reproduced in 4.1, but starting from 4.2. Add
  one more 'Fixes:' commit that creates new RTF_CACHE route.
  Though, it is also mentioned in the first one

Alexey Kodanev (4):
  ipv6: add a wrapper for ip6_dst_store() with flowi6 checks
  ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()
  ipv6: udp: convert 'connected' to bool type in udpv6_sendmsg()
  ipv6: udp: set dst cache for a connected sk if current not valid

 include/net/ip6_route.h |  3 +++
 include/net/ipv6.h  |  3 ++-
 net/ipv6/datagram.c |  9 +
 net/ipv6/ip6_output.c   | 15 ---
 net/ipv6/ping.c |  2 +-
 net/ipv6/route.c| 17 +
 net/ipv6/udp.c  | 31 +++
 7 files changed, 43 insertions(+), 37 deletions(-)

-- 
1.8.3.1



[PATCH net v6 2/4] ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()

2018-04-03 Thread Alexey Kodanev
Add 'connected' parameter to ip6_sk_dst_lookup_flow() and update
the cache only if ip6_sk_dst_check() returns NULL and the socket
is connected.

The function is used as before; the new behavior for UDP sockets
in udpv6_sendmsg() will be enabled in the next patch.

Signed-off-by: Alexey Kodanev 
---
 include/net/ipv6.h|  3 ++-
 net/ipv6/ip6_output.c | 15 ---
 net/ipv6/ping.c   |  2 +-
 net/ipv6/udp.c|  2 +-
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8606c91..1d416f2 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -977,7 +977,8 @@ int ip6_dst_lookup(struct net *net, struct sock *sk, struct dst_entry **dst,
 struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 *fl6,
  const struct in6_addr *final_dst);
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-const struct in6_addr *final_dst);
+const struct in6_addr *final_dst,
+bool connected);
 struct dst_entry *ip6_blackhole_route(struct net *net,
  struct dst_entry *orig_dst);
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a8a9195..46ea7b6 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1105,23 +1105,32 @@ struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 *fl6,
  * @sk: socket which provides the dst cache and route info
  * @fl6: flow to lookup
  * @final_dst: final destination address for ipsec lookup
+ * @connected: whether @sk is connected or not
  *
  * This function performs a route lookup on the given flow with the
  * possibility of using the cached route in the socket if it is valid.
  * It will take the socket dst lock when operating on the dst cache.
  * As a result, this function can only be used in process context.
  *
+ * In addition, for a connected socket, cache the dst in the socket
+ * if the current cache is not valid.
+ *
  * It returns a valid dst pointer on success, or a pointer encoded
  * error code.
  */
 struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
-const struct in6_addr *final_dst)
+const struct in6_addr *final_dst,
+bool connected)
 {
struct dst_entry *dst = sk_dst_check(sk, inet6_sk(sk)->dst_cookie);
 
dst = ip6_sk_dst_check(sk, dst, fl6);
-   if (!dst)
-   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+   if (dst)
+   return dst;
+
+   dst = ip6_dst_lookup_flow(sk, fl6, final_dst);
+   if (connected && !IS_ERR(dst))
+   ip6_sk_dst_store_flow(sk, dst_clone(dst), fl6);
 
return dst;
 }
diff --git a/net/ipv6/ping.c b/net/ipv6/ping.c
index d12c55d..746eeae 100644
--- a/net/ipv6/ping.c
+++ b/net/ipv6/ping.c
@@ -121,7 +121,7 @@ static int ping_v6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
ipc6.tclass = np->tclass;
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, &fl6,  daddr);
+   dst = ip6_sk_dst_lookup_flow(sk, &fl6, daddr, false);
if (IS_ERR(dst))
return PTR_ERR(dst);
rt = (struct rt6_info *) dst;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52e3ea0..5bc102b2 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1289,7 +1289,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
 
-   dst = ip6_sk_dst_lookup_flow(sk, &fl6, final_p);
+   dst = ip6_sk_dst_lookup_flow(sk, &fl6, final_p, false);
if (IS_ERR(dst)) {
err = PTR_ERR(dst);
dst = NULL;
-- 
1.8.3.1



[PATCH net v6 3/4] ipv6: udp: convert 'connected' to bool type in udpv6_sendmsg()

2018-04-03 Thread Alexey Kodanev
This should make it consistent with ip6_sk_dst_lookup_flow(),
which now accepts the new 'connected' parameter of type bool.

Signed-off-by: Alexey Kodanev 
---
 net/ipv6/udp.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 5bc102b2..4aa50ea 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1097,10 +1097,10 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
struct dst_entry *dst;
struct ipcm6_cookie ipc6;
int addr_len = msg->msg_namelen;
+   bool connected = false;
int ulen = len;
int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
int err;
-   int connected = 0;
int is_udplite = IS_UDPLITE(sk);
int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
struct sockcm_cookie sockc;
@@ -1222,7 +1222,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
fl6.fl6_dport = inet->inet_dport;
daddr = &sk->sk_v6_daddr;
fl6.flowlabel = np->flow_label;
-   connected = 1;
+   connected = true;
}
 
if (!fl6.flowi6_oif)
@@ -1252,7 +1252,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
}
if (!(opt->opt_nflen|opt->opt_flen))
opt = NULL;
-   connected = 0;
+   connected = false;
}
if (!opt) {
opt = txopt_get(np);
@@ -1274,11 +1274,11 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
final_p = fl6_update_dst(&fl6, opt, &final);
if (final_p)
-   connected = 0;
+   connected = false;
 
if (!fl6.flowi6_oif && ipv6_addr_is_multicast(&fl6.daddr)) {
fl6.flowi6_oif = np->mcast_oif;
-   connected = 0;
+   connected = false;
} else if (!fl6.flowi6_oif)
fl6.flowi6_oif = np->ucast_oif;
 
-- 
1.8.3.1



Re: [PATCH] vhost-net: add limitation of sent packets for tx polling

2018-04-03 Thread Michael S. Tsirkin
On Tue, Apr 03, 2018 at 08:08:26AM +, haibinzhang(张海斌) wrote:
> handle_tx will delay rx for a long time when tx busy polling udp packets
> with small lengths (e.g. 1-byte udp payload), because the VHOST_NET_WEIGHT
> limit takes into account only sent bytes, not the number of packets sent.
> 
> Tests were done between two Virtual Machines using netperf(UDP_STREAM, len=1),
> then another machine pinged the client. Results are as follows:
> 
> Packet#   Ping-Latency(ms)
>   min avg max
> Origin  3.319  18.489  57.503
> 64  1.643   2.021   2.552
> 128 1.825   2.600   3.224
> 256 1.997   2.710   4.295
> 512*1.860   3.171   4.631
> 10242.002   4.173   9.056
> 20482.257   5.650   9.688
> 40962.093   8.508  15.943
> 
> 512 is selected, which is a multiple of VRING_SIZE

There's no guarantee vring size is 256.

Could you pls try with a different tx ring size?

I suspect we want:

#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)


> and close to VHOST_NET_WEIGHT/MTU.

Puzzled by this part.  Does tweaking MTU change anything?

> To evaluate this change, other tests were done using netperf(RR, TX) between
> two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz. The results below
> do not show obvious changes:
> 
> TCP_RR
> 
> size/sessions/+thu%/+normalize%
>1/   1/  -7%/-2%
>1/   4/  +1%/ 0%
>1/   8/  +1%/-2%
>   64/   1/  -6%/ 0%
>   64/   4/   0%/+2%
>   64/   8/   0%/ 0%
>  256/   1/  -3%/-4%
>  256/   4/  +3%/+4%
>  256/   8/  +2%/ 0%
> 
> UDP_RR
> 
> size/sessions/+thu%/+normalize%
>1/   1/  -5%/+1%
>1/   4/  +4%/+1%
>1/   8/  -1%/-1%
>   64/   1/  -2%/-3%
>   64/   4/  -5%/-1%
>   64/   8/   0%/-1%
>  256/   1/  +7%/+1%
>  256/   4/  +1%/+1%
>  256/   8/  +2%/+2%
> 
> TCP_STREAM
> 
> size/sessions/+thu%/+normalize%
>   64/   1/   0%/-3%
>   64/   4/  +3%/-1%
>   64/   8/  +9%/-4%
>  256/   1/  +1%/-4%
>  256/   4/  -1%/-1%
>  256/   8/  +7%/+5%
>  512/   1/  +1%/ 0%
>  512/   4/  +1%/-1%
>  512/   8/  +7%/-5%
> 1024/   1/   0%/-1%
> 1024/   4/  +3%/ 0%
> 1024/   8/  +8%/+5%
> 2048/   1/  +2%/+2%
> 2048/   4/  +1%/ 0%
> 2048/   8/  -2%/ 0%
> 4096/   1/  -2%/ 0%
> 4096/   4/  +2%/ 0%
> 4096/   8/  +9%/-2%
> 
> Signed-off-by: Haibin Zhang 
> Signed-off-by: Yunfang Tai 
> Signed-off-by: Lidong Chen 
> ---
>  drivers/vhost/net.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 8139bc70ad7d..13a23f3f3ea4 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
>   * Using this limit prevents one virtqueue from starving others. */
>  #define VHOST_NET_WEIGHT 0x80000
>  
> +/* Max number of packets transferred before requeueing the job.
> + * Using this limit prevents one virtqueue from starving rx. */
> +#define VHOST_NET_PKT_WEIGHT 512
> +
>  /* MAX number of TX used buffers for outstanding zerocopy */
>  #define VHOST_MAX_PEND 128
>  #define VHOST_GOODCOPY_LEN 256
> @@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
>   struct socket *sock;
>   struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>   bool zcopy, zcopy_used;
> + int sent_pkts = 0;
>  
>   mutex_lock(&vq->mutex);
>   sock = vq->private_data;
> @@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
>   else
>   vhost_zerocopy_signal_used(net, vq);
>   vhost_net_tx_packet(net);
> - if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> + if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> + unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
>   vhost_poll_queue(&vq->poll);
>   break;
>   }
> -- 
> 2.12.3
> 


[PATCH] kernel/bpf/syscall: fix warning defined but not used

2018-04-03 Thread Anders Roxell
There will be a build warning -Wunused-function if CONFIG_CGROUP_BPF
isn't defined, since the only user is inside #ifdef CONFIG_CGROUP_BPF:
kernel/bpf/syscall.c:1229:12: warning: ‘bpf_prog_attach_check_attach_type’
defined but not used [-Wunused-function]
 static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
^

Move the function bpf_prog_attach_check_attach_type inside the
#ifdef CONFIG_CGROUP_BPF block.

Fixes: 5e43f899b03a ("bpf: Check attach type at prog load time")
Signed-off-by: Anders Roxell 
---
 kernel/bpf/syscall.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7457f2676c6d..56f49557adda 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1226,18 +1226,6 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
}
 }
 
-static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
-enum bpf_attach_type attach_type)
-{
-   switch (prog->type) {
-   case BPF_PROG_TYPE_CGROUP_SOCK:
-   case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
-   return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
-   default:
-   return 0;
-   }
-}
-
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_PROG_LOAD_LAST_FIELD expected_attach_type
 
@@ -1465,6 +1453,18 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr)
 
 #ifdef CONFIG_CGROUP_BPF
 
+static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
+enum bpf_attach_type attach_type)
+{
+   switch (prog->type) {
+   case BPF_PROG_TYPE_CGROUP_SOCK:
+   case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
+   return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
+   default:
+   return 0;
+   }
+}
+
 #define BPF_PROG_ATTACH_LAST_FIELD attach_flags
 
 static int sockmap_get_from_fd(const union bpf_attr *attr,
-- 
2.16.2



Re: [RFC PATCH 3/3] virtio_net: make lower netdevs for virtio_bypass hidden

2018-04-03 Thread Michael S. Tsirkin
On Sun, Apr 01, 2018 at 05:13:10AM -0400, Si-Wei Liu wrote:
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index aa40664..0827b7e 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -80,6 +80,8 @@ struct virtio_net_config {
>   __u16 max_virtqueue_pairs;
>   /* Default maximum transmit unit advice */
>   __u16 mtu;
> + /* Device at bus:slot.function backed up by virtio_net */
> + __u16 bsf2backup;
>  } __attribute__((packed));

I'm not sure this is a good interface.  This isn't unique even on some
PCI systems, not to speak of non-PCI ones.

>  /*
> -- 
> 1.8.3.1


Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device

2018-04-03 Thread Michael S. Tsirkin
On Sun, Apr 01, 2018 at 05:13:08AM -0400, Si-Wei Liu wrote:
> @@ -896,6 +898,68 @@ void qmp_device_del(const char *id, Error **errp)
>  }
>  }
>  
> +int pci_get_busdevfn_by_id(const char *id, uint16_t *busnr,
> +   uint16_t *devfn, Error **errp)
> +{
> +uint16_t busnum = 0, slot = 0, func = 0;
> +const char *pc, *pd, *pe;
> +Error *local_err = NULL;
> +ObjectClass *class;
> +char value[1024];
> +BusState *bus;
> +uint64_t u64;
> +
> +if (!(pc = strchr(id, ':'))) {
> +error_setg(errp, "Invalid id: backup=%s, "
> +   "correct format should be backup="
> +   "'[bus]:slot[.function]'", id);
> +return -1;
> +}
> +get_opt_name(value, sizeof(value), id, ':');
> +if (pc != id + 1) {
> +bus = qbus_find(value, errp);
> +if (!bus)
> +return -1;
> +
> +class = object_get_class(OBJECT(bus));
> +if (class != object_class_by_name(TYPE_PCI_BUS) &&
> +class != object_class_by_name(TYPE_PCIE_BUS)) {
> +error_setg(errp, "%s is not a device on pci bus", id);
> +return -1;
> +}
> +busnum = (uint16_t)pci_bus_num(PCI_BUS(bus));
> +}

pci_bus_num is almost always a bug if not done within
a context of a PCI host, bridge, etc.

In particular this will not DTRT if run before guest assigns bus
numbers.


> +
> +if (!devfn)
> +goto out;
> +
> +pd = strchr(pc, '.');
> +pe = get_opt_name(value, sizeof(value), pc + 1, '.');
> +if (pe != pc + 1) {
> +parse_option_number("slot", value, &u64, &local_err);
> +if (local_err) {
> +error_propagate(errp, local_err);
> +return -1;
> +}
> +slot = (uint16_t)u64;
> +}
> +if (pd && *(pd + 1) != '\0') {
> +parse_option_number("function", pd, &u64, &local_err);
> +if (local_err) {
> +error_propagate(errp, local_err);
> +return -1;
> +}
> +func = (uint16_t)u64;
> +}
> +
> +out:
> +if (busnr)
> +*busnr = busnum;
> +if (devfn)
> +*devfn = ((slot & 0x1F) << 3) | (func & 0x7);
> +return 0;
> +}
> +
>  BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
>  {
>  DeviceState *dev;
> -- 
> 1.8.3.1


Re: [PATCH] vhost-net: add limitation of sent packets for tx polling

2018-04-03 Thread 张海斌

>On Tue, Apr 03, 2018 at 08:08:26AM +, haibinzhang wrote:
>> handle_tx will delay rx for a long time when tx busy polling udp packets
>> with small length(e.g. 1byte udp payload), because setting VHOST_NET_WEIGHT
>> takes into account only sent-bytes but no single packet length.
>> 
>> Tests were done between two Virtual Machines using netperf(UDP_STREAM, 
>> len=1),
>> then another machine pinged the client. Result shows as follow:
>> 
>> Packet#   Ping-Latency(ms)
>>   min avg max
>> Origin  3.319  18.489  57.503
>> 64  1.643   2.021   2.552
>> 128 1.825   2.600   3.224
>> 256 1.997   2.710   4.295
>> 512*1.860   3.171   4.631
>> 10242.002   4.173   9.056
>> 20482.257   5.650   9.688
>> 40962.093   8.508  15.943
>> 
>> 512 is selected, which is a multiple of the default VRING_SIZE
>
>There's no guarantee vring size is 256.
>
>Could you pls try with a different tx ring size?
>
>I suspect we want:
>
>#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
>
>
>> and close to VHOST_NET_WEIGHT/MTU.
>
>Puzzled by this part.  Does tweaking MTU change anything?

The MTU of ethernet is 1500, so VHOST_NET_WEIGHT/MTU equals 0x80000/1500 ≈ 350.
Sent bytes therefore cannot reach VHOST_NET_WEIGHT in one handle_tx, even with
1500-byte frames, if the packet count is less than 350. So the packet limit must
be bigger than 350. 512 meets this condition and is also a multiple of the
default VRING_SIZE.

>
>> To evaluate this change, another tests were done using netperf(RR, TX) 
>> between
>> two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz. Result as follow
>> does not show obvious changes:
>> 
>> TCP_RR
>> 
>> size/sessions/+thu%/+normalize%
>>1/   1/  -7%/-2%
>>1/   4/  +1%/ 0%
>>1/   8/  +1%/-2%
>>   64/   1/  -6%/ 0%
>>   64/   4/   0%/+2%
>>   64/   8/   0%/ 0%
>>  256/   1/  -3%/-4%
>>  256/   4/  +3%/+4%
>>  256/   8/  +2%/ 0%
>> 
>> UDP_RR
>> 
>> size/sessions/+thu%/+normalize%
>>1/   1/  -5%/+1%
>>1/   4/  +4%/+1%
>>1/   8/  -1%/-1%
>>   64/   1/  -2%/-3%
>>   64/   4/  -5%/-1%
>>   64/   8/   0%/-1%
>>  256/   1/  +7%/+1%
>>  256/   4/  +1%/+1%
>>  256/   8/  +2%/+2%
>> 
>> TCP_STREAM
>> 
>> size/sessions/+thu%/+normalize%
>>   64/   1/   0%/-3%
>>   64/   4/  +3%/-1%
>>   64/   8/  +9%/-4%
>>  256/   1/  +1%/-4%
>>  256/   4/  -1%/-1%
>>  256/   8/  +7%/+5%
>>  512/   1/  +1%/ 0%
>>  512/   4/  +1%/-1%
>>  512/   8/  +7%/-5%
>> 1024/   1/   0%/-1%
>> 1024/   4/  +3%/ 0%
>> 1024/   8/  +8%/+5%
>> 2048/   1/  +2%/+2%
>> 2048/   4/  +1%/ 0%
>> 2048/   8/  -2%/ 0%
>> 4096/   1/  -2%/ 0%
>> 4096/   4/  +2%/ 0%
>> 4096/   8/  +9%/-2%
>> 
>> Signed-off-by: Haibin Zhang 
>> Signed-off-by: Yunfang Tai 
>> Signed-off-by: Lidong Chen 
>> ---
>>  drivers/vhost/net.c | 8 +++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 8139bc70ad7d..13a23f3f3ea4 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy 
>> TX;"
>>   * Using this limit prevents one virtqueue from starving others. */
> >>  #define VHOST_NET_WEIGHT 0x80000
>>  
>> +/* Max number of packets transferred before requeueing the job.
>> + * Using this limit prevents one virtqueue from starving rx. */
>> +#define VHOST_NET_PKT_WEIGHT 512
>> +
>>  /* MAX number of TX used buffers for outstanding zerocopy */
>>  #define VHOST_MAX_PEND 128
>>  #define VHOST_GOODCOPY_LEN 256
>> @@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
>>  struct socket *sock;
>>  struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>>  bool zcopy, zcopy_used;
>> +int sent_pkts = 0;
>>  
>>  mutex_lock(&vq->mutex);
>>  sock = vq->private_data;
>> @@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
>>  else
>>  vhost_zerocopy_signal_used(net, vq);
>>  vhost_net_tx_packet(net);
>> -if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>> +if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
>> +unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT)) {
>>  vhost_poll_queue(&vq->poll);
>>  break;
>>  }
>> -- 
>> 2.12.3
>> 



Re: WARNING in add_uevent_var

2018-04-03 Thread Johannes Berg
On Sun, 2018-04-01 at 23:01 -0700, syzbot wrote:

> So far this crash happened 5 times on net-next, upstream.
> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6614377067184128
> 

Huh, fun. Looks like you're basically creating a new HWSIM radio with an
insanely long name (4k!) and nothing stops you, until we try to generate
an rfkill instance which sends a uevent and only has a 2k buffer for the
environment variables, where we put the name ...

But yeah, we should probably limit the phy name to something sane, I'll
pick 128 and send a patch.

johannes


[RFC PATCH net] tcp: allow to use TCP Fastopen with MSG_ZEROCOPY

2018-04-03 Thread Alexey Kodanev
With TCP Fastopen we can have the following cases, which could also
use MSG_ZEROCOPY flag with send() and sendto():

* sendto() + MSG_FASTOPEN flag, sk state can be in TCP_CLOSE at
  the start of tcp_sendmsg()

* set socket option TCP_FASTOPEN_CONNECT, then connect()
  and send(), sk state in TCP_SYN_SENT

Currently, both cases with tcp_sendmsg() and the MSG_ZEROCOPY flag result
in an EINVAL error, because of the check for the TCP_ESTABLISHED sk state
at the beginning of tcp_sendmsg().

Both conditions require two more checks there: !tp->fastopen_connect
and !(flags & MSG_FASTOPEN). It looks like we could instead remove the
original check altogether for this unlikely event. That way tcp_sendmsg()
without TFO should fail with EPIPE in sk_stream_wait_connect(), as it did
before the introduction of MSG_ZEROCOPY, and work smoothly for the TFO
cases.

Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Alexey Kodanev 
---

Is there something that I've overlooked that prevents using it here, meaning
we should instead handle this type of error in userspace when using
sendto() + TFO?

 net/ipv4/tcp.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9225610..768f02c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1193,11 +1193,6 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
flags = msg->msg_flags;
 
if (flags & MSG_ZEROCOPY && size) {
-   if (sk->sk_state != TCP_ESTABLISHED) {
-   err = -EINVAL;
-   goto out_err;
-   }
-
skb = tcp_write_queue_tail(sk);
uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
if (!uarg) {
-- 
1.8.3.1



Re: [PATCH net-next v2 1/2] fs/crashdd: add API to collect hardware dump in second kernel

2018-04-03 Thread Andrew Lunn
On Tue, Apr 03, 2018 at 08:43:27AM +0300, Alex Vesker wrote:
> 
> 
> On 4/2/2018 12:12 PM, Jiri Pirko wrote:
> >Fri, Mar 30, 2018 at 05:11:29PM CEST, and...@lunn.ch wrote:
> >>>Please see:
> >>>http://patchwork.ozlabs.org/project/netdev/list/?series=36524
> >>>
> >>>I believe that the solution in the patchset could be used for
> >>>your usecase too.
> >>Hi Jiri
> >>
> >>https://lkml.org/lkml/2018/3/20/436
> >>
> >>How well does this API work for a 2Gbyte snapshot?
> >Ccing Alex who did the tests.
> 
> I didn't check the performance for such a large snapshot.
> From my measurements it takes 0.09s for 1 MB of data, which for
> 2 GB means about ~3 minutes.

I was not really thinking about performance. More about how well does
the system work when you ask the kernel for 2GB of RAM to put a
snapshot into? And given your current design, you need another 2GB
buffer for the driver to use before calling this new API.

So i'm asking, how well does this API scale?

I think you need to remove the need for a second buffer in the
driver. Either the driver allocates the buffer and hands it over, or
your core code allocates the buffer and gives it to the driver to
fill. Maybe look at what makes most sense for the crash dump code?

  Andrew


Re: linux-next: build failure after merge of the tip tree

2018-04-03 Thread David Howells
Peter Zijlstra  wrote:

> I figured that since there were only a handful of users it wasn't a
> popular API, also David very much knew of those patches changing it so
> could easily have pulled in the special tip/sched/wait branch :/

I'm not sure I could, since I have to base on net-next.  I'm not sure what
DaveM's policy on that is.

Also, it might've been better not to simply erase the atomic_t wait API
immediately, but substitute wrappers for it to be removed one iteration hence.

David


Re: linux-next: build failure after merge of the tip tree

2018-04-03 Thread David Howells
Stephen Rothwell  wrote:

> + wait_var_event(&rxnet->nr_calls, !atomic_read(&rxnet->nr_calls));

I would prefer == 0 to ! as it's not really a true/false value.

But apart from that, it's looks okay and you can add my Reviewed-by.

David


Re: linux-next: build failure after merge of the tip tree

2018-04-03 Thread Peter Zijlstra
On Tue, Apr 03, 2018 at 01:39:08PM +0100, David Howells wrote:
> Peter Zijlstra  wrote:
> 
> > I figured that since there were only a handful of users it wasn't a
> > popular API, also David very much knew of those patches changing it so
> > could easily have pulled in the special tip/sched/wait branch :/
> 
> I'm not sure I could, since I have to base on net-next.  I'm not sure what
> DaveM's policy on that is.
> 
> Also, it might've been better not to simply erase the atomic_t wait API
> immediately, but substitute wrappers for it to be removed one iteration hence.

Yeah, I know, but I really wasn't expecting new users of this thing, it
seemed like quite an exotic API with very limited users.

Ah well...



Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support

2018-04-03 Thread Andrew Lunn
On Tue, Apr 03, 2018 at 11:12:52AM +, Razvan Stefanescu wrote:
> DPAA2 offers several object-based abstractions for modeling network
> related devices (interfaces, L2 Ethernet switch) or accelerators
> (DPSECI - crypto and DPDCEI - compression), the latter not up-streamed yet.
> They are modeled using various low-level resources (e.g. queues,
> classification tables, physical ports) and have multiple configuration and
> interconnectivity options, managed by the Management Complex. 
> Resources are limited and they are only used when needed by the objects,
> to accommodate more configurations and usage scenarios.
>  
> Some of the objects have a 1-to-1 correspondence to physical resources
> (e.g. DPMACs to physical ports), while others (like DPNIs and DPSW)
> can be seen as a collection of the mentioned resources. The types and 
> number of such objects are not predetermined.
> 
> When the board boots up, none of them exist yet. Restool allows a user to
> define the system topology, by providing a way to dynamically create, destroy
> and interconnect these objects.

Hi Razvan

The core concept with Linux networking and offload is that the
hardware is there to accelerate what Linux can already do. Since Linux
can already do it, I don't need any additional tools.

You have new hardware. It might offer features which we currently
don't have offload support for. But all that means is you need to
extend the core networking code which implements the software version
of that feature to offload to the hardware.

The board knows how many physical ports it has. switchdev can then
setup the plumbing to create the objects needed to represent the
ports. Restool is not needed for that.

> In the latter case, the two DPNIs will not be connected to any physical
> port, but can be used as a point-to-point connection between two virtual
> machines for instance.
 
Can Linux already do this? Isn't that what PCI Virtual Functions are
all about? You need to find the current Linux concept for this, and
extend it to offload the functionality to hardware. If Linux can do
it, it already has the tools to configure it. Restool is not needed
for that.

> So, it is not possible to connect a DPNI to a DPSW after it was
> connected to a DPMAC. The DPNI-DPMAC pair would have to be
> disconnected and DPMAC will be reconnected to the switch. DPNI
> interface that is no longer connected to a DPMAC will be destroyed
> and any new addition/deletion of a DPNI/DPMAC interface to the
> switch port will trigger the entire switch re-configuration.

Switches and ports connected to switches are dynamic. They come and
go. You don't expect it to happen very often, but Linux has no
restrictions on this. You need to figure out how best to offload this
to your hardware. Maybe when you create the switch object you make a
guess as to how many ports you need. Leave some of the ports not
connected to anything. You can then add ports to the switch using the
free ports. If you run out of ports, you have no choice but to destroy
the switch object and create a new one. Hopefully that does not take
too long. Restool is not needed for this, it all happens within the
switchdev driver.

  Andrew


[PATCH iproute2-next 1/1] tc: jsonify connmark action

2018-04-03 Thread Roman Mashak
Signed-off-by: Roman Mashak 
---
 tc/m_connmark.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/tc/m_connmark.c b/tc/m_connmark.c
index bcce41391398..45e2d05f1a91 100644
--- a/tc/m_connmark.c
+++ b/tc/m_connmark.c
@@ -114,16 +114,20 @@ static int print_connmark(struct action_util *au, FILE *f, struct rtattr *arg)
 
parse_rtattr_nested(tb, TCA_CONNMARK_MAX, arg);
if (tb[TCA_CONNMARK_PARMS] == NULL) {
-   fprintf(f, "[NULL connmark parameters]");
+   print_string(PRINT_FP, NULL, "%s", "[NULL connmark parameters]");
return -1;
}
 
ci = RTA_DATA(tb[TCA_CONNMARK_PARMS]);
 
-   fprintf(f, " connmark zone %d", ci->zone);
-   print_action_control(f, " ", ci->action, "\n");
-   fprintf(f, "\t index %u ref %d bind %d", ci->index,
-   ci->refcnt, ci->bindcnt);
+   print_string(PRINT_ANY, "kind", "%s ", "connmark");
+   print_uint(PRINT_ANY, "zone", "zone %u", ci->zone);
+   print_action_control(f, " ", ci->action, "");
+
+   print_string(PRINT_FP, NULL, "%s", _SL_);
+   print_uint(PRINT_ANY, "index", "\t index %u", ci->index);
+   print_int(PRINT_ANY, "ref", " ref %d", ci->refcnt);
+   print_int(PRINT_ANY, "bind", " bind %d", ci->bindcnt);
 
if (show_stats) {
if (tb[TCA_CONNMARK_TM]) {
@@ -132,7 +136,7 @@ static int print_connmark(struct action_util *au, FILE *f, struct rtattr *arg)
print_tm(f, tm);
}
}
-   fprintf(f, "\n");
+   print_string(PRINT_FP, NULL, "%s", _SL_);
 
return 0;
 }
-- 
2.7.4



Re: [virtio-dev] [pci PATCH v7 2/5] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices

2018-04-03 Thread Michael S. Tsirkin
On Thu, Mar 15, 2018 at 11:42:41AM -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> Hardware-realized virtio_pci devices can implement SR-IOV, so this
> patch enables its use. The device in question is an upcoming Intel
> NIC that implements both a virtio_net PF and virtio_net VFs. These
> are hardware realizations of what has up to now been a software
> interface.
> 
> The device in question has the following 4-part PCI IDs:
> 
> PF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 15fe
> VF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 05fe
> 
> The patch currently needs no check for device ID, because the callback
> will never be made for devices that do not assert the capability or
> when run on a platform incapable of SR-IOV.
> 
> One reason for this patch is because the hardware requires the
> vendor ID of a VF to be the same as the vendor ID of the PF that
> created it. So it seemed logical to simply have a fully-functioning
> virtio_net PF create the VFs. This patch makes that possible.
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Mark Rustad 
> Signed-off-by: Alexander Duyck 

I thought hard about this, and I think we need a feature
bit for this. This way the host can detect support,
and we can also change our minds later if we need
to modify the interface and manage VFs after all.

It seems PCI-specific, so non-PCI transports would disable the feature
for now.

> ---
> 
> v4: Dropped call to pci_disable_sriov in virtio_pci_remove function
> v5: Replaced call to pci_sriov_configure_unmanaged with
> pci_sriov_configure_simple
> v6: Dropped "#ifdef" checks for IOV wrapping sriov_configure definition
> v7: No code change, added Reviewed-by
> 
>  drivers/virtio/virtio_pci_common.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
> index 48d4d1cf1cb6..67a227fd7aa0 100644
> --- a/drivers/virtio/virtio_pci_common.c
> +++ b/drivers/virtio/virtio_pci_common.c
> @@ -596,6 +596,7 @@ static void virtio_pci_remove(struct pci_dev *pci_dev)
>  #ifdef CONFIG_PM_SLEEP
>   .driver.pm  = &virtio_pci_pm_ops,
>  #endif
> + .sriov_configure = pci_sriov_configure_simple,
>  };
>  
>  module_pci_driver(virtio_pci_driver);
> 
> 
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org


Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-04-03 Thread Andrew Lunn
> On Mon, Apr 02, 2018 at 08:55:27PM -0700, Richard Cochran wrote:
> On Sun, Mar 25, 2018 at 04:01:49PM -0700, Florian Fainelli wrote:
> > The best that I can think about and it still is a hack in some way, is
> > to you have your time stamping driver create a proxy mii_bus whose
> > purpose is just to hook to mdio/phy_device events (such as link changes)
> > in order to do what is necessary, or at least, this would indicate its
> > transparent nature towards the MDIO/MDC lines...
> 
> That won't work at all, AFAICT.  There is only one mii_bus per netdev,
> that is, the one attached to the phydev.


Hi Richard

Have you tried implementing it using a phandle in the phy node,
pointing to the time stamping device?

I think it makes a much better architecture.

  Andrew


Re: [virtio-dev] [pci PATCH v7 2/5] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices

2018-04-03 Thread Michael S. Tsirkin
On Fri, Mar 16, 2018 at 09:40:34AM -0700, Alexander Duyck wrote:
> On Fri, Mar 16, 2018 at 9:34 AM, Michael S. Tsirkin  wrote:
> > On Thu, Mar 15, 2018 at 11:42:41AM -0700, Alexander Duyck wrote:
> >> From: Alexander Duyck 
> >>
> >> Hardware-realized virtio_pci devices can implement SR-IOV, so this
> >> patch enables its use. The device in question is an upcoming Intel
> >> NIC that implements both a virtio_net PF and virtio_net VFs. These
> >> are hardware realizations of what has up to now been a software
> >> interface.
> >>
> >> The device in question has the following 4-part PCI IDs:
> >>
> >> PF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 15fe
> >> VF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 05fe
> >>
> >> The patch currently needs no check for device ID, because the callback
> >> will never be made for devices that do not assert the capability or
> >> when run on a platform incapable of SR-IOV.
> >>
> >> One reason for this patch is because the hardware requires the
> >> vendor ID of a VF to be the same as the vendor ID of the PF that
> >> created it. So it seemed logical to simply have a fully-functioning
> >> virtio_net PF create the VFs. This patch makes that possible.
> >>
> >> Reviewed-by: Christoph Hellwig 
> >> Signed-off-by: Mark Rustad 
> >> Signed-off-by: Alexander Duyck 
> >
> > So if and when virtio PFs can manage the VFs, then we can
> > add a feature bit for that?
> > Seems reasonable.
> 
> Yes. If nothing else you may not even need a feature bit depending on
> how things go.

OTOH if the interface is changed in an incompatible way,
old Linux will attempt to drive the new device
since there is no check.

I think we should add a feature bit right away.


> One of the reasons why Mark called out the
> subvendor/subdevice was because that might be able to be used to
> identify the specific hardware that is providing the SR-IOV feature so
> in the future if it is added to virtio itself then you could exclude
> devices like this by just limiting things based on subvendor/subdevice
> IDs.
> 
> > Also, I am guessing that hardware implementations will want
> > to add things like stong memory barriers - I guess we
> > will add new feature bits for that too down the road?
> 
> That piece I don't have visibility into at this time. Perhaps Dan
> might have more visibility into future plans on what this might need.
> 
> Thanks.
> 
> - Alex


Re: [PATCH] vhost-net: add limitation of sent packets for tx polling

2018-04-03 Thread Michael S. Tsirkin
On Tue, Apr 03, 2018 at 12:29:47PM +, haibinzhang(张海斌) wrote:
> 
> >On Tue, Apr 03, 2018 at 08:08:26AM +, haibinzhang wrote:
> >> handle_tx will delay rx for a long time when tx busy polling udp packets
> >> with small length(e.g. 1byte udp payload), because setting VHOST_NET_WEIGHT
> >> takes into account only sent-bytes but no single packet length.
> >> 
> >> Tests were done between two Virtual Machines using netperf(UDP_STREAM, 
> >> len=1),
> >> then another machine pinged the client. Result shows as follow:
> >> 
> >> Packet#   Ping-Latency(ms)
> >>   min avg max
> >> Origin  3.319  18.489  57.503
> >> 64  1.643   2.021   2.552
> >> 128 1.825   2.600   3.224
> >> 256 1.997   2.710   4.295
> >> 512*1.860   3.171   4.631
> >> 10242.002   4.173   9.056
> >> 20482.257   5.650   9.688
> >> 40962.093   8.508  15.943
> >> 
> >> 512 is selected, which is a multiple of the default VRING_SIZE
> >
> >There's no guarantee vring size is 256.
> >
> >Could you pls try with a different tx ring size?
> >
> >I suspect we want:
> >
> >#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
> >
> >
> >> and close to VHOST_NET_WEIGHT/MTU.
> >
> >Puzzled by this part.  Does tweaking MTU change anything?
> 
> The MTU of ethernet is 1500, so VHOST_NET_WEIGHT/MTU equals 0x80000/1500 ≈ 350.

We should include the 12 byte header so it's a bit lower.

> Then sent-bytes cannot reach VHOST_NET_WEIGHT in one handle_tx even with 
> 1500-bytes 
> frame if packet# is less than 350. So packet# must be bigger than 350.
> 512 meets this condition

What you seem to say is this:

imagine MTU-sized buffers. With these we stop after 350
packets. Thus adding another limit > 350 will not
slow us down.

Fair enough but won't apply with smaller packet
sizes, will it?

I still think a simpler argument carries more weight:

ring size is a hint from device about a burst size
it can tolerate. Based on benchmarks, we tweak
the limit to 2 * vq size as that seems to
perform a bit better, and is still safer
than no limit on # of packets as is done now.

but this needs testing with another ring size.
Could you try that please?


> and is also DEFAULT VRING_SIZE aligned.

Neither Linux nor virtio have a default vring size. It's a historical
construct that exists in qemu for qemu compatibility
reasons.

> >
> >> To evaluate this change, another tests were done using netperf(RR, TX) 
> >> between
> >> two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz. Result as 
> >> follow
> >> does not show obvious changes:
> >> 
> >> TCP_RR
> >> 
> >> size/sessions/+thu%/+normalize%
> >>1/   1/  -7%/-2%
> >>1/   4/  +1%/ 0%
> >>1/   8/  +1%/-2%
> >>   64/   1/  -6%/ 0%
> >>   64/   4/   0%/+2%
> >>   64/   8/   0%/ 0%
> >>  256/   1/  -3%/-4%
> >>  256/   4/  +3%/+4%
> >>  256/   8/  +2%/ 0%
> >> 
> >> UDP_RR
> >> 
> >> size/sessions/+thu%/+normalize%
> >>1/   1/  -5%/+1%
> >>1/   4/  +4%/+1%
> >>1/   8/  -1%/-1%
> >>   64/   1/  -2%/-3%
> >>   64/   4/  -5%/-1%
> >>   64/   8/   0%/-1%
> >>  256/   1/  +7%/+1%
> >>  256/   4/  +1%/+1%
> >>  256/   8/  +2%/+2%
> >> 
> >> TCP_STREAM
> >> 
> >> size/sessions/+thu%/+normalize%
> >>   64/   1/   0%/-3%
> >>   64/   4/  +3%/-1%
> >>   64/   8/  +9%/-4%
> >>  256/   1/  +1%/-4%
> >>  256/   4/  -1%/-1%
> >>  256/   8/  +7%/+5%
> >>  512/   1/  +1%/ 0%
> >>  512/   4/  +1%/-1%
> >>  512/   8/  +7%/-5%
> >> 1024/   1/   0%/-1%
> >> 1024/   4/  +3%/ 0%
> >> 1024/   8/  +8%/+5%
> >> 2048/   1/  +2%/+2%
> >> 2048/   4/  +1%/ 0%
> >> 2048/   8/  -2%/ 0%
> >> 4096/   1/  -2%/ 0%
> >> 4096/   4/  +2%/ 0%
> >> 4096/   8/  +9%/-2%
> >> 
> >> Signed-off-by: Haibin Zhang 
> >> Signed-off-by: Yunfang Tai 
> >> Signed-off-by: Lidong Chen 
> >> ---
> >>  drivers/vhost/net.c | 8 +++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >> index 8139bc70ad7d..13a23f3f3ea4 100644
> >> --- a/drivers/vhost/net.c
> >> +++ b/drivers/vhost/net.c
> >> @@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero 
> >> Copy TX;"
> >>   * Using this limit prevents one virtqueue from starving others. */
> >>  #define VHOST_NET_WEIGHT 0x80000
> >>  
> >> +/* Max number of packets transferred before requeueing the job.
> >> + * Using this limit prevents one virtqueue from starving rx. */
> >> +#define VHOST_NET_PKT_WEIGHT 512
> >> +
> >>  /* MAX number of TX used buffers for outstanding 

RE: [PATCH iproute2] rdma: Ignore unknown netlink attributes

2018-04-03 Thread Steve Wise


> -Original Message-
> From: Leon Romanovsky 
> Sent: Tuesday, April 3, 2018 2:29 AM
> To: Stephen Hemminger 
> Cc: Leon Romanovsky ; netdev
> ; RDMA mailing list  r...@vger.kernel.org>; David Ahern ; Steve Wise
> 
> Subject: [PATCH iproute2] rdma: Ignore unknown netlink attributes
> 
> From: Leon Romanovsky 
> 
> The check that rejects netlink attributes above the maximum supported
> type is too strict and may lead to backward compatibility issues for old
> applications running against a newer kernel that supports new attributes.
> 
> CC: Steve Wise 
> Fixes: 74bd75c2b68d ("rdma: Add basic infrastructure for RDMA tool")
> Signed-off-by: Leon Romanovsky 
> ---
>  rdma/utils.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/rdma/utils.c b/rdma/utils.c
> index a2e08e91..5c1e736a 100644
> --- a/rdma/utils.c
> +++ b/rdma/utils.c
> @@ -399,7 +399,8 @@ int rd_attr_cb(const struct nlattr *attr, void *data)
>   int type;
> 
>   if (mnl_attr_type_valid(attr, RDMA_NLDEV_ATTR_MAX) < 0)
> - return MNL_CB_ERROR;
> + /* We received an unknown attribute */
> + return MNL_CB_OK;
> 
>   type = mnl_attr_get_type(attr);
> 

Hey Leon,

So the resource parsing functions correctly ignore the unknown attrs and
print everything else?

Looks good.

Reviewed-by: Steve Wise 



Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call simplifications

2018-04-03 Thread Edward Cree
On 03/04/18 02:08, Alexei Starovoitov wrote:
> I like patch 3 and going to play with it.
> How did you test it ?
Just test_verifier and test_progs (the latter has a failure
 "test_bpf_obj_id:FAIL:get-prog-info(fd) err 0 errno 2 i 0 type 1(1) info_len 
80(40) jit_enabled 0 jited_prog_len 0 xlated_prog_len 72"
 but that was already there before my patch).

> Do you have processed_insn numbers for
> cilium or selftests programs before and after?
> There should be no difference, right?
That's a good check, I'll do that.

> As far as patch 1 it was very difficult to review, since several logically
> different things clamped together. So breaking it apart:
> - converting two arrays of subprog_starts and subprog_stack_depth into
>   single array of struct bpf_subprog_info is a good thing
> - tsort is interesting, but not sure it's correct. more below
> - but main change of combining subprog logic with do_check is no good.

> There will be no 'do_check() across whole program' walk.
> Combining subprog pass with do_check is going into opposite direction
> of this long term work. Divide and conquer. Combining more things into
> do_check is the opposite of this programming principle.
The main object of my change here was to change the data structures, to
 store a subprogno in each insn aux rather than using bsearch() on the
 subprog_starts.  I have now figured out the algorithm to do this in its
 own pass (previously I thought this needed a recursive walk which is why
 I wanted to roll it into do_check() - doing more than one whole-program
 recursive walk seems like a bad idea.)

> My short term plan is to split basic instruction correctness checks
> out of do_check() loop into separate pass and run it early on.
I agree with that short term plan, sounds like a good idea.
I'm still not sure I understand the long-term plan, though; since most
 insns' input registers will still need to be checked (I'm assuming
 majority of most real ebpf programs consists of computing and
 dereferencing pointers), the data flow analysis will have to end up
 doing all the same register updates current do_check() does (though
 potentially in a different order), e.g. if a function is called three
 times it will have to analyse it with three sets of input registers.
Unless you have some way of specifying function preconditions, I don't
 see how it works.  In particular something like
    char *f(char *p)
    {
    *p++ = 0;
    return p;
    }
    int main(void)
    {
    char *p = "abc"; /* represents getting a ptr from ctx or map */
    p = f(p);
    p = f(p);
    p = f(p);
    return 0;
    }
 seems as though it would be difficult to analyse in any way more
 scalable than the current full recursive walk.  Unless you somehow
 propagate register state _backwards_ as constraints when analysing a
 function...?  In any case it seems like there are likely to be things
 which current verifier accepts which require 'whole-program' analysis
 to determine the dataflow (e.g. what if there were some conditional
 branches in f(), and the preconditions on p depended on the value of
 some other arg in r2?)

> As far as tsort approach for determining max stack depth...
> It's an interesting idea, but implementation is suffering from the same
> 'combine everything into one loop' coding issue.
> I think updating total_stack_depth math should be separate from sorting,
> since the same function can be part of different stacks with different max.
> I don't see how updating global subprog_info[i].total_stack_depth can
> be correct. It has to be different for every stack and the longest
> stack is not necessarily the deepest. May be I'm missing something,
> but combined sort and stack_depth math didn't make it easy to review.
> I find existing check_max_stack_depth() algo much easier to understand.
The sort _is_ the way to compute total stack depth.  The sort order isn't
 being stored anywhere; it's being done just so that each subprog gets
 looked at after all its callers have been considered.  So when it gets
 selected as a 'frontier node', its maximum stack depth is known, and can
 thus be used to update its callees (note that we do a max_t() with each
 callee's existing total_stack_depth, thus getting the deepest stack of
 all call chains to the function).
It may help to imagine drawing the call graph and labelling each node with
 a stack depth as it is visited; sadly that's difficult to show in an email
 (or a code comment).  But I can try to explain it a bit better than
 "/* Update callee total stack depth */".

I will also try to split up patch #1 into more pieces.  I mistakenly thought
 that existing check_max_stack_depth() depended on some invariants that I was
 removing, but I guess that was only true while I had non-contiguous subprogs.

-Ed


Re: [PATCH net-next RFC 0/5] ipv6: sr: introduce seg6local End.BPF action

2018-04-03 Thread David Lebrun

On 04/03/2018 12:16 PM, Mathieu Xhonneux wrote:



In patch 2 I was a bit concerned that:
+   struct seg6_bpf_srh_state *srh_state = (struct seg6_bpf_srh_state *)
+  &skb->cb;
would not collide with other users of skb->cb, but it seems the way
the hook is placed such usage should always be valid.
Would be good to add a comment describing the situation.

Yes, it's indeed a little hack, but this should be OK since the IPv6 layer does
not use the cb field. Another solution would be to create a new field in
__sk_buff but it's more cumbersome.
I will add a comment.


Good point. The IPv6 layer *does* use the cb field through the IP6CB() 
macro. It is first filled in ipv6_rcv() for ingress packets and used, 
among others, in the input path by extension headers processing 
functions to store EH offsets.


Given that input_action_end_bpf is called in the forwarding path and
terminates with a call to dst_input(), IP6CB() will be then reset by
ipv6_rcv(), and the use of skb->cb here indeed should not collide with
other users.





Looks like somewhat odd 'End.BPF' name comes from similar names in SRv6 draft.
Do you plan to disclose such End.BPF action in the draft as well?

This is something I've discussed with David Lebrun (the author of the Segment
Routing implementation). There's no plan to disclose an End.BPF action as-is
in the draft, since eBPF is really specific to Linux, and David doesn't mind not
having a 1:1 mapping between the actions of the draft and the implemented
ones. Writing "End.BPF" instead of just "bpf" is important to indicate that the
action will advance to the next segment by itself, like all other End actions.
One could imagine adding later a T.BPF action (a transit action), whose SID
wouldn't have to be a segment, but that could still e.g. add/edit/delete TLVs.



To clarify, I don't see why we shouldn't support "experimental" features 
that are not defined in draft-6man-segment-routing-header. However, we 
could create a separate draft describing the End.BPF feature, but that's 
perhaps best left for after the ongoing draft's last call.


David


Re: [PATCH v5 03/14] PCI: Add pcie_bandwidth_capable() to compute max supported link bandwidth

2018-04-03 Thread Bjorn Helgaas
On Mon, Apr 02, 2018 at 05:30:54PM -0700, Jacob Keller wrote:
> On Mon, Apr 2, 2018 at 7:05 AM, Bjorn Helgaas  wrote:
> > +/* PCIe speed to Mb/s reduced by encoding overhead */
> > +#define PCIE_SPEED2MBS_ENC(speed) \
> > +   ((speed) == PCIE_SPEED_16_0GT ? (16000*(128/130)) : \
> > +(speed) == PCIE_SPEED_8_0GT  ?  (8000*(128/130)) : \
> > +(speed) == PCIE_SPEED_5_0GT  ?  (5000*(8/10)) : \
> > +(speed) == PCIE_SPEED_2_5GT  ?  (2500*(8/10)) : \
> > +0)
> > +
> 
> Should this be "(speed * x ) / y" instead? wouldn't they calculate
> 128/130 and truncate that to zero before multiplying by the speed? Or
> are compilers smart enough to do this the other way to avoid the
> losses?

Yep, thanks for saving me yet more embarrassment.


Re: [PATCH net-next RFC 0/5] ipv6: sr: introduce seg6local End.BPF action

2018-04-03 Thread David Lebrun

On 04/03/2018 02:40 PM, David Lebrun wrote:

On 04/03/2018 12:16 PM, Mathieu Xhonneux wrote:



In patch 2 I was a bit concerned that:
+   struct seg6_bpf_srh_state *srh_state = (struct seg6_bpf_srh_state *)
+  &skb->cb;
would not collide with other users of skb->cb, but it seems the way
the hook is placed such usage should always be valid.
Would be good to add a comment describing the situation.
Yes, it's indeed a little hack, but this should be OK since the IPv6
layer does not use the cb field. Another solution would be to create a
new field in __sk_buff but it's more cumbersome.
I will add a comment.


Good point. The IPv6 layer *does* use the cb field through the IP6CB() 
macro. It is first filled in ipv6_rcv() for ingress packets and used, 
among others, in the input path by extension headers processing 
functions to store EH offsets.


Given that input_action_end_bpf is called in the forwarding path and
terminates with a call to dst_input(), IP6CB() will be then reset by
ipv6_rcv(), and the use of skb->cb here indeed should not collide with
other users.


Actually I'm wrong here. dst_input() will call either ip6_input() or
ip6_forward(), not ipv6_rcv(). Both functions expect IP6CB() to be set,
so using skb->cb here will interfere with them.

What about saving and restoring the IPv6 CB, similarly to what TCP does 
with tcp_v6_restore_cb() ?


David


Re: [PATCH] net: improve ipv4 performances

2018-04-03 Thread Douglas Caetano dos Santos
Hi Anton, everyone,

On 04/01/18 15:31, Anton Gary Ceph wrote:
> As the Linux networking stack is growing, more and more protocols are
> added, increasing the complexity of stack itself.
> Modern processors, contrary to common belief, are very bad in branch
> prediction, so it's our task to give hints to the compiler when possible.
> 
> After a few profiling and analysis, turned out that the ethertype field
> of the packets has the following distribution:
> 
> 92.1% ETH_P_IP
>  3.2% ETH_P_ARP
>  2.7% ETH_P_8021Q
>  1.4% ETH_P_PPP_SES
>  0.6% don't know/no opinion
> 
> From a projection on statistics collected by Google about IPv6 adoption[1],
> IPv6 should peak at 25% usage at the beginning of 2030. Hence, we should
> give proper hints to the compiler about the low IPv6 usage.

My two cents on the matter:

You should not consider favoring some parts of the code to the detriment of
others just because of one use case. In your patch, you're considering one
server that handles IPv4 and IPv6 connections simultaneously, in the proportion
seen on the Internet, but you completely disregard the use case of servers that
serve, for example, only IPv6. What about those, just let them slow down?

What I think about such hints and optimizations - someone correct me if I'm
wrong - is that they should be done not with specific use cases in mind, but
according to the code flow in general. For example, it could be a good idea to
slow down ARP requests, because there is AFAIK no such server that handles
only ARP (not that I'm advocating for it, just using it as an example). But
slowing down IPv6, as Eric already said, is utter nonsense.

Again, "low IPv6 usage" doesn't mean code that is barely touched, with an 
IPv6-only server being the obvious example.

-- 
Douglas


my donation to you

2018-04-03 Thread Mrs Nelma
Hello dear, I have a donation of 4,600,000.00 euros that I would like to give
you to help the poor and orphans in your community ... Please reply for further
details to receive my donation

Greetings

Nelma Ruaan


Re: [PATCH net-next 09/11] devlink: convert occ_get op to separate registration

2018-04-03 Thread David Ahern
On 4/3/18 1:32 AM, Jiri Pirko wrote:
> Fri, Mar 30, 2018 at 04:45:50PM CEST, dsah...@gmail.com wrote:
>> On 3/29/18 2:33 PM, Ido Schimmel wrote:
>>> From: Jiri Pirko 
>>>
>>> This resolves race during initialization where the resources with
>>> ops are registered before driver and the structures used by occ_get
>>> op is initialized. So keep occ_get callbacks registered only when
>>> all structs are initialized.
>>
>> Why can't the occ_get handler look at some flag in an mlxsw struct to
>> know if the system has initialized?
>>
>> Separate registration here is awkward. You register a resource and then
>> register its op later.
> 
> The separation is exactly why this patch is made. Note that devlink
> resouce is registered by core way before the initialization is done and
> the driver is actually able to perform the op. Also consider "reload"

That's how you have chosen to code it. I hit this problem adding devlink
to netdevsim; the solution was to fix the init order.

> case, when the resource is still registered and the driver unloads and
> loads again. For that makes perfect sense to have that separated.
> Flag would just make things odd. Also, the priv could not be used in
> that case.
> 

I am not aware of any other API where you invoked the register function
at point A and then later add the operations at point B. In every API
that comes to mind the ops are part of the register.

I am sure there are options for you to fix the init order of mlxsw
without making the devlink API awkward.


Re: [PATCH net-next RFC 0/5] ipv6: sr: introduce seg6local End.BPF action

2018-04-03 Thread Eric Dumazet


On 04/03/2018 07:25 AM, David Lebrun wrote:
> 
> What about saving and restoring the IPv6 CB, similarly to what TCP does with 
> tcp_v6_restore_cb() ?

Note that TCP only moves IPCB around in skb->cb[] for cache locality gains.

Now we switched to rb-tree for out-of-order queue, these gains might be 
marginal.


Re: [net-next V9 PATCH 00/16] XDP redirect memory return API

2018-04-03 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Tue, 03 Apr 2018 13:07:36 +0200

> This is V9, but it's worth mentioning that V8 was sent against
> net-next, because i40e got XDP_REDIRECT support in-between V6, and it
> doesn't exist in bpf-next yet.  Most significant change in V8 was that
> page_pool only gets compiled into the kernel when a drivers Kconfig
> 'select' the feature.

Jesper, this series now looks good to me, however the net-next tree is
closed at this point.

Don't worry, just resubmit when net-next opens back up.

Thanks!


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-03 Thread David Ahern
On 4/3/18 1:40 AM, Siwei Liu wrote:
>> There are other use cases that want to hide a device from userspace.
> 
> Can you elaborate your case in more details? Looking at the links
> below I realize that the purpose of hiding devices in your case is
> quite different from the our migration case. Particularly, I don't

Some kernel drivers create "control" netdevs. They are not intended for
users to manipulate and doing so may actually break networking.

> like the part of elaborately allowing user to manipulate the link's
> visibility - things fall apart easily while live migration is on
> going. And, why doing additional check for invisible links in every
> for_each_netdev() and its friends. This is effectively changing
> semantics of internal APIs that exist for decades.

Read the patch again: there are 40 references to for_each_netdev and
that patch touches 2 of them -- link dumps via rtnetlink and link dumps
via ioctl.

>> one that includes an API for users to list all devices -- even ones
> 
> What kind of API you would like to query for hidden devices?
> rtnetlink? a private socket API? or something else?

There are existing, established APIs for dumping links. No new API is
needed. As suggested in the 2 patches I referenced the hidden /
invisibility cloak is an attribute of the device. When a link dump is
requested if the attribute is set, the device is skipped and not
included in the dump. However, if the user knows the device name the
GETLINK / SETLINK / DELLINK apis all work as normal. This allows the
device to be hidden from apps like snmpd, lldpd, etc, yet still usable.

> 
> For our case, the sysfs interface is what we need and is sufficient,
> since udev is the main target we'd like to support to make the naming
> of virtio_bypass consistent and compatible.

You are not hiding a device if it is visible in 1 API (/sysfs) and not
visible by another API (rtnetlink). That only creates confusion.

> 
>> hidden by default.
>>
>> https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>>
>> https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>>
>> Also, why are you suggesting that the device should still be visible via
>> /sysfs? That leads to inconsistent views of networking state - /sys
>> shows a device but a link dump does not.
> 
> See my clarifications above. I don't mind kernel-only netdevs being
> visible via sysfs, as that way we get a good trade-off between
> backwards compatibility and visibility. There's still kobject created
> there right. Bottom line is that all kernel devices and its life-cycle
> uevents are made invisible to userpace network utilities, and I think
> it simply gets to the goal of not breaking existing apps while being
> able to add new features.


Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-04-03 Thread Richard Cochran
On Tue, Apr 03, 2018 at 03:13:19PM +0200, Andrew Lunn wrote:
> Have you tried implementing it using a phandle in the phy node,
> pointing to the time stamping device?

Not yet, but I'll take this up for V2, after the merge window...

Thinking about MII, it really is a 1:1 connection between the MAC and
the PHY.  It has no representation in the current code, at least not
yet.  It is too bad about the naming of mii_bus, oh well.  While
hanging this thing off of the PHY isn't really great modeling (it
isn't a sub-device of the PHY in any sense), still this will work well
enough to enable the new functionality.

Thanks,
Richard


Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel

2018-04-03 Thread David Ahern
On 4/2/18 12:16 PM, Alexei Starovoitov wrote:
> On Mon, Apr 02, 2018 at 12:09:44PM -0600, David Ahern wrote:
>> On 4/2/18 12:03 PM, John Fastabend wrote:
>>>
>>> Can the above be a normal BPF helper that returns an
>>> ifindex? Then something roughly like this patter would
>>> work for all drivers with redirect support,
>>>
>>>
>>>  route_ifindex = ip_route_lookup(__daddr, )
>>>  if (!route_ifindex)
>>>return do_foo()
>>>  return xdp_redirect(route_ifindex);
>>>  
>>> So my suggestion is,
>>>
>>>   1. enable veth xdp (including redirect support)
>>>   2. add a helper to lookup route from routing table
>>>
>>> Alternatively you can skip step (2) and encode the routing
>>> table in BPF directly. Maybe we need a more efficient data
>>> structure but that should also work.
>>>
>>
>> That's what I have here:
>>
>> https://github.com/dsahern/linux/commit/bab42f158c0925339f7519df7fb2cde8eac33aa8
> 
> was wondering what's up with the delay and when are you going to
> submit them officially...
> The use case came up several times.
> 

I need to find time to come back to that set. As I recall there a number
of outstanding issues:

1. you and Daniel had comments about the bpf_func_proto declarations

2. Jesper had concerns about xdp redirect to any netdev. e.g., How does
the lookup know the egress netdev supports xdp? Right now you can try
and the packet is dropped if it is not supported.

3. VLAN devices. I suspect these will affect the final bpf function
prototype. It would awkward to have 1 forwarding API for non-vlan
devices and a second for vlan devices, hence the need to resolve this
before it goes in.

4. What about other stacked devices - bonds and bridges - will those
just work with the bpf helper? VRF is already handled of course. ;-)


Re: [PATCH 00/15] ARM: pxa: switch to DMA slave maps

2018-04-03 Thread Ulf Hansson
On 2 April 2018 at 16:26, Robert Jarzmik  wrote:
> Hi,
>
> This series is aimed at removing the dmaengine slave compat use, and at
> transferring knowledge of the DMA requestors into architecture code.
>
> This was discussed/advised by Arnd a couple of years back, it's almost time.
>
> The series is divided into 3 phases:
>  - phase 1 : patch 1/15 and patch 2/15
>=> this is the preparation work
>  - phase 2 : patches 3/15 .. 10/15
>=> this is the switch of all the drivers
>=> this one will require either an Ack of the maintainers or be taken by 
> them
>   once phase 1 is merged
>  - phase 3 : patches 11/15
>=> this is the last part, cleanup and removal of export of the DMA filter
>   function
>
> As this looks like a patch bomb, each maintainer expressing for his tree
> either an Ack or "I want to take this through my tree" will be spared in
> the next iterations of this series.

Perhaps an option is to send this whole series as a PR for 3.17 rc1; that
would remove some churn and make this faster/easier? Well, if you
receive the needed acks, of course.

For the mmc change:

Acked-by: Ulf Hansson 

Kind regards
Uffe

>
> Several of these changes have been tested on actual hardware, including :
>  - pxamci
>  - pxa_camera
>  - smc*
>  - ASoC and SSP
>
> Happy review.
>
> Robert Jarzmik (15):
>   dmaengine: pxa: use a dma slave map
>   ARM: pxa: add dma slave map
>   mmc: pxamci: remove the dmaengine compat need
>   media: pxa_camera: remove the dmaengine compat need
>   mtd: nand: pxa3xx: remove the dmaengine compat need
>   net: smc911x: remove the dmaengine compat need
>   net: smc91x: remove the dmaengine compat need
>   ASoC: pxa: remove the dmaengine compat need
>   net: irda: pxaficp_ir: remove the dmaengine compat need
>   ata: pata_pxa: remove the dmaengine compat need
>   dmaengine: pxa: document pxad_param
>   dmaengine: pxa: make the filter function internal
>   ARM: pxa: remove the DMA IO resources
>   ARM: pxa: change SSP devices allocation
>   ARM: pxa: change SSP DMA channels allocation
>
>  arch/arm/mach-pxa/devices.c   | 269 
> ++
>  arch/arm/mach-pxa/devices.h   |  14 +-
>  arch/arm/mach-pxa/include/mach/audio.h|  12 ++
>  arch/arm/mach-pxa/pxa25x.c|   4 +-
>  arch/arm/mach-pxa/pxa27x.c|   4 +-
>  arch/arm/mach-pxa/pxa3xx.c|   5 +-
>  arch/arm/plat-pxa/ssp.c   |  50 +-
>  drivers/ata/pata_pxa.c|  10 +-
>  drivers/dma/pxa_dma.c |  13 +-
>  drivers/media/platform/pxa_camera.c   |  22 +--
>  drivers/mmc/host/pxamci.c |  29 +---
>  drivers/mtd/nand/pxa3xx_nand.c|  10 +-
>  drivers/net/ethernet/smsc/smc911x.c   |  16 +-
>  drivers/net/ethernet/smsc/smc91x.c|  12 +-
>  drivers/net/ethernet/smsc/smc91x.h|   1 -
>  drivers/staging/irda/drivers/pxaficp_ir.c |  14 +-
>  include/linux/dma/pxa-dma.h   |  20 +--
>  include/linux/platform_data/mmp_dma.h |   4 +
>  include/linux/pxa2xx_ssp.h|   4 +-
>  sound/arm/pxa2xx-ac97.c   |  14 +-
>  sound/arm/pxa2xx-pcm-lib.c|   6 +-
>  sound/soc/pxa/pxa-ssp.c   |   5 +-
>  sound/soc/pxa/pxa2xx-ac97.c   |  32 +---
>  23 files changed, 196 insertions(+), 374 deletions(-)
>
> --
> 2.11.0
>


Socket error queue with timestamping and SOF_TIMESTAMPING_OPT_CMSG

2018-04-03 Thread Miroslav Lichvar
I came across an interesting issue with error messages in sockets with
enabled timestamping using the SOF_TIMESTAMPING_OPT_CMSG option. When
the socket is connected and there is an error (e.g. due to destination
unreachable ICMP), select() indicates there is an exception on the
socket, but recvmsg() reading from the error queue returns with EAGAIN
and the application gets stuck in an infinite loop.

Some observations:
- it happens on both AF_INET and AF_INET6 SOCK_DGRAM sockets
- enabling the IP_RECVERR option avoids getting EAGAIN
- using recvmmsg() instead of recvmsg() avoids getting EAGAIN
  (that is why I didn't notice it earlier)
- disabling TX timestamping doesn't prevent the socket from having an
  exception
- reading from the non-error queue stops the loop

Is this a bug?

It looks to me like SOF_TIMESTAMPING_OPT_CMSG implicitly, but only
partially, enables IP_RECVERR. Are applications required to use
IP_RECVERR in this case? My expectation was that without IP_RECVERR
the error queue would only have messages with transmit timestamps, and
nothing would change with reporting of real errors. Also, from the
documentation I had an impression that SOF_TIMESTAMPING_OPT_CMSG is a
no-op on AF_INET6 sockets.

-- 
Miroslav Lichvar


Re: [PATCH 02/15] ARM: pxa: add dma slave map

2018-04-03 Thread Robert Jarzmik
Arnd Bergmann  writes:

>> +   { "smc911x.0", "rx", PDMA_FILTER_PARAM(LOWEST, -1) },
>> +   { "smc911x.0", "tx", PDMA_FILTER_PARAM(LOWEST, -1) },
>> +   { "smc91x.0", "data", PDMA_FILTER_PARAM(LOWEST, -1) },
>
> This one is interesting, as you are dealing with an off-chip device,
> and the channel number is '-'1. How does this even work? Does it
> mean

This relies on pxa_dma, in which the "-1" for a requestor line means "no
requestor", or said another way, "always requesting". As a consequence, as soon
as the DMA descriptors are queued, the transfer begins, and it is implicitly
assumed that the FIFO output availability is at least as quick as the system
bus and that the DMA size perfectly fits the FIFO's available bytes.

This is what has been the underlying of DMA transfers of smc91x(x) on the PXA
platforms, where the smc91x(s) are directly wired on the system bus (the same
bus having DRAM, SRAM, IO-mapped devices).

>
>> +   /* PXA25x specific map */
>> +   { "pxa25x-ssp.0", "rx", PDMA_FILTER_PARAM(LOWEST, 13) },
>> +   { "pxa25x-ssp.0", "tx", PDMA_FILTER_PARAM(LOWEST, 14) },
>> +   { "pxa25x-nssp.1", "rx", PDMA_FILTER_PARAM(LOWEST, 15) },
>> +   { "pxa25x-nssp.1", "tx", PDMA_FILTER_PARAM(LOWEST, 16) },
>> +   { "pxa25x-nssp.2", "rx", PDMA_FILTER_PARAM(LOWEST, 23) },
>> +   { "pxa25x-nssp.2", "tx", PDMA_FILTER_PARAM(LOWEST, 24) },
>> +   { "pxa-pcm-audio", "nssp2_rx", PDMA_FILTER_PARAM(LOWEST, 15) },
>> +   { "pxa-pcm-audio", "nssp2_tx", PDMA_FILTER_PARAM(LOWEST, 16) },
>> +   { "pxa-pcm-audio", "nssp3_rx", PDMA_FILTER_PARAM(LOWEST, 23) },
>> +   { "pxa-pcm-audio", "nssp3_tx", PDMA_FILTER_PARAM(LOWEST, 24) },
>> +
>> +   /* PXA27x specific map */
>> +   { "pxa-pcm-audio", "ssp3_rx", PDMA_FILTER_PARAM(LOWEST, 66) },
>> +   { "pxa-pcm-audio", "ssp3_tx", PDMA_FILTER_PARAM(LOWEST, 67) },
>> +   { "pxa27x-camera.0", "CI_Y", PDMA_FILTER_PARAM(HIGHEST, 68) },
>> +   { "pxa27x-camera.0", "CI_U", PDMA_FILTER_PARAM(HIGHEST, 69) },
>> +   { "pxa27x-camera.0", "CI_V", PDMA_FILTER_PARAM(HIGHEST, 70) },
>> +
>> +   /* PXA3xx specific map */
>> +   { "pxa-pcm-audio", "ssp4_rx", PDMA_FILTER_PARAM(LOWEST, 2) },
>> +   { "pxa-pcm-audio", "ssp4_tx", PDMA_FILTER_PARAM(LOWEST, 3) },
>> +   { "pxa2xx-mci.1", "rx", PDMA_FILTER_PARAM(LOWEST, 93) },
>> +   { "pxa2xx-mci.1", "tx", PDMA_FILTER_PARAM(LOWEST, 94) },
>> +   { "pxa3xx-nand", "data", PDMA_FILTER_PARAM(LOWEST, 97) },
>> +   { "pxa2xx-mci.2", "rx", PDMA_FILTER_PARAM(LOWEST, 100) },
>> +   { "pxa2xx-mci.2", "tx", PDMA_FILTER_PARAM(LOWEST, 101) },
>> +};
>
> Since more than half the entries in here are chip specific, maybe it would be
> better to split that table into three and have a copy for each one in
> arch/arm/mach-pxa/pxa{25x.27x.3xx}.c?
Mmmh, today the split is :
 - 16 common entries
 - 10 pxa25x specific entries
 - 5 pxa27x specific entries
 - 7 pxa3xx specific entries
 => total of 38 lines

After the split we'll have :
 - 26 pxa25x specific entries
 - 21 pxa27x specific entries
 - 23 pxa3xx specific entries
 => total of 70 lines

That doubles the number of lines, not counting the declarations, and amending of
pxa2xx_set_dmac_info().

If you think it's worth it, what is the driving benefit behind ?

> Does that mean it's actually a memory-to-memory transfer with a device being
> on the external SRAM interface?
I'm taking this is the follow up to the "-1" question :0

Cheers.

-- 
Robert


Re: [PATCH net-next 09/11] devlink: convert occ_get op to separate registration

2018-04-03 Thread Jiri Pirko
Tue, Apr 03, 2018 at 04:33:11PM CEST, dsah...@gmail.com wrote:
>On 4/3/18 1:32 AM, Jiri Pirko wrote:
>> Fri, Mar 30, 2018 at 04:45:50PM CEST, dsah...@gmail.com wrote:
>>> On 3/29/18 2:33 PM, Ido Schimmel wrote:
 From: Jiri Pirko 

 This resolves race during initialization where the resources with
 ops are registered before driver and the structures used by occ_get
 op is initialized. So keep occ_get callbacks registered only when
 all structs are initialized.
>>>
>>> Why can't the occ_get handler look at some flag in an mlxsw struct to
>>> know if the system has initialized?
>>>
>>> Separate registration here is awkward. You register a resource and then
>>> register its op later.
>> 
>> The separation is exactly why this patch is made. Note that devlink
>> resouce is registered by core way before the initialization is done and
>> the driver is actually able to perform the op. Also consider "reload"
>
>That's how you have chose to code it. I hit this problem adding devlink
>to netdevsim; the solution was to fix the init order.

This is not about init order, at all. On reload, netdevs and internal
driver structures disappear and appear again. And in between, currently,
the op is working with memory which is freed. That's the reason for this
patch.


>
>> case, when the resource is still registered and the driver unloads and
>> loads again. For that makes perfect sense to have that separated.
>> Flag would just make things odd. Also, the priv could not be used in
>> that case.
>> 
>
>I am not aware of any other API where you invoked the register function
>at point A and then later add the operations at point B. In every API
>that comes to mind the ops are part of the register.

I think that you just don't see any similar API.


>
>I am sure there are options for you to fix the init order of mlxsw
>without making the devlink API awkward.

Again, not about init order, at all. I have no clue why you think so...



Re: [PATCH 02/15] ARM: pxa: add dma slave map

2018-04-03 Thread Arnd Bergmann
On Tue, Apr 3, 2018 at 5:18 PM, Robert Jarzmik  wrote:
> Arnd Bergmann  writes:
>
>>> +   { "smc911x.0", "rx", PDMA_FILTER_PARAM(LOWEST, -1) },
>>> +   { "smc911x.0", "tx", PDMA_FILTER_PARAM(LOWEST, -1) },
>>> +   { "smc91x.0", "data", PDMA_FILTER_PARAM(LOWEST, -1) },
>>
>> This one is interesting, as you are dealing with an off-chip device,
>> and the channel number is '-'1. How does this even work? Does it
>> mean
>
> This relies on pxa_dma, in which the "-1" for a requestor line means "no
> requestor", or said another way, "always requesting". As a consequence,
> as soon as the DMA descriptors are queued, the transfer begins, and it is
> implicitly assumed that the FIFO output availability is at least as quick
> as the system bus and that the DMA size perfectly fits the FIFO's
> available bytes.
>
> This is what has been the underlying of DMA transfers of smc91x(x) on the PXA
> platforms, where the smc91x(s) are directly wired on the system bus (the same
> bus having DRAM, SRAM, IO-mapped devices).

Ok, I looked at the driver in more detail now and found the scary parts.
So it's using the async DMA interface to do synchronous DMA in
interrupt context in order to transfer the rx data faster than a readsl()
would, correct?

It still feels odd to me that there is an entry in the slave map for
a device that does not have a request line. However, it also seems
that the entire code in those two drivers that deals with DMA is specific
to PXA anyway, so maybe it can be done differently: instead of
calling dma_request_slave_channel_compat() or dma_request_chan()
with a fake request line, how about calling dma_request_channel()
with a NULL filter function and data, and have the driver handle
the empty data case the same way as the rq=-1 case today?

>>> +   /* PXA25x specific map */
>>> +   { "pxa25x-ssp.0", "rx", PDMA_FILTER_PARAM(LOWEST, 13) },
>>> +   { "pxa25x-ssp.0", "tx", PDMA_FILTER_PARAM(LOWEST, 14) },
>>> +   { "pxa25x-nssp.1", "rx", PDMA_FILTER_PARAM(LOWEST, 15) },
>>> +   { "pxa25x-nssp.1", "tx", PDMA_FILTER_PARAM(LOWEST, 16) },
>>> +   { "pxa25x-nssp.2", "rx", PDMA_FILTER_PARAM(LOWEST, 23) },
>>> +   { "pxa25x-nssp.2", "tx", PDMA_FILTER_PARAM(LOWEST, 24) },
>>> +   { "pxa-pcm-audio", "nssp2_rx", PDMA_FILTER_PARAM(LOWEST, 15) },
>>> +   { "pxa-pcm-audio", "nssp2_tx", PDMA_FILTER_PARAM(LOWEST, 16) },
>>> +   { "pxa-pcm-audio", "nssp3_rx", PDMA_FILTER_PARAM(LOWEST, 23) },
>>> +   { "pxa-pcm-audio", "nssp3_tx", PDMA_FILTER_PARAM(LOWEST, 24) },
>>> +
>>> +   /* PXA27x specific map */
>>> +   { "pxa-pcm-audio", "ssp3_rx", PDMA_FILTER_PARAM(LOWEST, 66) },
>>> +   { "pxa-pcm-audio", "ssp3_tx", PDMA_FILTER_PARAM(LOWEST, 67) },
>>> +   { "pxa27x-camera.0", "CI_Y", PDMA_FILTER_PARAM(HIGHEST, 68) },
>>> +   { "pxa27x-camera.0", "CI_U", PDMA_FILTER_PARAM(HIGHEST, 69) },
>>> +   { "pxa27x-camera.0", "CI_V", PDMA_FILTER_PARAM(HIGHEST, 70) },
>>> +
>>> +   /* PXA3xx specific map */
>>> +   { "pxa-pcm-audio", "ssp4_rx", PDMA_FILTER_PARAM(LOWEST, 2) },
>>> +   { "pxa-pcm-audio", "ssp4_tx", PDMA_FILTER_PARAM(LOWEST, 3) },
>>> +   { "pxa2xx-mci.1", "rx", PDMA_FILTER_PARAM(LOWEST, 93) },
>>> +   { "pxa2xx-mci.1", "tx", PDMA_FILTER_PARAM(LOWEST, 94) },
>>> +   { "pxa3xx-nand", "data", PDMA_FILTER_PARAM(LOWEST, 97) },
>>> +   { "pxa2xx-mci.2", "rx", PDMA_FILTER_PARAM(LOWEST, 100) },
>>> +   { "pxa2xx-mci.2", "tx", PDMA_FILTER_PARAM(LOWEST, 101) },
>>> +};
>>
>> Since more than half the entries in here are chip specific, maybe it would be
>> better to split that table into three and have a copy for each one in
>> arch/arm/mach-pxa/pxa{25x.27x.3xx}.c?
> Mmmh, today the split is :
>  - 16 common entries
>  - 10 pxa25x specific entries
>  - 5 pxa27x specific entries
>  - 7 pxa3xx specific entries
>  => total of 38 lines
>
> After the split we'll have :
>  - 26 pxa25x specific entries
>  - 21 pxa27x specific entries
>  - 23 pxa3xx specific entries
>  => total of 70 lines
>
> That doubles the number of lines, not counting the declarations and the
> amending of pxa2xx_set_dmac_info().
>
> If you think it's worth it, what is the driving benefit behind it?

It seems a bit cleaner to only register the tables for the dma lines that
are actually present on a given chip.

>> Does that mean it's actually a memory-to-memory transfer with a device being
>> on the external SRAM interface?
> I take it this is the follow-up to the "-1" question :0

Right.

Arnd


[PATCH iproute2] ip/l2tp: remove offset and peer-offset options

2018-04-03 Thread Guillaume Nault
Ignore options "peer-offset" and "offset" when creating sessions. Keep
them when dumping sessions in order to avoid breaking external scripts.

"peer-offset" has always been a noop in iproute2. "offset" is now
ignored in Linux 4.16 (and was broken before that).

Signed-off-by: Guillaume Nault 
---
 ip/ipl2tp.c| 23 ---
 man/man8/ip-l2tp.8 | 16 
 2 files changed, 4 insertions(+), 35 deletions(-)

diff --git a/ip/ipl2tp.c b/ip/ipl2tp.c
index 750f912a..427e0249 100644
--- a/ip/ipl2tp.c
+++ b/ip/ipl2tp.c
@@ -42,8 +42,6 @@ struct l2tp_parm {
uint32_t peer_tunnel_id;
uint32_t session_id;
uint32_t peer_session_id;
-   uint32_t offset;
-   uint32_t peer_offset;
enum l2tp_encap_type encap;
uint16_t local_udp_port;
uint16_t peer_udp_port;
@@ -174,8 +172,6 @@ static int create_session(struct l2tp_parm *p)
if (p->reorder_timeout)
addattr64(&req.n, 1024, L2TP_ATTR_RECV_TIMEOUT,
  p->reorder_timeout);
-   if (p->offset)
-   addattr16(&req.n, 1024, L2TP_ATTR_OFFSET, p->offset);
if (p->cookie_len)
addattr_l(&req.n, 1024, L2TP_ATTR_COOKIE,
  p->cookie, p->cookie_len);
@@ -310,8 +306,8 @@ static void print_session(struct l2tp_data *data)
print_string(PRINT_FP, NULL, "%s", _SL_);
}
 
-   print_uint(PRINT_ANY, "offset", "  offset %u,", p->offset);
-   print_uint(PRINT_ANY, "peer_offset", " peer offset %u\n", p->peer_offset);
+   print_uint(PRINT_ANY, "offset", "  offset %u,", 0);
+   print_uint(PRINT_ANY, "peer_offset", " peer offset %u\n", 0);
 
if (p->cookie_len > 0)
print_cookie("cookie", "cookie",
@@ -362,8 +358,6 @@ static int get_response(struct nlmsghdr *n, void *arg)
p->pw_type = rta_getattr_u16(attrs[L2TP_ATTR_PW_TYPE]);
if (attrs[L2TP_ATTR_ENCAP_TYPE])
p->encap = rta_getattr_u16(attrs[L2TP_ATTR_ENCAP_TYPE]);
-   if (attrs[L2TP_ATTR_OFFSET])
-   p->offset = rta_getattr_u16(attrs[L2TP_ATTR_OFFSET]);
if (attrs[L2TP_ATTR_DATA_SEQ])
p->data_seq = rta_getattr_u16(attrs[L2TP_ATTR_DATA_SEQ]);
if (attrs[L2TP_ATTR_CONN_ID])
@@ -550,7 +544,6 @@ static void usage(void)
"  tunnel_id ID\n"
"  session_id ID peer_session_id ID\n"
"  [ cookie HEXSTR ] [ peer_cookie HEXSTR ]\n"
-   "  [ offset OFFSET ] [ peer_offset OFFSET ]\n"
"  [ seq { none | send | recv | both } ]\n"
"  [ l2spec_type L2SPEC ]\n"
"   ip l2tp del tunnel tunnel_id ID\n"
@@ -678,19 +671,11 @@ static int parse_args(int argc, char **argv, int cmd, struct l2tp_parm *p)
invarg("invalid option for udp6_csum_tx\n"
, *argv);
} else if (strcmp(*argv, "offset") == 0) {
-   __u8 uval;
-
+   fprintf(stderr, "Ignoring option \"offset\"\n");
NEXT_ARG();
-   if (get_u8(&uval, *argv, 0))
-   invarg("invalid offset\n", *argv);
-   p->offset = uval;
} else if (strcmp(*argv, "peer_offset") == 0) {
-   __u8 uval;
-
+   fprintf(stderr, "Ignoring option \"peer_offset\"\n");
NEXT_ARG();
-   if (get_u8(&uval, *argv, 0))
-   invarg("invalid offset\n", *argv);
-   p->peer_offset = uval;
} else if (strcmp(*argv, "cookie") == 0) {
int slen;
 
diff --git a/man/man8/ip-l2tp.8 b/man/man8/ip-l2tp.8
index 8ce630a6..9aba6bec 100644
--- a/man/man8/ip-l2tp.8
+++ b/man/man8/ip-l2tp.8
@@ -59,12 +59,6 @@ ip-l2tp - L2TPv3 static unmanaged tunnel configuration
 .br
 .RB "[ " seq " { " none " | " send " | " recv " | " both " } ]"
 .br
-.RB "[ " offset
-.IR OFFSET
-.RB " ] [ " peer_offset
-.IR OFFSET
-.RB " ]"
-.br
 .ti -8
 .BR "ip l2tp del tunnel"
 .B tunnel_id
@@ -285,16 +279,6 @@ Default is
 .br
 Valid values are:
 .BR none ", " send ", " recv ", " both "."
-.TP
-.BI offset " OFFSET"
-sets the byte offset from the L2TP header where user data starts in
-transmitted L2TP data packets. This is hardly ever used. If set, the
-value must match the peer_offset value used at the peer. Default is 0.
-.TP
-.BI peer_offset " OFFSET"
-sets the byte offset from the L2TP header where user data starts in
-received L2TP data packets. This is hardly ever used. If set, the
-value must match the offset value used at the peer. Default is 0.
 .SS ip l2tp del session - destroy a session
 .TP
 .BI tunnel_id " ID"
-- 
2.16.3



Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-03 Thread Jiri Pirko
Sun, Apr 01, 2018 at 06:11:29PM CEST, dsah...@gmail.com wrote:
>On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>> Hidden netdevice is not visible to userspace such that
>> typical network utilities, e.g. ip, ifconfig et al.,
>> cannot sense its existence or configure it. Internally
>> hidden netdev may associate with an upper level netdev
>> that userspace has access to. Although userspace cannot
>> manipulate the lower netdev directly, user may control
>> or configure the underlying hidden device through the
>> upper-level netdev. For identification purposes, the
>> kobject for a hidden netdev is still present in the sysfs
>> hierarchy; however, no uevent message will be generated
>> when the sysfs entry is created, modified or destroyed.
>> 
>> To that end, a separate namescope needs to be carved
>> out for IFF_HIDDEN netdevs. As of now, a netdev name that
>> starts with a colon, i.e. ':', is invalid in userspace,
>> since socket ioctls such as SIOCGIFCONF use ':' as the
>> separator for ifname. Names starting with ':' can
>> therefore rightly be used as the namescope for
>> the kernel-only IFF_HIDDEN netdevs.
>> 
>> Signed-off-by: Si-Wei Liu 
>> ---
>>  include/linux/netdevice.h   |  12 ++
>>  include/net/net_namespace.h |   2 +
>>  net/core/dev.c  | 281 
>> ++--
>>  net/core/net_namespace.c|   1 +
>>  4 files changed, 263 insertions(+), 33 deletions(-)
>> 
>
>There are other use cases that want to hide a device from userspace. I

What usecases do you have in mind?

>would prefer a better solution than playing games with name prefixes and
>one that includes an API for users to list all devices -- even ones
>hidden by default.

Netdevice hiding feels a bit scary to me. This smells like a workaround
for userspace issues. Why can't the netdevice always be visible, with
userspace knowing what it is and what it should do with it?

Once we start with hiding, there are other things related to that which
appear. Like who can see what, levels of visibility etc...


>
>https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>
>https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>
>Also, why are you suggesting that the device should still be visible via
>/sysfs? That leads to inconsistent views of networking state - /sys
>shows a device but a link dump does not.


Re: Socket error queue with timestamping and SOF_TIMESTAMPING_OPT_CMSG

2018-04-03 Thread Soheil Hassas Yeganeh
On Tue, Apr 3, 2018 at 11:19 AM Miroslav Lichvar  wrote:
>
> I came across an interesting issue with error messages in sockets with
> enabled timestamping using the SOF_TIMESTAMPING_OPT_CMSG option. When
> the socket is connected and there is an error (e.g. due to destination
> unreachable ICMP), select() indicates there is an exception on the
> socket, but recvmsg() reading from the error queue returns with EAGAIN
> and the application gets stuck in an infinite loop.
>
> Some observations:
> - it happens on both AF_INET and AF_INET6 SOCK_DGRAM sockets
> - enabling the IP_RECVERR option avoids getting EAGAIN
> - using recvmmsg() instead of recvmsg() avoids getting EAGAIN
>   (that is why I didn't notice it earlier)
> - disabling TX timestamping doesn't prevent the socket from having an
>   exception
> - reading from the non-error queue stops the loop
>
> Is this a bug?

POLLERR and select exceptions on an FD indicate that either (a) there
is a message on the error queue, or (b) sk_err is set. Because of (b),
it's not sufficient to only call recvmsg(MSG_ERRQUEUE). For TCP, we
must always read or write from the socket after POLLERR. For UDP,
getsockopt(SO_ERROR) returns the actual socket error and clears
the sk_err value.

recvmmsg had a bug: it checked sk_err when we read from the error
queue, before actually reading the error queue. That bug in recvmmsg
should be fixed after:
https://patchwork.ozlabs.org/patch/878861/

> It looks to me like SOF_TIMESTAMPING_OPT_CMSG implicitly, but only
> partially, enables IP_RECVERR. Are applications required to use
> IP_RECVERR in this case? My expectation was that without IP_RECVERR
> the error queue would only have messages with transmit timestamps, and
> nothing would change with reporting of real errors.

No, IP_RECVERR and SOF_TIMESTAMPING_OPT_CMSG are completely
orthogonal. When we have IP_RECVERR, the ICMP packet is simply added
to the error queue. Without IP_RECVERR, an ICMP error packet results
in setting the sk_err and the ICMP packet is discarded. This behavior is
completely unrelated to SOF_TIMESTAMPING_OPT_CMSG, and sk_err is
always set in response to ICMP packets regardless of transmit
timestamps.

Since you're only checking the error queue upon a socket
error/exception, you will need to set IP_RECVERR to make sure the ICMP
packet is kept on the error queue. You wouldn't need that if you also
check getsockopt(SO_ERROR).
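The drain pattern described above can be sketched in user space as follows. This is a minimal illustration, not code from the thread; `drain_socket_error` is a hypothetical helper name, and real code would additionally walk the control messages (see `<linux/errqueue.h>`) to extract the `sock_extended_err` details.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>

/* After POLLERR / a select() exception, the error may live in two places:
 * (a) as a message on the error queue (kept there only with IP_RECVERR), or
 * (b) only in sk_err, which getsockopt(SO_ERROR) reads and clears.
 * Checking just one of them can leave the socket permanently "ready". */
static int drain_socket_error(int fd)
{
    char cbuf[512];
    char data[64];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };
    int err = 0;
    socklen_t len = sizeof(err);

    /* Case (a): try to dequeue an error-queue message. */
    if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) >= 0)
        return 1;  /* drained one message; details are in the cmsgs */

    if (errno != EAGAIN && errno != EWOULDBLOCK)
        return -1;

    /* Case (b): nothing queued, so read (and clear) sk_err instead. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;
    return err;  /* 0 means no error was pending at all */
}
```

With `IP_RECVERR` set, ICMP errors arrive via case (a); without it, only case (b) fires, which matches the behavior Soheil describes.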

> Also, from the
> documentation I had an impression that SOF_TIMESTAMPING_OPT_CMSG is a
> no-op on AF_INET6 sockets.

No, it should work for both v4 and v6.

Thanks,
Soheil

> --
> Miroslav Lichvar


Re: [net-next V9 PATCH 00/16] XDP redirect memory return API

2018-04-03 Thread Jesper Dangaard Brouer
On Tue, 03 Apr 2018 10:54:27 -0400 (EDT)
David Miller  wrote:

> From: Jesper Dangaard Brouer 
> Date: Tue, 03 Apr 2018 13:07:36 +0200
> 
> > This is V9, but it's worth mentioning that V8 was sent against
> > net-next, because i40e got XDP_REDIRECT support in-between V6, and it
> > doesn't exist in bpf-next yet.  Most significant change in V8 was that
> > page_pool only gets compiled into the kernel when a drivers Kconfig
> > 'select' the feature.  
> 
> Jesper, this series now looks good to me, however the net-next tree is
> closed at this point.

I noticed, but thought that in-flight patchsets were still allowed...

> Don't worry, just resubmit when net-next opens back up.

At that point in time, should I go back to posting it against the
bpf-next git-tree again? Any preferences from Mellanox or BPF-guys?
... It has been a bit of a pain to keep track of driver changes in
net-next, and waiting for them to get merged into bpf-next.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net v6 0/4] ipv6: udp: set dst cache for a connected sk if current not valid

2018-04-03 Thread Martin KaFai Lau
On Tue, Apr 03, 2018 at 03:00:06PM +0300, Alexey Kodanev wrote:
> A new RTF_CACHE route can be created with the socket's dst cache
> update between the below calls in udpv6_sendmsg(), when datagram
> sending results to ICMPV6_PKT_TOOBIG error:
> 
>dst = ip6_sk_dst_lookup_flow(...)
>...
> release_dst:
> if (dst) {
> if (connected) {
> ip6_dst_store(sk, dst)
> 
> Therefore, the new socket's dst cache is reset to the old one on
> "release_dst:".
> 
> The first three patches prepare the code to store dst cache
> with ip6_sk_dst_lookup_flow():
> 
>   * the first patch adds ip6_sk_dst_store_flow() function with
> commonly used source and destination addresses checks using
> the flow information.
> 
>   * the second patch adds a new argument to ip6_sk_dst_lookup_flow()
> and ability to store dst in the socket's cache. Also, the two
> users of the function are updated without enabling the new
> behavior: pingv6_sendmsg() and udpv6_sendmsg().
> 
>   * the third patch makes the 'connected' variable in udpv6_sendmsg()
> consistent with ip6_sk_dst_store_flow(), changing its type
> from int to bool.
> 
> The last patch contains the actual fix that removes the sk dst cache
> update at the end of udpv6_sendmsg(), allowing it to be done in
> ip6_sk_dst_lookup_flow().
Thanks for the patches!

Acked-by: Martin KaFai Lau 

> 
> v6: * use bool type for a new parameter in ip_sk_dst_lookup_flow()
> * add one more patch to convert 'connected' variable in
>   udpv6_sendmsg() from int to bool type. If it shouldn't be
>   here I will resend it when the net-next is opened.
> 
> v5: * relocate ip6_sk_dst_store_flow() to net/ipv6/route.c and
>   rename ip6_dst_store_flow() to ip6_sk_dst_store_flow() as
>   suggested by Martin
> 
> v4: * fix the error in the build of ip_dst_store_flow() reported by
>   kbuild test robot due to missing checks for CONFIG_IPV6: add
>   new function to ip6_output.c instead of ip6_route.h
> * add 'const' to struct flowi6 in ip6_dst_store_flow()
> * minor commit messages fixes
> 
> v3: * instead of moving ip6_dst_store() above udp_v6_send_skb(),
>   update socket's dst cache inside ip6_sk_dst_lookup_flow()
>   if the current one is invalid
> * the issue not reproduced in 4.1, but starting from 4.2. Add
>   one more 'Fixes:' commit that creates new RTF_CACHE route.
>   Though, it is also mentioned in the first one
> 
> Alexey Kodanev (4):
>   ipv6: add a wrapper for ip6_dst_store() with flowi6 checks
>   ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()
>   ipv6: udp: convert 'connected' to bool type in udpv6_sendmsg()
>   ipv6: udp: set dst cache for a connected sk if current not valid
> 
>  include/net/ip6_route.h |  3 +++
>  include/net/ipv6.h  |  3 ++-
>  net/ipv6/datagram.c |  9 +
>  net/ipv6/ip6_output.c   | 15 ---
>  net/ipv6/ping.c |  2 +-
>  net/ipv6/route.c| 17 +
>  net/ipv6/udp.c  | 31 +++
>  7 files changed, 43 insertions(+), 37 deletions(-)
> 
> -- 
> 1.8.3.1
> 


Re: [PATCH net 2/2] net: systemport: Fix sparse warnings in bcm_sysport_insert_tsb()

2018-04-03 Thread Al Viro
On Mon, Apr 02, 2018 at 03:58:56PM -0700, Florian Fainelli wrote:
> skb->protocol is a __be16 which we would be calling htons() against,
> while this is not wrong per-se as it correctly results in swapping the
> value on LE hosts, this still upsets sparse. Adopt a similar pattern to
> what other drivers do and just assign ip_ver to skb->protocol, and then
> use htons() against the different constants such that the compiler can
> resolve the values at build time.

This is completely bogus.  What sparse is complaining about is htons used
to convert from big-endian to host-endian.  Which is what ntohs is for.
IOW, instead of all that crap just

-   ip_ver = htons(skb->protocol);
+   ip_ver = ntohs(skb->protocol);

and be done with that.  Same in other drivers with the same problem.



Re: [net-next V9 PATCH 00/16] XDP redirect memory return API

2018-04-03 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Tue, 3 Apr 2018 18:07:16 +0200

> On Tue, 03 Apr 2018 10:54:27 -0400 (EDT)
> David Miller  wrote:
> 
>> Don't worry, just resubmit when net-next opens back up.
> 
> At that point in time, should I go back to posting it against the
> bpf-next git-tree again? Any preferences from Mellanox or BPF-guys?

I have no personal preference, although it's probably best to go
through the bpf-next tree.

> ... It has been a bit of a pain to keep track of driver changes in
> net-next, and waiting for them to get merged into bpf-next.

I totally understand :)


Re: [PATCH net 1/2] net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum()

2018-04-03 Thread Al Viro
On Mon, Apr 02, 2018 at 03:58:55PM -0700, Florian Fainelli wrote:
> skb->protocol is a __be16 which we would be calling htons() against,
> while this is not wrong per-se as it correctly results in swapping the
> value on LE hosts, this still upsets sparse. Adopt a similar pattern to
> what other drivers do and just assign ip_ver to skb->protocol, and then
> use htons() against the different constants such that the compiler can
> resolve the values at build time.

Fuck that and fuck other drivers sharing that "pattern".  It's not hard
to remember:
host-endian to net-endian short is htons
host-endian to net-endian long is htonl
net-endian to host-endian short is ntohs
net-endian to host-endian long is ntohl

That's *it*.  No convoluted changes needed, just use the right primitive.
You are turning trivial one-liners ("conversions between host- and net-endian
are the same in both directions, so the current code works even though we
are using the wrong primitive, confusing the readers.  Use the right one")
which are absolutely self-contained and safe into much more massive changes
that are harder to follow and which are *not* self-contained at all.

Don't do it.


Re: [PATCH net 1/2] net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum()

2018-04-03 Thread David Miller
From: Al Viro 
Date: Tue, 3 Apr 2018 17:29:33 +0100

> On Mon, Apr 02, 2018 at 03:58:55PM -0700, Florian Fainelli wrote:
>> skb->protocol is a __be16 which we would be calling htons() against,
>> while this is not wrong per-se as it correctly results in swapping the
>> value on LE hosts, this still upsets sparse. Adopt a similar pattern to
>> what other drivers do and just assign ip_ver to skb->protocol, and then
>> use htons() against the different constants such that the compiler can
>> resolve the values at build time.
> 
> Fuck that and fuck other drivers sharing that "pattern".  It's not hard
> to remember:
>   host-endian to net-endian short is htons
>   host-endian to net-endian long is htonl
>   net-endian to host-endian short is ntohs
>   net-endian to host-endian long is ntohl
> 
> That's *it*.  No convoluted changes needed, just use the right primitive.
> You are turning trivial one-liners ("conversions between host- and net-endian
> are the same in both directions, so the current code works even though we
> are using the wrong primitive, confusing the readers.  Use the right one")
> which are absolutely self-contained and safe into much more massive changes
> that are harder to follow and which are *not* self-contained at all.
> 
> Don't do it.

Yes Al, however the pattern chosen here is probably cheaper on little endian:

__beXX val = x;
switch (val) {
case htons(ETH_P_FOO):
 ...
}

This way only the compiler byte swaps the constants at compile time,
no code is actually generated to do it.
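The two styles in this exchange can be shown side by side in a minimal user-space sketch (not code from the thread; `classify` is a made-up name, and the `ETH_P_*` values are copied from linux/if_ether.h). The kernel's `case htons(ETH_P_FOO):` pattern works because the kernel's htons() folds to a constant at compile time, putting the byte swap on the constant side as David describes; in portable user-space C the case labels must be plain constants, so the conversion goes on the variable side with ntohs(), which is also the semantically correct primitive per Al's point.

```c
#include <arpa/inet.h>  /* htons()/ntohs() */
#include <stdint.h>

#define ETH_P_IP   0x0800  /* values from linux/if_ether.h */
#define ETH_P_IPV6 0x86DD

/* skb->protocol is big-endian (__be16); converting it to host order is
 * ntohs(), not htons().  The two are the same operation on every current
 * architecture, which is why the wrong primitive "works", but only one
 * spelling states the intent correctly. */
static const char *classify(uint16_t wire_proto)
{
    switch (ntohs(wire_proto)) {  /* swap the variable once, compare host-order constants */
    case ETH_P_IP:   return "IPv4";
    case ETH_P_IPV6: return "IPv6";
    default:         return "other";
    }
}
```

Either way, exactly one byte swap happens per lookup; the kernel pattern merely moves it to compile time when the selector is already big-endian.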


Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel

2018-04-03 Thread John Fastabend
On 04/03/2018 08:07 AM, David Ahern wrote:
> On 4/2/18 12:16 PM, Alexei Starovoitov wrote:
>> On Mon, Apr 02, 2018 at 12:09:44PM -0600, David Ahern wrote:
>>> On 4/2/18 12:03 PM, John Fastabend wrote:

 Can the above be a normal BPF helper that returns an
 ifindex? Then something roughly like this patter would
 work for all drivers with redirect support,


  route_ifindex = ip_route_lookup(__daddr, )
  if (!route_ifindex)
return do_foo()
  return xdp_redirect(route_ifindex);
  
 So my suggestion is,

   1. enable veth xdp (including redirect support)
   2. add a helper to lookup route from routing table

 Alternatively you can skip step (2) and encode the routing
 table in BPF directly. Maybe we need a more efficient data
 structure but that should also work.

>>>
>>> That's what I have here:
>>>
>>> https://github.com/dsahern/linux/commit/bab42f158c0925339f7519df7fb2cde8eac33aa8
>>
>> was wondering what's up with the delay and when are you going to
>> submit them officially...
>> The use case came up several times.
>>
> 
> I need to find time to come back to that set. As I recall there a number
> of outstanding issues:
> 
> 1. you and Daniel had comments about the bpf_func_proto declarations
> 
> 2. Jesper had concerns about xdp redirect to any netdev. e.g., How does
> the lookup know the egress netdev supports xdp? Right now you can try
> and the packet is dropped if it is not supported.
> 

There should probably be a tracepoint there if not already. Otherwise
I think the orchestration/loader layer should be ensuring that xdp
support is sufficient. I don't think we need anything specific in the
XDP/BPF code to handle unsupported devices.

> 3. VLAN devices. I suspect these will affect the final bpf function
> prototype. It would be awkward to have one forwarding API for non-vlan
> devices and a second for vlan devices, hence the need to resolve this
> before it goes in.
> 

Interesting. Do we need stacked XDP? I could imagine having 802.1Q
simply call the lower dev's XDP xmit routine, possibly adding the 8021q
header first.

Or alternatively a new dev type could let users query things like
vlan-id from the dev rather than automatically doing the tagging. I
suspect though if you forward to a vlan device automatically adding
the tag is the correct behavior.


> 4. What about other stacked devices - bonds and bridges - will those
> just work with the bpf helper? VRF is already handled of course. ;-)
> 

So if we simply handle this like other stacked devices and call the
lower dev's xdp_xmit routine we should get reasonable behavior. For
bonds and bridges I guess some generalization is needed though because
everything at the moment is skb centric. I don't think it's necessary
in the first series though. It can be added later.

.John


Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel

2018-04-03 Thread David Miller
From: John Fastabend 
Date: Tue, 3 Apr 2018 09:41:08 -0700

> On 04/03/2018 08:07 AM, David Ahern wrote:
>> 4. What about other stacked devices - bonds and bridges - will those
>> just work with the bpf helper? VRF is already handled of course. ;-)
> 
> So if we simply handle this like other stacked devices and call the
> lower dev's xdp_xmit routine we should get reasonable behavior. For
> bonds and bridges I guess some generalization is needed though because
> everything at the moment is skb centric. I don't think it's necessary
> in the first series though. It can be added later.

I wonder if we need some feature bit gating this passthrough, just
like the various ->vlan_features and ->hw_enc_features.


Re: [PATCH net 1/2] net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum()

2018-04-03 Thread Al Viro
On Tue, Apr 03, 2018 at 12:33:05PM -0400, David Miller wrote:

> Yes Al, however the pattern choosen here is probably cheaper on little endian:
> 
>   __beXX val = x;
>   switch (val) {
>   case htons(ETH_P_FOO):
>...
>   }
> 
> This way only the compiler byte swaps the constants at compile time,
> no code is actually generated to do it.

That's not obvious, actually - depends upon how sparse the switch ends
up being.  You can easily lose more than a single byteswap insn on
a worst-case cascade of comparisons.


  1   2   >