Re: [PATCH net] ebpf: fix bpf_msg_pull_data

2018-08-30 Thread Daniel Borkmann
On 08/30/2018 02:21 AM, Tushar Dave wrote:
> On 08/29/2018 05:07 PM, Tushar Dave wrote:
>> While doing some preliminary testing it is found that bpf helper
>> bpf_msg_pull_data does not calculate the data and data_end offset
>> correctly. Fix it!
>>
>> Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
>> Signed-off-by: Tushar Dave 
>> Acked-by: Sowmini Varadhan 
>> ---
>>   net/core/filter.c | 38 +-
>>   1 file changed, 25 insertions(+), 13 deletions(-)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index c25eb36..3eeb3d6 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2285,7 +2285,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
>> *msg)
>>   BPF_CALL_4(bpf_msg_pull_data,
>>  struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags)
>>   {
>> -    unsigned int len = 0, offset = 0, copy = 0;
>> +    unsigned int len = 0, offset = 0, copy = 0, off = 0;
>>   struct scatterlist *sg = msg->sg_data;
>>   int first_sg, last_sg, i, shift;
>>   unsigned char *p, *to, *from;
>> @@ -2299,22 +2299,30 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
>> *msg)
>>   i = msg->sg_start;
>>   do {
>>   len = sg[i].length;
>> -    offset += len;
>>   if (start < offset + len)
>>   break;
>> +    offset += len;
>>   i++;
>>   if (i == MAX_SKB_FRAGS)
>>   i = 0;
>> -    } while (i != msg->sg_end);
>> +    } while (i <= msg->sg_end);

I don't think this condition is correct, msg operates as a scatterlist ring,
so sg_end may very well be < current i when there's a wrap-around in the
traversal ... more below.

>>   +    /* return error if start is out of range */
>>   if (unlikely(start >= offset + len))
>>   return -EINVAL;
>>   -    if (!msg->sg_copy[i] && bytes <= len)
>> -    goto out;
>> +    /* return error if i is last entry in sglist and end is out of range */
>> +    if (msg->sg_copy[i] && end > offset + len)
>> +    return -EINVAL;
>>     first_sg = i;
>>   +    /* if i is not last entry in sg list and end (i.e start + bytes) is
>> + * within this sg[i] then goto out and calculate data and data_end
>> + */
>> +    if (!msg->sg_copy[i] && end <= offset + len)
>> +    goto out;
>> +
>>   /* At this point we need to linearize multiple scatterlist
>>    * elements or a single shared page. Either way we need to
>>    * copy into a linear buffer exclusively owned by BPF. Then
>> @@ -2330,9 +2338,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
>> *msg)
>>   i++;
>>   if (i == MAX_SKB_FRAGS)
>>   i = 0;
>> -    if (bytes < copy)
>> +    if (end < copy)
>>   break;
>> -    } while (i != msg->sg_end);
>> +    } while (i <= msg->sg_end);
>> +
>> +    /* return error if i is last entry in sglist and end is out of range */
>> +    if (i > msg->sg_end && end > offset + copy)
>> +    return -EINVAL;
>> +
>>   last_sg = i;
>>     if (unlikely(copy < end - start))
>> @@ -2342,23 +2355,22 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
>> *msg)
>>   if (unlikely(!page))
>>   return -ENOMEM;
>>   p = page_address(page);
>> -    offset = 0;
>>     i = first_sg;
>>   do {
>>   from = sg_virt(&sg[i]);
>>   len = sg[i].length;
>> -    to = p + offset;
>> +    to = p + off;
>>     memcpy(to, from, len);
>> -    offset += len;
>> +    off += len;
>>   sg[i].length = 0;
>>   put_page(sg_page(&sg[i]));
>>     i++;
>>   if (i == MAX_SKB_FRAGS)
>>   i = 0;
>> -    } while (i != last_sg);
>> +    } while (i < last_sg);
>>     sg[first_sg].length = copy;
>>   sg_set_page(&sg[first_sg], page, copy, 0);
>> @@ -2380,7 +2392,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
>> *msg)
>>   else
>>   move_from = i + shift;
>>   -    if (move_from == msg->sg_end)
>> +    if (move_from > msg->sg_end)
>>   break;
>>     sg[i] = sg[move_from];
>> @@ -2396,7 +2408,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
>> *msg)
>>   if (msg->sg_end < 0)
>>   msg->sg_end += MAX_SKB_FRAGS;
>>   out:
>> -    msg->data = sg_virt(&sg[i]) + start - offset;
>> +    msg->data = sg_virt(&sg[first_sg]) + start - offset;
>>   msg->data_end = msg->data + bytes;
>>     return 0;
>>
> 
> Please discard this patch. I just noticed that Daniel Borkmann sent some 
> similar fixes for bpf_msg_pull_data.

Yeah I've been looking at these recently as well, please make sure you test
with the below fixes included to see if there's anything left:

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/core/filter.c?id=5b24109b0563d45094c470684c1f8cea1af269f8
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/core/filter.c?id=0e06b227c

Re: [PATCH] ieee802154: mcr20a: read out of bounds in mcr20a_set_channel()

2018-08-30 Thread Dan Carpenter
On Wed, Aug 29, 2018 at 07:50:51PM +0200, Xue Liu wrote:
> Hello Dan,
> 
> 
> On Wed, 29 Aug 2018 at 16:49, Dan Carpenter 
> wrote:
> 
> > The "channel" variable can be any u8 value.  We need to make sure we
> > don't read outside of the PLL_INT[] or PLL_FRAC[] arrays.
> >
> I think the “channel” variable can not be any u8 value. This values is
> already checked before set_channel function is called.
> https://github.com/torvalds/linux/blob/master/net/ieee802154/nl802154.c#L978

Oh...  Hm...  I should have reviewed more carefully.  This patch isn't
correct.  But there may still be a bug.  What Smatch is worried about is
the other call tree:

ieee802154_start_req() calls (struct ieee802154_mlme_ops)->start_req
  -> mac802154_mlme_start_req()
 -> mac802154_dev_set_page_channel()
-> drv_set_channel() calls local->ops->set_channel(&local->hw, page, 
channel);
   -> mcr20a_set_channel()

So maybe we could move the check from nl802154_set_channel() to
drv_set_channel() so channel is checked on both call trees.

regards,
dan carpenter



[PATCH bpf-next] xsk: include XDP meta data in AF_XDP frames

2018-08-30 Thread Björn Töpel
From: Björn Töpel 

Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
not include XDP meta data in the data buffers copied out to the user
application.

In this commit, we check if meta data is available, and if so, it is
prepended to the frame.

Signed-off-by: Björn Töpel 
---
 net/xdp/xsk.c | 37 -
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4e937cd7c17d..817e4cee1540 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr);
 
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 {
-   void *buffer;
+   void *to_buf, *from_buf;
+   u32 metalen;
u64 addr;
int err;
 
if (!xskq_peek_addr(xs->umem->fq, &addr) ||
-   len > xs->umem->chunk_size_nohr) {
+   len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
return -ENOSPC;
}
 
addr += xs->umem->headroom;
 
-   buffer = xdp_umem_get_data(xs->umem, addr);
-   memcpy(buffer, xdp->data, len);
+   if (xdp_data_meta_unsupported(xdp)) {
+   from_buf = xdp->data;
+   metalen = 0;
+   } else {
+   from_buf = xdp->data_meta;
+   metalen = xdp->data - xdp->data_meta;
+   }
+
+   to_buf = xdp_umem_get_data(xs->umem, addr);
+   memcpy(to_buf, from_buf, len + metalen);
+   addr += metalen;
err = xskq_produce_batch_desc(xs->rx, addr, len);
if (!err) {
xskq_discard_addr(xs->umem->fq);
@@ -111,8 +121,8 @@ void xsk_flush(struct xdp_sock *xs)
 
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-   u32 len = xdp->data_end - xdp->data;
-   void *buffer;
+   u32 metalen, len = xdp->data_end - xdp->data;
+   void *to_buf, *from_buf;
u64 addr;
int err;
 
@@ -120,15 +130,24 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff 
*xdp)
return -EINVAL;
 
if (!xskq_peek_addr(xs->umem->fq, &addr) ||
-   len > xs->umem->chunk_size_nohr) {
+   len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
return -ENOSPC;
}
 
addr += xs->umem->headroom;
 
-   buffer = xdp_umem_get_data(xs->umem, addr);
-   memcpy(buffer, xdp->data, len);
+   if (xdp_data_meta_unsupported(xdp)) {
+   from_buf = xdp->data;
+   metalen = 0;
+   } else {
+   from_buf = xdp->data_meta;
+   metalen = xdp->data - xdp->data_meta;
+   }
+
+   to_buf = xdp_umem_get_data(xs->umem, addr);
+   memcpy(to_buf, from_buf, len + metalen);
+   addr += metalen;
err = xskq_produce_batch_desc(xs->rx, addr, len);
if (!err) {
xskq_discard_addr(xs->umem->fq);
-- 
2.17.1



Re: mlx5 driver loading failing on v4.19 / net-next / bpf-next

2018-08-30 Thread Tariq Toukan




On 29/08/2018 6:05 PM, Jesper Dangaard Brouer wrote:

Hi Saeed,

I'm having issues loading mlx5 driver on v4.19 kernels (tested both
net-next and bpf-next), while kernel v4.18 seems to work.  It happens
with a Mellanox ConnectX-5 NIC (and also a CX4-Lx but I removed that
from the system now).



Hi Jesper,

Thanks for your report!

We are working to analyze and debug the issue.


One pain point is very long boot-time, caused by some timeout code in
the driver. The kernel console log (dmesg) says:

[5.763330] mlx5_core :03:00.0: firmware version: 16.22.1002
[5.769367] mlx5_core :03:00.0: 126.016 Gb/s available PCIe bandwidth, 
limited by 8 GT/s x16 link at :00:02.0 (capable of 252.048 Gb/s with 16 
GT/s x16 link)

(...) other drivers loading

[   66.816635] mlx5_core :03:00.0: wait_func:964:(pid 112): 
ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource
[   66.828123] mlx5_core :03:00.0: enable hca failed
[   66.845516] mlx5_core :03:00.0: mlx5_load_one failed with error code -110
[   66.852802] mlx5_core: probe of :03:00.0 failed with error -110

[   66.859347] mlx5_core :03:00.1: firmware version: 16.22.1002
[   66.865388] mlx5_core :03:00.1: 126.016 Gb/s available PCIe bandwidth, 
limited by 8 GT/s x16 link at :00:02.0 (capable of 252.048 Gb/s with 16 
GT/s x16 link)

[  125.787395] XFS (sda3): Mounting V5 Filesystem
[  125.848509] XFS (sda3): Ending clean mount
[  127.984784] mlx5_core :03:00.1: wait_func:964:(pid 5): ENABLE_HCA(0x104) 
timeout. Will cause a leak of a command resource
[  127.996090] mlx5_core :03:00.1: enable hca failed
[  128.013819] mlx5_core :03:00.1: mlx5_load_one failed with error code -110
[  128.021076] mlx5_core: probe of :03:00.1 failed with error -110


Do you have any idea what could be causing this?



We'll update regarding any progress.

Thanks!


Re: [PATCH 0/3] IB/ipoib: Use dev_port to disambiguate port numbers

2018-08-30 Thread Arseny Maslennikov
On Thu, Aug 30, 2018 at 08:43:30AM +0300, Leon Romanovsky wrote:
> On Wed, Aug 29, 2018 at 12:01:14AM +0300, Arseny Maslennikov wrote:
> > Pre-3.15 userspace had trouble distinguishing different ports
> > of a NIC on a single PCI bus/device/function. To solve this,
> > a sysfs field `dev_port' was introduced quite a while ago
> > (commit v3.14-rc3-739-g3f85944fe207), and some relevant device
> > drivers were fixed to use it, but not in case of IPoIB.
> >
> > The convention for some reason never got documented in the kernel, but
> > was immediately adopted by userspace (notably udev[1][2], biosdevname[3])
> >
> > 3/3 documents the sysfs field — that's why I'm CC-ing netdev.
> >
> > This series was tested on and applies to 4.19-rc1.
> >
> > [1] 
> > https://lists.freedesktop.org/archives/systemd-devel/2014-June/020788.html
> > [2] 
> > https://lists.freedesktop.org/archives/systemd-devel/2014-July/020804.html
> > [3] 
> > https://github.com/CloudAutomationNTools/biosdevname/blob/c795d51dd93a5309652f0d635f12a3ecfabfaa72/src/eths.c#L38
> >
> > Arseny Maslennikov (3):
> >   IB/ipoib: Use dev_port to expose network interface port numbers
> >   IB/ipoib: Stop using dev_id to expose port numbers
> 
> I completely agree with previous Yuval's comment, it makes no sense to
> start separate commits for every line.
> 
> Please decide what is best and right behavior and do it, instead of
> pushing it up to be the maintainer's problem.
> 

No problem; will squash those two in v2, then.

> Thanks
> 
> >   Documentation/ABI: document /sys/class/net/*/dev_port
> >
> >  Documentation/ABI/testing/sysfs-class-net | 10 ++
> >  drivers/infiniband/ulp/ipoib/ipoib_main.c |  2 +-
> >  2 files changed, 11 insertions(+), 1 deletion(-)
> >
> > --
> > 2.18.0
> >




signature.asc
Description: PGP signature


Re: [PATCH 0/4] clk-pmc-atom + r8169: Add ether_clk handling to fix suspend issues

2018-08-30 Thread Andy Shevchenko
On Wed, Aug 29, 2018 at 07:06:09PM +0200, Hans de Goede wrote:

> > What a nice stuff, thanks!
> > 
> > Reviewed-by: Andy Shevchenko 
> 
> Thank you (and also thank you for the other reviews)
> 
> > Btw, you probably better to refer to
> > https://bugzilla.kernel.org/show_bug.cgi?id=196861 which is dedicated for 
> > S0ix
> > issue.
> 
> Ok, I've added a link to that. I've also added:
> 
> Reported-by: Johannes Stezenbach 
> 
> To honor Johannes for figuring out that leaving the clocks enabled
> was a problem in the first place.
> 
> This will all be included in v2 of the series when I send it out.

Forgot to mention, instead of Irina, which is not longer working for Intel for
more than a year, put Pierre's address there.


-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e

2018-08-30 Thread Björn Töpel
Den ons 29 aug. 2018 kl 18:12 skrev Daniel Borkmann :
>
> On 08/28/2018 02:44 PM, Björn Töpel wrote:
> > From: Björn Töpel 
> >
> > This patch set introduces zero-copy AF_XDP support for Intel's i40e
> > driver. In the first preparatory patch we also add support for
> > XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
> > redirect them. This was a ToDo from the first AF_XDP zero-copy patch
> > set from early June. Special thanks to Alex Duyck and Jesper Dangaard
> > Brouer for reviewing earlier versions of this patch set.
> >
> > The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
> > that in the interest of time, to get an AF_XDP zero-copy implementation
> > out there for people to try, some code paths have been copied from the
> > XDP path to the zero-copy path. It is out goal to merge the two paths
> > in later patch sets.
> >
> > In contrast to the implementation from beginning of June, this patch
> > set does not require any extra HW queues for AF_XDP zero-copy
> > TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
> > AF_XDP zero-copy TX.
> >
> > Jeff, given that most of changes are in i40e, it is up to you how you
> > would like to route these patches. The set is tagged bpf-next, but
> > if taking it via the Intel driver tree is easier, let us know.
> >
> > We have run some benchmarks on a dual socket system with two Broadwell
> > E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> > cores which gives a total of 28, but only two cores are used in these
> > experiments. One for TR/RX and one for the user space application. The
> > memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> > 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> > memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> > NIC is Intel I40E 40Gbit/s using the i40e driver.
> >
> > Below are the results in Mpps of the I40E NIC benchmark runs for 64
> > and 1500 byte packets, generated by a commercial packet generator HW
> > outputing packets at full 40 Gbit/s line rate. The results are with
> > retpoline and all other spectre and meltdown fixes, so these results
> > are not comparable to the ones from the zero-copy patch set in June.
> >
> > AF_XDP performance 64 byte packets.
> > Benchmark   XDP_SKBXDP_DRVXDP_DRV with zerocopy
> > rxdrop   2.68.2 15.0
> > txpush   2.2-   21.9
> > l2fwd1.72.3 11.3
> >
> > AF_XDP performance 1500 byte packets:
> > Benchmark   XDP_SKB   XDP_DRV XDP_DRV with zerocopy
> > rxdrop   2.03.3 3.3
> > l2fwd1.31.7 3.1
> >
> > XDP performance on our system as a base line:
> >
> > 64 byte packets:
> > XDP stats   CPU pps issue-pps
> > XDP-RX CPU  16  18.4M  0
> >
> > 1500 byte packets:
> > XDP stats   CPU pps issue-pps
> > XDP-RX CPU  16  3.3M0
> >
> > The structure of the patch set is as follows:
> >
> > Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
> > Patches 2-4: Preparatory patches to common xsk and net code
> > Patches 5-7: Preparatory patches to i40e driver code for RX
> > Patch 8: i40e zero-copy support for RX
> > Patch 9: Preparatory patch to i40e driver code for TX
> > Patch 10: i40e zero-copy support for TX
> > Patch 11: Add flags to sample application to force zero-copy/copy mode
> >
> > We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
> >
> >
> > Magnus & Björn
> >
> > Björn Töpel (8):
> >   xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
> >   xdp: export xdp_rxq_info_unreg_mem_model
> >   xsk: expose xdp_umem_get_{data,dma} to drivers
> >   i40e: added queue pair disable/enable functions
> >   i40e: refactor Rx path for re-use
> >   i40e: move common Rx functions to i40e_txrx_common.h
> >   i40e: add AF_XDP zero-copy Rx support
> >   samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
> >
> > Magnus Karlsson (3):
> >   net: add napi_if_scheduled_mark_missed
> >   i40e: move common Tx functions to i40e_txrx_common.h
> >   i40e: add AF_XDP zero-copy Tx support
>
> Thanks for working on this, LGTM! Are you also planning to get ixgbe
> out after that?
>

Yes, the plan is to get ixgbe out as the next driver, but we'll focus
on the existing i40e issues first.


Thanks,
Björn

> For the series:
>
> Acked-by: Daniel Borkmann 
>
> Thanks,
> Daniel


[PATCH net-next] packet: add sockopt to ignore outgoing packets

2018-08-30 Thread Vincent Whitchurch
Currently, the only way to ignore outgoing packets on a packet socket is
via the BPF filter.  With MSG_ZEROCOPY, packets that are looped into
AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
if the filter run from packet_rcv() would reject them.  So the presence
of a packet socket on the interface takes away the benefits of
MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
packets.  (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
cloned, but the cost for that is much lower.)

Add a socket option to allow AF_PACKET sockets to ignore outgoing
packets to solve this.  Note that the *BSDs already have something
similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.

The first intended user is lldpd.

Signed-off-by: Vincent Whitchurch 
---
 include/linux/netdevice.h  |  1 +
 include/uapi/linux/if_packet.h |  1 +
 net/core/dev.c |  3 +++
 net/packet/af_packet.c | 15 +++
 4 files changed, 20 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ca5ab98053c8..8ef14d9edc58 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2317,6 +2317,7 @@ static inline struct sk_buff 
*call_gro_receive_sk(gro_receive_sk_t cb,
 
 struct packet_type {
__be16  type;   /* This is really htons(ether_type). */
+   boolignore_outgoing;
struct net_device   *dev;   /* NULL is wildcarded here   */
int (*func) (struct sk_buff *,
 struct net_device *,
diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 67b61d91d89b..467b654bd4c7 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -57,6 +57,7 @@ struct sockaddr_ll {
 #define PACKET_QDISC_BYPASS20
 #define PACKET_ROLLOVER_STATS  21
 #define PACKET_FANOUT_DATA 22
+#define PACKET_IGNORE_OUTGOING 23
 
 #define PACKET_FANOUT_HASH 0
 #define PACKET_FANOUT_LB   1
diff --git a/net/core/dev.c b/net/core/dev.c
index 325fc5088370..0addb4f0abfe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1947,6 +1947,9 @@ static inline bool skb_loop_sk(struct packet_type *ptype, 
struct sk_buff *skb)
if (!ptype->af_packet_priv || !skb->sk)
return false;
 
+   if (ptype->ignore_outgoing)
+   return true;
+
if (ptype->id_match)
return ptype->id_match(ptype, skb->sk);
else if ((struct sock *)ptype->af_packet_priv == skb->sk)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 5610061e7f2e..37bbafad724a 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3805,6 +3805,18 @@ packet_setsockopt(struct socket *sock, int level, int 
optname, char __user *optv
 
return fanout_set_data(po, optval, optlen);
}
+   case PACKET_IGNORE_OUTGOING:
+   {
+   int val;
+
+   if (optlen != sizeof(val))
+   return -EINVAL;
+   if (copy_from_user(&val, optval, sizeof(val)))
+   return -EFAULT;
+
+   po->prot_hook.ignore_outgoing = !!val;
+   return 0;
+   }
case PACKET_TX_HAS_OFF:
{
unsigned int val;
@@ -3928,6 +3940,9 @@ static int packet_getsockopt(struct socket *sock, int 
level, int optname,
((u32)po->fanout->flags << 24)) :
   0);
break;
+   case PACKET_IGNORE_OUTGOING:
+   val = po->prot_hook.ignore_outgoing;
+   break;
case PACKET_ROLLOVER_STATS:
if (!po->rollover)
return -EINVAL;
-- 
2.11.0



Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute

2018-08-30 Thread Marcel Holtmann
Hi Jiri,

 The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs
 file as DEVTYPE= information. To avoid any kind of race conditions
 between netlink messages and reading from sysfs, it is useful to add the
 same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK
 messages.
 
 For network managing daemons that have to classify ARPHRD_ETHER network
 devices into different types (like Wireless LAN, Bluetooth etc.), this
 avoids the extra round trip to sysfs and parsing of the uevent file.
 
 Signed-off-by: Marcel Holtmann 
 ---
 include/uapi/linux/if_link.h |  2 ++
 net/core/rtnetlink.c | 12 
 2 files changed, 14 insertions(+)
 
 diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
 index 43391e2d1153..781294972bb4 100644
 --- a/include/uapi/linux/if_link.h
 +++ b/include/uapi/linux/if_link.h
 @@ -166,6 +166,8 @@ enum {
IFLA_NEW_IFINDEX,
IFLA_MIN_MTU,
IFLA_MAX_MTU,
 +  IFLA_DEVTYPE,   /* Name value from SET_NETDEV_DEVTYPE */
>>> 
>>> This is not something netdev-related. dev->dev.type is struct device_type.
>>> This is a generic "device" thing. Incorrect to expose over
>>> netdev-specific API. Please use "device" API for this.
>> 
>> it is not just "device" related since this is a sub-classification of 
>> ARPHRD_ETHER type as a wrote above. Don't get hang up that this information 
>> is part of struct device. The dev->dev.type contains strings like "wlan", 
>> "bluetooth", "wimax", "gadget" etc. so that you can tell what kind of 
>> Ethernet card you have.
> 
> I understand. But using dev->dev.type for that purpose seems like an
> abuse. If this is subtype of netdev, it should not be stored in parent
> device. I believe that you need to re-introduce this as a part of struct
> net_device - preferably an enum - and that you can expose over rtnl (and
> net-sysfs).

it is not stored in the parent device. It is stored in the device of the netdev.

As Stephen said, there is no device API. And why would I add the same info 
again and bloat netdev?

drivers/net/bonding/bond_main.c:SET_NETDEV_DEVTYPE(bond_dev, 
&bond_type);
drivers/net/geneve.c:   SET_NETDEV_DEVTYPE(dev, &geneve_type);
drivers/net/macsec.c:   SET_NETDEV_DEVTYPE(dev, &macsec_type);
drivers/net/ppp/ppp_generic.c:  SET_NETDEV_DEVTYPE(dev, &ppp_type);
drivers/net/usb/hso.c:  SET_NETDEV_DEVTYPE(net, &hso_type);
drivers/net/usb/usbnet.c:   SET_NETDEV_DEVTYPE(net, &wlan_type);
drivers/net/usb/usbnet.c:   SET_NETDEV_DEVTYPE(net, &wwan_type);
drivers/net/vxlan.c:SET_NETDEV_DEVTYPE(dev, &vxlan_type);
drivers/net/wimax/i2400m/usb.c: SET_NETDEV_DEVTYPE(net_dev, &i2400mu_type);
drivers/net/wireless/intersil/prism54/islpci_dev.c: 
SET_NETDEV_DEVTYPE(ndev, &wlan_type);
drivers/staging/gdm724x/gdm_lte.c:  SET_NETDEV_DEVTYPE(net, 
&wwan_type);
drivers/usb/gadget/function/u_ether.c:  SET_NETDEV_DEVTYPE(net, &gadget_type);
drivers/usb/gadget/function/u_ether.c:  SET_NETDEV_DEVTYPE(net, &gadget_type);
net/8021q/vlan_dev.c:   SET_NETDEV_DEVTYPE(dev, &vlan_type);
net/bluetooth/6lowpan.c:SET_NETDEV_DEVTYPE(netdev, &bt_type);
net/bluetooth/bnep/core.c:  SET_NETDEV_DEVTYPE(dev, &bnep_type);
net/bridge/br_device.c: SET_NETDEV_DEVTYPE(dev, &br_type);
net/dsa/slave.c:SET_NETDEV_DEVTYPE(slave_dev, &dsa_type);
net/hsr/hsr_device.c:   SET_NETDEV_DEVTYPE(dev, &hsr_type);
net/l2tp/l2tp_eth.c:SET_NETDEV_DEVTYPE(dev, &l2tpeth_type);
net/wireless/core.c:SET_NETDEV_DEVTYPE(dev, &wiphy_type);

So for all these devices we have to add duplicate information now?

I actually don't like that idea. I can rename it to IFLA_FOO or any other 
proposal, but adding the same information twice is pointless.

>> We can revive the patches for adding this information as IFLA_INFO_KIND, but 
>> last time they where reverted since NetworkManager did all the wrong 
>> decisions based on that. That part is fixed now and we can put that back and 
>> declare the sub-type classification of Ethernet device in two places. Or we 
>> just take the information that we added a long time ago for exactly this 
>> sub-classification and provide them to userspace via RTNL.
> 
> IFLA_INFO_KIND serves for different purpose. It is for drivers that
> register rtnl_link_ops. I don't think it would be wise to mix it with
> this.

We could register rtnl_link_ops for these Ethernet drivers, but then we are 
duplicating the information as well.

Regards

Marcel



Re: [PATCH bpf-next 11/11] samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock

2018-08-30 Thread Björn Töpel
Den ons 29 aug. 2018 kl 14:44 skrev Jesper Dangaard Brouer :
>
> On Tue, 28 Aug 2018 14:44:35 +0200
> Björn Töpel  wrote:
>
> > From: Björn Töpel 
> >
> > The -c/--copy -z/--zero-copy flags enforces either copy or zero-copy
> > mode.
>
> Nice, thanks for adding this.  It allows me to quickly test the
> difference between normal-copy vs zero-copy modes.
> (Kernel bpf-next without RETPOLINE).
>
> AF_XDP RX-drop:
>  Normal-copy mode: rx 13,070,318 pps - 76.5 ns
>  Zero-copy   mode: rx 26,132,328 pps - 38.3 ns
>
> Compare to XDP_DROP:  34,251,464 pps - 29.2 ns
>XDP_DROP + read :  30,756,664 pps - 32.5 ns
>
> The normal-copy mode is surprisingly fast (and it works for every
> driver implemeting the regular XDP_REDIRECT action).  It is still
> faster to do in-kernel XDP_DROP than AF_XDP zero-copy mode dropping,
> which was expected given frames travel to a remote CPU before returned
> (don't think remote CPU reads payload?).  The gap in nanosec is
> actually quite small, thus I'm impressed by the SPSC-queue
> implementation working across these CPUs.
>
>
> AF_XDP layer2-fwd:
>  Normal-copy mode: rx  3,200,885   tx  3,200,892
>  Zero-copy   mode: rx 17,026,300   tx 17,026,269
>
> Compare to XDP_TX: rx 14,529,079   tx 14,529,850  - 68.82 ns
>  XDP_REDIRECT: rx 13,235,785   tx 13,235,784  - 75.55 ns
>
> The copy-mode is slow because it allocates SKBs internally (I do
> wonder if we could speed it up by using ndo_xdp_xmit + disable-BH).
> More intersting is that the zero-copy is faster than XDP_TX and
> XDP_REDIRECT. I think the speedup comes from avoiding some DMA mapping
> calls with ZC.
>
> Side-note: XDP_TX vs. REDIRECT: 75.55 - 68.82 = 6.73 ns.  The cost of
> going through the xdp_do_redirect_map core is actually quite small :-)
> (I have some micro optimizations that should help ~2ns).
>
>
> AF_XDP TX-only:
>  Normal-copy mode: tx  2,853,461 pps
>  Zero-copy   mode: tx 22,255,311 pps
>
> (There is not XDP mode that does TX to compare against)
>

Kudos for doing the in-depth benchmarking!


Thanks!
Björn

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute

2018-08-30 Thread Jiri Pirko
Thu, Aug 30, 2018 at 12:13:40PM CEST, mar...@holtmann.org wrote:
>Hi Jiri,
>
> The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs
> file as DEVTYPE= information. To avoid any kind of race conditions
> between netlink messages and reading from sysfs, it is useful to add the
> same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK
> messages.
> 
> For network managing daemons that have to classify ARPHRD_ETHER network
> devices into different types (like Wireless LAN, Bluetooth etc.), this
> avoids the extra round trip to sysfs and parsing of the uevent file.
> 
> Signed-off-by: Marcel Holtmann 
> ---
> include/uapi/linux/if_link.h |  2 ++
> net/core/rtnetlink.c | 12 
> 2 files changed, 14 insertions(+)
> 
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index 43391e2d1153..781294972bb4 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -166,6 +166,8 @@ enum {
>   IFLA_NEW_IFINDEX,
>   IFLA_MIN_MTU,
>   IFLA_MAX_MTU,
> + IFLA_DEVTYPE,   /* Name value from SET_NETDEV_DEVTYPE */
 
 This is not something netdev-related. dev->dev.type is struct device_type.
 This is a generic "device" thing. Incorrect to expose over
 netdev-specific API. Please use "device" API for this.
>>> 
>>> it is not just "device" related since this is a sub-classification of 
>>> ARPHRD_ETHER type as a wrote above. Don't get hang up that this information 
>>> is part of struct device. The dev->dev.type contains strings like "wlan", 
>>> "bluetooth", "wimax", "gadget" etc. so that you can tell what kind of 
>>> Ethernet card you have.
>> 
>> I understand. But using dev->dev.type for that purpose seems like an
>> abuse. If this is subtype of netdev, it should not be stored in parent
>> device. I believe that you need to re-introduce this as a part of struct
>> net_device - preferably an enum - and that you can expose over rtnl (and
>> net-sysfs).
>
>it is not stored in the parent device. It is stored in the device of the 
>netdev.

Yeah. That is what I ment. "Device struct" associated with the
netdevice.

>
>As Stephen said, there is no device API. And why would I add the same info 
>again and bloat netdev?
>
>drivers/net/bonding/bond_main.c:SET_NETDEV_DEVTYPE(bond_dev, 
>&bond_type);
>drivers/net/geneve.c:   SET_NETDEV_DEVTYPE(dev, &geneve_type);
>drivers/net/macsec.c:   SET_NETDEV_DEVTYPE(dev, &macsec_type);
>drivers/net/ppp/ppp_generic.c:  SET_NETDEV_DEVTYPE(dev, &ppp_type);
>drivers/net/usb/hso.c:  SET_NETDEV_DEVTYPE(net, &hso_type);
>drivers/net/usb/usbnet.c:   SET_NETDEV_DEVTYPE(net, &wlan_type);
>drivers/net/usb/usbnet.c:   SET_NETDEV_DEVTYPE(net, &wwan_type);
>drivers/net/vxlan.c:SET_NETDEV_DEVTYPE(dev, &vxlan_type);
>drivers/net/wimax/i2400m/usb.c: SET_NETDEV_DEVTYPE(net_dev, &i2400mu_type);
>drivers/net/wireless/intersil/prism54/islpci_dev.c: 
>SET_NETDEV_DEVTYPE(ndev, &wlan_type);
>drivers/staging/gdm724x/gdm_lte.c:  SET_NETDEV_DEVTYPE(net, 
>&wwan_type);
>drivers/usb/gadget/function/u_ether.c:  SET_NETDEV_DEVTYPE(net, &gadget_type);
>drivers/usb/gadget/function/u_ether.c:  SET_NETDEV_DEVTYPE(net, &gadget_type);
>net/8021q/vlan_dev.c:   SET_NETDEV_DEVTYPE(dev, &vlan_type);
>net/bluetooth/6lowpan.c:SET_NETDEV_DEVTYPE(netdev, &bt_type);
>net/bluetooth/bnep/core.c:  SET_NETDEV_DEVTYPE(dev, &bnep_type);
>net/bridge/br_device.c: SET_NETDEV_DEVTYPE(dev, &br_type);
>net/dsa/slave.c:SET_NETDEV_DEVTYPE(slave_dev, &dsa_type);
>net/hsr/hsr_device.c:   SET_NETDEV_DEVTYPE(dev, &hsr_type);
>net/l2tp/l2tp_eth.c:SET_NETDEV_DEVTYPE(dev, &l2tpeth_type);
>net/wireless/core.c:SET_NETDEV_DEVTYPE(dev, &wiphy_type);
>
>So for all these devices we have to add duplicate information now?

The fact the things are done now it a wrong way does not justify
extending UAPI in the same wrong way. So yes, change them all if you
need it. The "ethernet-subtype" should not an attribute of "struct
device" but rather "struct net_device".

>
>I actually don't like that idea. I can rename it to IFLA_FOO or any other 
>proposal, but adding the same information twice is pointless.

So don't expose the "struct device" attribute over rtnetlink. Use
"struct device"-specific interface for that.

>
>>> We can revive the patches for adding this information as IFLA_INFO_KIND, 
>>> but last time they where reverted since NetworkManager did all the wrong 
>>> decisions based on that. That part is fixed now and we can put that back 
>>> and declare the sub-type classification of Ethernet device in two places. 
>>> Or we just take the information that we added a long time ago for exactly 
>>> this sub-classification and provide them to userspace via RTNL.
>> 
>> IFLA_INFO_KIND serves for different purpose. It is for drivers that
>> register rtnl_link_ops. I

[RFT net-next] net: stmmac: Rework coalesce timer and fix multi-queue races

2018-08-30 Thread Jose Abreu
[ As for now this is only for testing! ]

This follows David Miller advice and tries to fix coalesce timer in
multi-queue scenarios.

We are now using per-queue coalesce values and per-queue TX timer. This
assumes that tx_queues == rx_queues, which can not be necessarly true.
Official patch will need to have this fixed.

Coalesce timer default values was changed to 1ms and the coalesce frames
to 25.

Tested in B2B setup between XGMAC2 and GMAC5.

Signed-off-by: Jose Abreu 
Cc: Jerome Brunet 
Cc: Martin Blumenstingl 
Cc: David S. Miller 
Cc: Joao Pinto 
Cc: Giuseppe Cavallaro 
Cc: Alexandre Torgue 
---
 drivers/net/ethernet/stmicro/stmmac/common.h  |   4 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac.h  |   6 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 117 +-
 3 files changed, 75 insertions(+), 52 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index 1854f270ad66..b1b305f8f414 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -258,10 +258,10 @@ struct stmmac_safety_stats {
 #define MAX_DMA_RIWT   0xff
 #define MIN_DMA_RIWT   0x20
 /* Tx coalesce parameters */
-#define STMMAC_COAL_TX_TIMER   4
+#define STMMAC_COAL_TX_TIMER   1000
 #define STMMAC_MAX_COAL_TX_TICK10
 #define STMMAC_TX_MAX_FRAMES   256
-#define STMMAC_TX_FRAMES   64
+#define STMMAC_TX_FRAMES   25
 
 /* Packets types */
 enum packets_types {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h 
b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 76649adf8fb0..db42a5f03a5f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -48,6 +48,9 @@ struct stmmac_tx_info {
 
 /* Frequently used values are kept adjacent for cache effect */
 struct stmmac_tx_queue {
+   u32 tx_count_frames;
+   int tx_timer_active;
+   struct timer_list txtimer;
u32 queue_index;
struct stmmac_priv *priv_data;
struct dma_extended_desc *dma_etx cacheline_aligned_in_smp;
@@ -109,15 +112,12 @@ struct stmmac_pps_cfg {
 
 struct stmmac_priv {
/* Frequently used values are kept adjacent for cache effect */
-   u32 tx_count_frames;
u32 tx_coal_frames;
u32 tx_coal_timer;
-   bool tx_timer_armed;
 
int tx_coalesce;
int hwts_tx_en;
bool tx_path_in_lpi_mode;
-   struct timer_list txtimer;
bool tso;
 
unsigned int dma_buf_sz;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index ff1ffb46198a..ae26a6e8608e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2034,7 +2034,6 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
u32 channels_to_check = tx_channel_count > rx_channel_count ?
tx_channel_count : rx_channel_count;
u32 chan;
-   bool poll_scheduled = false;
int status[max_t(u32, MTL_MAX_TX_QUEUES, MTL_MAX_RX_QUEUES)];
 
/* Make sure we never check beyond our status buffer. */
@@ -2058,7 +2057,6 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
if (likely(napi_schedule_prep(&rx_q->napi))) {
stmmac_disable_dma_irq(priv, priv->ioaddr, 
chan);
__napi_schedule(&rx_q->napi);
-   poll_scheduled = true;
}
}
}
@@ -2067,21 +2065,13 @@ static void stmmac_dma_interrupt(struct stmmac_priv 
*priv)
 * If we didn't schedule poll, see if any DMA channel (used by tx) has a
 * completed transmission, if so, call stmmac_poll (once).
 */
-   if (!poll_scheduled) {
-   for (chan = 0; chan < tx_channel_count; chan++) {
-   if (status[chan] & handle_tx) {
-   /* It doesn't matter what rx queue we choose
-* here. We use 0 since it always exists.
-*/
-   struct stmmac_rx_queue *rx_q =
-   &priv->rx_queue[0];
+   for (chan = 0; chan < tx_channel_count; chan++) {
+   if (status[chan] & handle_tx) {
+   struct stmmac_rx_queue *rx_q = &priv->rx_queue[chan];
 
-   if (likely(napi_schedule_prep(&rx_q->napi))) {
-   stmmac_disable_dma_irq(priv,
-   priv->ioaddr, chan);
-   __napi_schedule(&rx_q->napi);
-   }
-   break;
+   if (likely(napi_schedule_prep(&rx_q->napi))) {
+  

[PATCH net-next 2/2] netlink: ipv6 MLD join notifications

2018-08-30 Thread Patrick Ruddy
Some userspace applications need to know about MLD joins from the
kernel for 2 reasons:
1. To allow the programming of multicast MAC filters in hardware
2. To form a multicast FORUS list for non link-local multicast
   groups to be sent to the kernel and from there to the interested
   party.
(1) can be fulfilled but simply sending the hardware multicast MAC
address to be programmed but (2) requires the L3 address to be sent
since this cannot be constructed from the MAC address whereas the
reverse translation is a standard library function.

This commit provides addition and deletion of multicast addresses
using the RTM_NEWADDR and RTM_DELADDR messages. It also provides
the RTM_GETADDR extension to allow multicast join state to be read
from the kernel.

Signed-off-by: Patrick Ruddy 
---
 net/ipv6/addrconf.c | 44 +-
 net/ipv6/mcast.c| 66 +
 2 files changed, 98 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d51a8c0b3372..e4e7362f9298 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4855,11 +4855,13 @@ static int inet6_fill_ifaddr(struct sk_buff *skb, 
struct inet6_ifaddr *ifa,
 }
 
 static int inet6_fill_ifmcaddr(struct sk_buff *skb, struct ifmcaddr6 *ifmca,
-   u32 portid, u32 seq, int event, u16 flags)
+   u32 portid, u32 seq, int event, u16 flags)
 {
struct nlmsghdr  *nlh;
u8 scope = RT_SCOPE_UNIVERSE;
int ifindex = ifmca->idev->dev->ifindex;
+   int addr_type = (event == RTM_GETMULTICAST) ? IFA_MULTICAST :
+   IFA_ADDRESS;
 
if (ipv6_addr_scope(&ifmca->mca_addr) & IFA_SITE)
scope = RT_SCOPE_SITE;
@@ -4869,7 +4871,7 @@ static int inet6_fill_ifmcaddr(struct sk_buff *skb, 
struct ifmcaddr6 *ifmca,
return -EMSGSIZE;
 
put_ifaddrmsg(nlh, 128, IFA_F_PERMANENT, scope, ifindex);
-   if (nla_put_in6_addr(skb, IFA_MULTICAST, &ifmca->mca_addr) < 0 ||
+   if (nla_put_in6_addr(skb, addr_type, &ifmca->mca_addr) < 0 ||
put_cacheinfo(skb, ifmca->mca_cstamp, ifmca->mca_tstamp,
  INFINITY_LIFE_TIME, INFINITY_LIFE_TIME) < 0) {
nlmsg_cancel(skb, nlh);
@@ -4916,7 +4918,7 @@ enum addr_type_t {
 /* called with rcu_read_lock() */
 static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
  struct netlink_callback *cb, enum addr_type_t type,
- int s_ip_idx, int *p_ip_idx)
+ int s_ip_idx, int *p_ip_idx, int msg_type)
 {
struct ifmcaddr6 *ifmca;
struct ifacaddr6 *ifaca;
@@ -4935,7 +4937,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct 
sk_buff *skb,
err = inet6_fill_ifaddr(skb, ifa,
NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq,
-   RTM_NEWADDR,
+   msg_type,
NLM_F_MULTI);
if (err < 0)
break;
@@ -4952,7 +4954,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct 
sk_buff *skb,
err = inet6_fill_ifmcaddr(skb, ifmca,
  NETLINK_CB(cb->skb).portid,
  cb->nlh->nlmsg_seq,
- RTM_GETMULTICAST,
+ msg_type,
  NLM_F_MULTI);
if (err < 0)
break;
@@ -4967,7 +4969,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct 
sk_buff *skb,
err = inet6_fill_ifacaddr(skb, ifaca,
  NETLINK_CB(cb->skb).portid,
  cb->nlh->nlmsg_seq,
- RTM_GETANYCAST,
+ msg_type,
  NLM_F_MULTI);
if (err < 0)
break;
@@ -4982,7 +4984,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct 
sk_buff *skb,
 }
 
 static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
-  enum addr_type_t type)
+  enum addr_type_t type, int msg_type)
 {
struct net *net = sock_net(skb->sk);
int h, s_h;
@@ -5012,7 +5014,7 @@ static int inet6_dump_addr(struct sk_buff *skb, struct 
netlink_callback *cb,
goto cont;
 
if (in6_dump_addrs(idev, skb, cb, type,
-

kernels >= v4.12 oops/crash with ipsec-traffic: partly bisected

2018-08-30 Thread Wolfgang Walter
Hello,

kernels >= 4.12 do not work on one of our main routers. They crash as soon as 
ipsec-tunnels are configured and ipsec-traffic actually flows.

Just configuring ipsec (that is starting strongswan) does not trigger the 
oops.

I finally found time to bisect that. Though I have not completed that yet, I 
already narrowed it down to the following commits

good: d24406c85d123df773bc4df88ad5da2233896919
udp: call dst_hold_safe() in udp_sk_rx_set_dst()
bad: 5b7c9a8ff828287af5aebe93e707271bf1a82cc3
net: remove dst gc related code

Commits in between are almost all changes to remove dst gc.

Now we have other machines which run just fine with the very same kernels 
doing ipsec. They differ insofar as they have much less cores, do not use the 
ixgbe driver, do not have 10G and terminate only a few tunnels instead of 
hundreds.

I already tested distribution kernels > 4.12 from debian, they also crash.

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


Re: [Patch net-nnext] net_sched: add missing tcf_lock for act_connmark

2018-08-30 Thread Vlad Buslov
On Wed 29 Aug 2018 at 17:15, Cong Wang  wrote:
> According to the new locking rule, we have to take tcf_lock
> for both ->init() and ->dump(), as RTNL will be removed.
> However, it is missing for act_connmark.

Thank you for finding and fixing this!

>
> Cc: Vlad Buslov 
> Signed-off-by: Cong Wang 
> ---
>  net/sched/act_connmark.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
> index e869c0ee63c8..8475913f2070 100644
> --- a/net/sched/act_connmark.c
> +++ b/net/sched/act_connmark.c
> @@ -143,8 +143,10 @@ static int tcf_connmark_init(struct net *net, struct 
> nlattr *nla,
>   return -EEXIST;
>   }
>   /* replacing action and zone */
> + spin_lock_bh(&ci->tcf_lock);
>   ci->tcf_action = parm->action;
>   ci->zone = parm->zone;
> + spin_unlock_bh(&ci->tcf_lock);
>   ret = 0;
>   }
>  
> @@ -156,16 +158,16 @@ static inline int tcf_connmark_dump(struct sk_buff 
> *skb, struct tc_action *a,
>  {
>   unsigned char *b = skb_tail_pointer(skb);
>   struct tcf_connmark_info *ci = to_connmark(a);
> -
>   struct tc_connmark opt = {
>   .index   = ci->tcf_index,
>   .refcnt  = refcount_read(&ci->tcf_refcnt) - ref,
>   .bindcnt = atomic_read(&ci->tcf_bindcnt) - bind,
> - .action  = ci->tcf_action,
> - .zone   = ci->zone,
>   };
>   struct tcf_t t;
>  
> + spin_lock_bh(&ci->tcf_lock);
> + opt.action = ci->tcf_action;
> + opt.zone = ci->zone;
>   if (nla_put(skb, TCA_CONNMARK_PARMS, sizeof(opt), &opt))
>   goto nla_put_failure;
>  
> @@ -173,9 +175,12 @@ static inline int tcf_connmark_dump(struct sk_buff *skb, 
> struct tc_action *a,
>   if (nla_put_64bit(skb, TCA_CONNMARK_TM, sizeof(t), &t,
> TCA_CONNMARK_PAD))
>   goto nla_put_failure;
> + spin_unlock_bh(&ci->tcf_lock);
>  
>   return skb->len;
> +
>  nla_put_failure:
> + spin_unlock_bh(&ci->tcf_lock);
>   nlmsg_trim(skb, b);
>   return -1;
>  }



[PATCH net-next 1/2] netlink: ipv4 IGMP join notifications

2018-08-30 Thread Patrick Ruddy
Some userspace applications need to know about IGMP joins from the kernel
for 2 reasons
1. To allow the programming of multicast MAC filters in hardware
2. To form a multicast FORUS list for non link-local multicast
   groups to be sent to the kernel and from there to the interested
   party.
(1) can be fulfilled but simply sending the hardware multicast MAC
address to be programmed but (2) requires the L3 address to be sent
since this cannot be constructed from the MAC address whereas the
reverse translation is a standard library function.

This commit provides addition and deletion of multicast addresses
using the RTM_NEWADDR and RTM_DELADDR messages. It also provides
the RTM_GETADDR extension to allow multicast join state to be read
from the kernel.

Signed-off-by: Patrick Ruddy 
---
 include/linux/igmp.h |  2 +
 net/ipv4/devinet.c   | 39 +--
 net/ipv4/igmp.c  | 90 
 3 files changed, 120 insertions(+), 11 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 119f53941c12..1fb417865e7d 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -130,6 +130,8 @@ extern void ip_mc_unmap(struct in_device *);
 extern void ip_mc_remap(struct in_device *);
 extern void ip_mc_dec_group(struct in_device *in_dev, __be32 addr);
 extern void ip_mc_inc_group(struct in_device *in_dev, __be32 addr);
+extern int ip_mc_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb,
+struct net_device *dev);
 int ip_mc_check_igmp(struct sk_buff *skb, struct sk_buff **skb_trimmed);
 
 #endif
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index ea4bd8a52422..42f7dcc4fb5e 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -57,6 +57,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1651,6 +1652,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
netlink_callback *cb)
int h, s_h;
int idx, s_idx;
int ip_idx, s_ip_idx;
+   int multicast, mcast_idx;
struct net_device *dev;
struct in_device *in_dev;
struct in_ifaddr *ifa;
@@ -1659,6 +1661,8 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
netlink_callback *cb)
s_h = cb->args[0];
s_idx = idx = cb->args[1];
s_ip_idx = ip_idx = cb->args[2];
+   multicast = cb->args[3];
+   mcast_idx = cb->args[4];
 
for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
idx = 0;
@@ -1675,18 +1679,29 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
netlink_callback *cb)
if (!in_dev)
goto cont;
 
-   for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
-ifa = ifa->ifa_next, ip_idx++) {
-   if (ip_idx < s_ip_idx)
-   continue;
-   if (inet_fill_ifaddr(skb, ifa,
-NETLINK_CB(cb->skb).portid,
-cb->nlh->nlmsg_seq,
-RTM_NEWADDR, NLM_F_MULTI) < 0) {
-   rcu_read_unlock();
-   goto done;
+   if (!multicast) {
+   for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
+ifa = ifa->ifa_next, ip_idx++) {
+   if (ip_idx < s_ip_idx)
+   continue;
+   if (inet_fill_ifaddr(skb, ifa,
+
NETLINK_CB(cb->skb).portid,
+cb->nlh->nlmsg_seq,
+RTM_NEWADDR,
+NLM_F_MULTI) < 0) {
+   rcu_read_unlock();
+   goto done;
+   }
+   nl_dump_check_consistent(cb,
+
nlmsg_hdr(skb));
}
-   nl_dump_check_consistent(cb, nlmsg_hdr(skb));
+   /* set for multicast loop */
+   multicast++;
+   }
+   /* loop over multicast addresses */
+   if (ip_mc_dump_ifaddr(skb, cb, dev) < 0) {
+   rcu_read_unlock();
+   goto done;
}
 cont:
idx++;
@@ -1698,6 +1713,8 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
netlink_callback *cb)
cb->args[0]

Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support

2018-08-30 Thread Björn Töpel

On 2018-08-29 21:14, Jakub Kicinski wrote:
> On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote:
>> From: Björn Töpel 
>>
>> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
>> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
>> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
>> queue.
>>
>> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
>>
>> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
>> will allocate a new buffer and copy the zero-copy frame prior passing
>> it to the kernel stack.
>>
>> Signed-off-by: Björn Töpel 
>
> Mm.. I'm surprised you don't run into buffer reuse issues that I had
> when playing with AF_XDP.  What happens in i40e if someone downs the
> interface?  Will UMEMs get destroyed?  Will the RX buffers get freed?
>

The UMEM will linger in the driver until the sockets are dead.

> I'll shortly send an RFC with my quick and dirty RX buffer reuse queue,
> FWIW.
>

Some background for folks that don't know the details: A zero-copy
capable driver picks buffers off the fill ring and places them on the
hardware Rx ring to be completed at a later point when DMA is
complete. Analogous for the Tx side; The driver picks buffers off the
Tx ring and places them on the Tx hardware ring.

In the typical flow, the Rx buffer will be placed onto an Rx ring
(completed to the user), and the Tx buffer will be placed on the
completion ring to notify the user that the transfer is done.

However, if the driver needs to tear down the hardware rings for some
reason (interface goes down, reconfiguration and such), what should be
done with the Rx and Tx buffers that has been given to the driver?

So, to frame the problem: What should a driver do when this happens,
so that buffers aren't leaked?

Note that when the UMEM is going down, there's no need to complete
anything, since the sockets are dying/dead already.

This is, as you state, a missing piece in the implementation and needs
to be fixed.

Now on to possible solutions:

1. Complete the buffers back to the user. For Tx, this is probably the
   best way -- just place the buffers onto the completion ring.

   For Rx, we can give buffers back to user space by setting the
   length in the Rx descriptor to zero And putting them on the Rx
   ring. However, one complication here is that we do not have any
   back-pressure mechanism for the Rx side like we have on Tx. If the
   Rx ring(s) is (are) full the kernel will have to leak them or
   implement a retry mechanism (ugly and should be avoided).

   Another option to solve this without needing any retry or leaking
   for Rx is to implement the same back-pressure mechanism that we
   have on the Tx path in the Rx path. In the Tx path, the driver will
   only get a Tx packet to send if there is space for it in the
   completion ring. On Rx, this would be that the driver would only
   get a buffer from the fill ring if there is space for it to put it
   on the Rx ring. The drawback of this is that it would likely impact
   performance negatively since the Rx ring would have to be touch one
   more time (in the Tx path, it increased performance since it made
   it possible to implement the Tx path without any buffering), but it
   would guarantee that all buffers can always be returned to user
   space making solution this a viable option.

2. Store the buffers internally in the driver, and make sure that they
   are inserted into the "normal flow" again. For Rx that would be
   putting the buffers back into the allocation scheme that the driver
   is using. For Tx, placing the buffers back onto the Tx HW ring
   (plus all the logic for making sure that all corner cases work).

3. Mark the socket(s) as in error state, en require the user to redo
   the setup. This is bit harsh...

For i40e I think #2 for Rx (buffers reside in kernel, return to
allocator) and #1 for Tx (complete to userland).

Your RFC is plumbing to implement #2 for Rx in a driver. I'm not a fan
of extending the umem with the "reuse queue". This decision is really
up the driver. Some driver might prefer another scheme, or simply
prefer storing the buffers in another manner.

Looking forward, as both you and Jesper has alluded to, we need a
proper allocator for zero-copy. Then it would be a matter of injecting
the Rx buffers back to the allocator.

I'll send out a patch, so that we don't leak the buffers, but I'd like
to hear your thoughts on what the behavior should be.

And what should the behavior be when the netdev is removed from the
kernel?

And thanks for looking into the code!


Björn


[PATCH net] tcp: do not restart timewait timer on rst reception

2018-08-30 Thread Florian Westphal
RFC 1337 says:
 ''Ignore RST segments in TIME-WAIT state.
   If the 2 minute MSL is enforced, this fix avoids all three hazards.''

So with net.ipv4.tcp_rfc1337=1, expected behaviour is to have TIME-WAIT sk
expire rather than removing it instantly when a reset is received.

However, Linux will also re-start the TIME-WAIT timer.

This causes connect to fail when tying to re-use ports or very long
delays (until syn retry interval exceeds MSL).

packetdrill test case:
// Demonstrate bogus rearming of TIME-WAIT timer in rfc1337 mode.
`sysctl net.ipv4.tcp_rfc1337=1`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 29200 
0.100 > S. 0:0(0) ack 1 
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

// Receive first segment
0.310 < P. 1:1001(1000) ack 1 win 46

// Send one ACK
0.310 > . 1:1(0) ack 1001

// read 1000 byte
0.310 read(4, ..., 1000) = 1000

// Application writes 100 bytes
0.350 write(4, ..., 100) = 100
0.350 > P. 1:101(100) ack 1001

// ACK
0.500 < . 1001:1001(0) ack 101 win 257

// close the connection
0.600 close(4) = 0
0.600 > F. 101:101(0) ack 1001 win 244

// Our side is in FIN_WAIT_1 & waits for ack to fin
0.7 < . 1001:1001(0) ack 102 win 244

// Our side is in FIN_WAIT_2 with no outstanding data.
0.8 < F. 1001:1001(0) ack 102 win 244
0.8 > . 102:102(0) ack 1002 win 244

// Our side is now in TIME_WAIT state, send ack for fin.
0.9 < F. 1002:1002(0) ack 102 win 244
0.9 > . 102:102(0) ack 1002 win 244

// Peer reopens with in-window SYN:
1.000 < S 1000:1000(0) win 9200 

// Therefore, reply with ACK.
1.000 > . 102:102(0) ack 1002 win 244

// Peer sends RST for this ACK.  Normally this RST results
// in tw socket removal, but rfc1337=1 setting prevents this.
1.100 < R 1002:1002(0) win 244

// second syn. Due to rfc1337=1 expect another pure ACK.
31.0 < S 1000:1000(0) win 9200 
31.0 > . 102:102(0) ack 1002 win 244

// .. and another RST from peer.
31.1 < R 1002:1002(0) win 244
31.2 `echo no timer restart;ss -m -e -a -i -n -t -o state TIME-WAIT`

// third syn after one minute.  Time-Wait socket should have expired by now.
63.0 < S 1000:1000(0) win 9200 

// so we expect a syn-ack & 3whs to proceed from here on.
63.0 > S. 0:0(0) ack 1 

Without this patch, 'ss' shows restarts of tw timer and last packet is
thus just another pure ack, more than one minute later.

This restores the original code from commit 283fd6cf0be690a83
("Merge in ANK networking jumbo patch") in netdev-vger-cvs.git .

For some reason the else branch was removed/lost in 1f28b683339f7
("Merge in TCP/UDP optimizations and [..]") and timer restart became
unconditional.

Reported-by: Michal Tesar 
Signed-off-by: Florian Westphal 
---
 net/ipv4/tcp_minisocks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 1dda1341a223..b690132f5da2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -184,8 +184,9 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, 
struct sk_buff *skb,
inet_twsk_deschedule_put(tw);
return TCP_TW_SUCCESS;
}
+   } else {
+   inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);
}
-   inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);
 
if (tmp_opt.saw_tstamp) {
tcptw->tw_ts_recent   = tmp_opt.rcv_tsval;
-- 
2.16.4



Re: [PATCH bpf-next 0/3] bpf: implement percpu map pretty print for bpffs and bpftool

2018-08-30 Thread Daniel Borkmann
On 08/29/2018 11:43 PM, Yonghong Song wrote:
> Commit a26ca7c982cb ("bpf: btf: Add pretty print support to the
> basic arraymap") and Commit 699c86d6ec21 ("bpf: btf: add pretty print
> for hash/lru_hash maps") added bpffs pretty print for array, hash and
> lru hash maps. The pretty print gives users a structurally formatted
> dump for keys/values which much easy to understand than raw bytes.
> 
> This patch set implemented bpffs pretty print support for
> percpu arraymap, percpu hashmap and percpu lru hashmap.
> For complex key/value types, the pretty print here is even more useful
> due to
>   . large volumne of data making it even harder to correlate bytes
> to a particular field in a particular cpu.
>   . kernel rounds the value size for each cpu to multiple of 8.
> User has to be aware of this otherwise wrong value may be
> derived from cpu 1/2/...
> 
> For example, we may have a bpffs pretty print like below:
>43602: {
> cpu0: 
> {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
> cpu1: 
> {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
> cpu2: 
> {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
> cpu3: 
> {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
>}
> for a percpu map.
> 
> This patch also added percpu formatted print on bpftool. For example,
> bpftool may print like below:
> {
> "key": 0,
> "values": [{
> "cpu": 0,
> "value": {
> "ui32": 0,
> "ui16": 0,
> }
> },{
> "cpu": 1,
> "value": {
> "ui32": 1,
> "ui16": 0,
> }
> },{
> "cpu": 2,
> "value": {
> "ui32": 2,
> "ui16": 0,
> }
> },{
> "cpu": 3,
> "value": {
> "ui32": 3,
> "ui16": 0,
> }
> }
> ]
> }
> 
> Patch #1 implemented bpffs pretty print for percpu arraymap/hash/lru_hash
> in kernel. Patch #2 added the test case in tools bpf selftest test_btf.
> Patch #3 added percpu map btf based dump.
> 
> Yonghong Song (3):
>   bpf: add bpffs pretty print for percpu arraymap/hash/lru_hash
>   tools/bpf: add bpffs percpu map pretty print tests in test_btf
>   tools/bpf: bpftool: add btf percpu map formated dump
> 
>  kernel/bpf/arraymap.c  |  24 +
>  kernel/bpf/hashtab.c   |  31 ++
>  tools/bpf/bpftool/map.c|  33 +-
>  tools/testing/selftests/bpf/test_btf.c | 179 
> ++---
>  4 files changed, 230 insertions(+), 37 deletions(-)

Applied to bpf-next, thanks Yonghong!


Re: [PATCH bpf-next] xsk: include XDP meta data in AF_XDP frames

2018-08-30 Thread Daniel Borkmann
On 08/30/2018 10:09 AM, Björn Töpel wrote:
> From: Björn Töpel 
> 
> Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
> not include XDP meta data in the data buffers copied out to the user
> application.
> 
> In this commit, we check if meta data is available, and if so, it is
> prepended to the frame.
> 
> Signed-off-by: Björn Töpel 
> ---
>  net/xdp/xsk.c | 37 -
>  1 file changed, 28 insertions(+), 9 deletions(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 4e937cd7c17d..817e4cee1540 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr);
>  
>  static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>  {
> - void *buffer;
> + void *to_buf, *from_buf;
> + u32 metalen;
>   u64 addr;
>   int err;
>  
>   if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> - len > xs->umem->chunk_size_nohr) {
> + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>   xs->rx_dropped++;
>   return -ENOSPC;
>   }
>  
>   addr += xs->umem->headroom;
>  
> - buffer = xdp_umem_get_data(xs->umem, addr);
> - memcpy(buffer, xdp->data, len);
> + if (xdp_data_meta_unsupported(xdp)) {

Nit: Probably makes sense to wrap with unlikely() since we expect all
drivers to implement this.

> + from_buf = xdp->data;
> + metalen = 0;
> + } else {
> + from_buf = xdp->data_meta;
> + metalen = xdp->data - xdp->data_meta;
> + }
> +
> + to_buf = xdp_umem_get_data(xs->umem, addr);
> + memcpy(to_buf, from_buf, len + metalen);
> + addr += metalen;
>   err = xskq_produce_batch_desc(xs->rx, addr, len);
>   if (!err) {
>   xskq_discard_addr(xs->umem->fq);
> @@ -111,8 +121,8 @@ void xsk_flush(struct xdp_sock *xs)
>  
>  int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>  {
> - u32 len = xdp->data_end - xdp->data;
> - void *buffer;
> + u32 metalen, len = xdp->data_end - xdp->data;
> + void *to_buf, *from_buf;
>   u64 addr;
>   int err;
>  
> @@ -120,15 +130,24 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct 
> xdp_buff *xdp)
>   return -EINVAL;
>  
>   if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> - len > xs->umem->chunk_size_nohr) {
> + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>   xs->rx_dropped++;
>   return -ENOSPC;
>   }
>  
>   addr += xs->umem->headroom;
>  
> - buffer = xdp_umem_get_data(xs->umem, addr);
> - memcpy(buffer, xdp->data, len);
> + if (xdp_data_meta_unsupported(xdp)) {
> + from_buf = xdp->data;
> + metalen = 0;

Note that this condition should be dead code. netif_receive_generic_xdp()
sets xdp->data_meta to xdp->data, so all good here and above is never hit.

> + } else {
> + from_buf = xdp->data_meta;
> + metalen = xdp->data - xdp->data_meta;
> + }
> +
> + to_buf = xdp_umem_get_data(xs->umem, addr);
> + memcpy(to_buf, from_buf, len + metalen);
> + addr += metalen;
>   err = xskq_produce_batch_desc(xs->rx, addr, len);
>   if (!err) {
>   xskq_discard_addr(xs->umem->fq);
> 



Re: [PATCH bpf-next] xsk: include XDP meta data in AF_XDP frames

2018-08-30 Thread Björn Töpel
Den tors 30 aug. 2018 kl 14:37 skrev Daniel Borkmann :
>
> On 08/30/2018 10:09 AM, Björn Töpel wrote:
> > From: Björn Töpel 
> >
> > Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
> > not include XDP meta data in the data buffers copied out to the user
> > application.
> >
> > In this commit, we check if meta data is available, and if so, it is
> > prepended to the frame.
> >
> > Signed-off-by: Björn Töpel 
> > ---
> >  net/xdp/xsk.c | 37 -
> >  1 file changed, 28 insertions(+), 9 deletions(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 4e937cd7c17d..817e4cee1540 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr);
> >
> >  static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
> >  {
> > - void *buffer;
> > + void *to_buf, *from_buf;
> > + u32 metalen;
> >   u64 addr;
> >   int err;
> >
> >   if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> > - len > xs->umem->chunk_size_nohr) {
> > + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
> >   xs->rx_dropped++;
> >   return -ENOSPC;
> >   }
> >
> >   addr += xs->umem->headroom;
> >
> > - buffer = xdp_umem_get_data(xs->umem, addr);
> > - memcpy(buffer, xdp->data, len);
> > + if (xdp_data_meta_unsupported(xdp)) {
>
> Nit: Probably makes sense to wrap with unlikely() since we expect all
> drivers to implement this.
>

I'll send a v2 fixing this...

> > + from_buf = xdp->data;
> > + metalen = 0;
> > + } else {
> > + from_buf = xdp->data_meta;
> > + metalen = xdp->data - xdp->data_meta;
> > + }
> > +
> > + to_buf = xdp_umem_get_data(xs->umem, addr);
> > + memcpy(to_buf, from_buf, len + metalen);
> > + addr += metalen;
> >   err = xskq_produce_batch_desc(xs->rx, addr, len);
> >   if (!err) {
> >   xskq_discard_addr(xs->umem->fq);
> > @@ -111,8 +121,8 @@ void xsk_flush(struct xdp_sock *xs)
> >
> >  int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> >  {
> > - u32 len = xdp->data_end - xdp->data;
> > - void *buffer;
> > + u32 metalen, len = xdp->data_end - xdp->data;
> > + void *to_buf, *from_buf;
> >   u64 addr;
> >   int err;
> >
> > @@ -120,15 +130,24 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct 
> > xdp_buff *xdp)
> >   return -EINVAL;
> >
> >   if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> > - len > xs->umem->chunk_size_nohr) {
> > + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
> >   xs->rx_dropped++;
> >   return -ENOSPC;
> >   }
> >
> >   addr += xs->umem->headroom;
> >
> > - buffer = xdp_umem_get_data(xs->umem, addr);
> > - memcpy(buffer, xdp->data, len);
> > + if (xdp_data_meta_unsupported(xdp)) {
> > + from_buf = xdp->data;
> > + metalen = 0;
>
> Note that this condition should be dead code. netif_receive_generic_xdp()
> sets xdp->data_meta to xdp->data, so all good here and above is never hit.
>

...and this! Thanks for pointing this out!


Björn

> > + } else {
> > + from_buf = xdp->data_meta;
> > + metalen = xdp->data - xdp->data_meta;
> > + }
> > +
> > + to_buf = xdp_umem_get_data(xs->umem, addr);
> > + memcpy(to_buf, from_buf, len + metalen);
> > + addr += metalen;
> >   err = xskq_produce_batch_desc(xs->rx, addr, len);
> >   if (!err) {
> >   xskq_discard_addr(xs->umem->fq);
> >
>


[PATCH 2/2] vti6: do not check for ignore_df in order to update pmtu

2018-08-30 Thread Thadeu Lima de Souza Cascardo
Before commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
and reporting on xmit"), skb was scrubbed before checking for ignore_df. The
scrubbing meant ignore_df was false, making the check irrelevant. Now that the
scrubbing happens after that, some packets might fail the checking and dst
will not have its pmtu updated.

Not only that, but too big skb will be potentially passed down to
__xfrm6_output, causing it to fail to transmit but not free the skb, causing a
leak of skb, and consequentially a leak of dst references.

After running pmtu.sh, that shows as failure to unregister devices in a 
namespace:

[  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage 
count = 1

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 net/ipv6/ip6_vti.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index c72ae3a4fe09..fbd3752ea587 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -481,7 +481,7 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, 
struct flowi *fl)
}
 
mtu = dst_mtu(dst);
-   if (!skb->ignore_df && skb->len > mtu) {
+   if (skb->len > mtu) {
skb_dst_update_pmtu(skb, mtu);
 
if (skb->protocol == htons(ETH_P_IPV6)) {
-- 
2.17.1



[PATCH 1/2] xfrm6: call kfree_skb when skb is toobig

2018-08-30 Thread Thadeu Lima de Souza Cascardo
After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
and reporting on xmit"), some too big skbs might be potentially passed down to
__xfrm6_output, causing it to fail to transmit but not free the skb, causing a
leak of skb, and consequentially a leak of dst references.

After running pmtu.sh, that shows as failure to unregister devices in a 
namespace:

[  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage 
count = 1

The fix is to call kfree_skb in case of transmit failures.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 net/ipv6/xfrm6_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c
index 5959ce9620eb..6a74080005cf 100644
--- a/net/ipv6/xfrm6_output.c
+++ b/net/ipv6/xfrm6_output.c
@@ -170,9 +170,11 @@ static int __xfrm6_output(struct net *net, struct sock 
*sk, struct sk_buff *skb)
 
if (toobig && xfrm6_local_dontfrag(skb)) {
xfrm6_local_rxpmtu(skb, mtu);
+   kfree_skb(skb);
return -EMSGSIZE;
} else if (!skb->ignore_df && toobig && skb->sk) {
xfrm_local_error(skb, mtu);
+   kfree_skb(skb);
return -EMSGSIZE;
}
 
-- 
2.17.1



[PATCH bpf-next v2] xsk: include XDP meta data in AF_XDP frames

2018-08-30 Thread Björn Töpel
From: Björn Töpel 

Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
not include XDP meta data in the data buffers copied out to the user
application.

In this commit, we check if meta data is available, and if so, it is
prepended to the frame.

Signed-off-by: Björn Töpel 
---
 net/xdp/xsk.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4e937cd7c17d..569048e299df 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr);
 
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 {
-   void *buffer;
+   void *to_buf, *from_buf;
+   u32 metalen;
u64 addr;
int err;
 
if (!xskq_peek_addr(xs->umem->fq, &addr) ||
-   len > xs->umem->chunk_size_nohr) {
+   len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
return -ENOSPC;
}
 
addr += xs->umem->headroom;
 
-   buffer = xdp_umem_get_data(xs->umem, addr);
-   memcpy(buffer, xdp->data, len);
+   if (unlikely(xdp_data_meta_unsupported(xdp))) {
+   from_buf = xdp->data;
+   metalen = 0;
+   } else {
+   from_buf = xdp->data_meta;
+   metalen = xdp->data - xdp->data_meta;
+   }
+
+   to_buf = xdp_umem_get_data(xs->umem, addr);
+   memcpy(to_buf, from_buf, len + metalen);
+   addr += metalen;
err = xskq_produce_batch_desc(xs->rx, addr, len);
if (!err) {
xskq_discard_addr(xs->umem->fq);
@@ -111,6 +121,7 @@ void xsk_flush(struct xdp_sock *xs)
 
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
+   u32 metalen = xdp->data - xdp->data_meta;
u32 len = xdp->data_end - xdp->data;
void *buffer;
u64 addr;
@@ -120,7 +131,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff 
*xdp)
return -EINVAL;
 
if (!xskq_peek_addr(xs->umem->fq, &addr) ||
-   len > xs->umem->chunk_size_nohr) {
+   len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
return -ENOSPC;
}
@@ -128,7 +139,8 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff 
*xdp)
addr += xs->umem->headroom;
 
buffer = xdp_umem_get_data(xs->umem, addr);
-   memcpy(buffer, xdp->data, len);
+   memcpy(buffer, xdp->data_meta, len + metalen);
+   addr += metalen;
err = xskq_produce_batch_desc(xs->rx, addr, len);
if (!err) {
xskq_discard_addr(xs->umem->fq);
-- 
2.17.1



Re: [PATCH 1/2] xfrm6: call kfree_skb when skb is toobig

2018-08-30 Thread Sabrina Dubroca
2018-08-30, 09:58:16 -0300, Thadeu Lima de Souza Cascardo wrote:
> After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
> and reporting on xmit"), some too big skbs might be potentially passed down to
> __xfrm6_output, causing it to fail to transmit but not free the skb, causing a
> leak of skb, and consequentially a leak of dst references.
> 
> After running pmtu.sh, that shows as failure to unregister devices in a 
> namespace:
> 
> [  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage 
> count = 1
> 
> The fix is to call kfree_skb in case of transmit failures.
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo 

Reviewed-by: Sabrina Dubroca 

I was about to post the same patch. Arguably, the commit introducing
this bug is the one that added those "return -EMSGSIZE" to
__xfrm6_output without freeing.

Either way, it's missing a Fixes: tag, which should be one of those,
or both:

Fixes: d6990976af7c ("vti6: fix PMTU caching and reporting on xmit")
Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error")

-- 
Sabrina


Re: [PATCH bpf-next v2] xsk: include XDP meta data in AF_XDP frames

2018-08-30 Thread Daniel Borkmann
On 08/30/2018 03:12 PM, Björn Töpel wrote:
> From: Björn Töpel 
> 
> Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
> not include XDP meta data in the data buffers copied out to the user
> application.
> 
> In this commit, we check if meta data is available, and if so, it is
> prepended to the frame.
> 
> Signed-off-by: Björn Töpel 

Applied to bpf-next, thanks Björn!


Re: [PATCH 00/15] soc: octeontx2: Add RVU admin function driver

2018-08-30 Thread Andrew Lunn
> > > My feeling overall is that we need a review from the network driver
> > > folks more than the arm-soc team etc, and that maybe the driver
> > > as a whole should go into drivers/net/ethernet.
> >
> > This driver doesn't handle any network IO and moreever this driver has to 
> > handle
> > configuration requests from crypto driver as well. There will be
> > separate network and
> > crypto drivers which will be upstreamed into drivers/net/ethernet and
> > drivers/crypto.
> > And in future silicons there will be different types of functional
> > blocks which will be
> > added into this resource virtualization unit (RVU). Hence i thought
> > this driver is not a
> > right fit in drivers/net/ethernet.

Hi Sunil

Do you have a git branch for everything? I would like to look at the
actual Ethernet driver, and the full API this driver exports to other
drivers.

I think there real question here is, do you have split between this
driver and the actual device drivers in the right place? For me, link
up/down detection should be in the Ethernet driver, since it is not
shared with the crypto driver.

   Thanks
Andrew


[PATCH bpf-next 0/3] xsk: misc code cleanup

2018-08-30 Thread Magnus Karlsson
This patch set cleans up two code style issues with the xsk zero-copy
code. The resulting code is smaller and simpler.

Patch 1: Removes a potential compiler warning reported by the Intel
 0-DAY kernel test infrastructure.
Patches 2-3: Removes the xdp_umem_props structure. At some point, it
 was used to break a dependency, but the members are these
 days much better off in the xdp_umem since the dependency
 does not exist anymore.

I based this patch set on bpf-next commit 234dbe3dc1db ("Merge branch
'verifier-liveness-simplification'")

Thanks: Magnus

Magnus Karlsson (3):
  i40e: fix possible compiler warning in xsk TX path
  xsk: get rid of useless struct xdp_umem_props
  i40e: adapt driver to new xdp_umem structure

 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 10 --
 include/net/xdp_sock.h |  8 ++--
 net/xdp/xdp_umem.c |  4 ++--
 net/xdp/xdp_umem_props.h   | 14 --
 net/xdp/xsk.c  | 10 ++
 net/xdp/xsk_queue.c|  5 +++--
 net/xdp/xsk_queue.h| 13 +++--
 7 files changed, 24 insertions(+), 40 deletions(-)
 delete mode 100644 net/xdp/xdp_umem_props.h

--
2.7.4


[PATCH bpf-next 2/3] xsk: get rid of useless struct xdp_umem_props

2018-08-30 Thread Magnus Karlsson
This commit gets rid of the structure xdp_umem_props. It was there to
be able to break a dependency at one point, but this is no longer
needed. The values in the struct are instead stored directly in the
xdp_umem structure. This simplifies the xsk code as well as af_xdp
zero-copy drivers and as a bonus gets rid of one internal header file.

Signed-off-by: Magnus Karlsson 
---
 include/net/xdp_sock.h   |  8 ++--
 net/xdp/xdp_umem.c   |  4 ++--
 net/xdp/xdp_umem_props.h | 14 --
 net/xdp/xsk.c| 10 ++
 net/xdp/xsk_queue.c  |  5 +++--
 net/xdp/xsk_queue.h  | 13 +++--
 6 files changed, 20 insertions(+), 34 deletions(-)
 delete mode 100644 net/xdp/xdp_umem_props.h

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 56994ad..932ca0d 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -16,11 +16,6 @@
 struct net_device;
 struct xsk_queue;
 
-struct xdp_umem_props {
-   u64 chunk_mask;
-   u64 size;
-};
-
 struct xdp_umem_page {
void *addr;
dma_addr_t dma;
@@ -30,7 +25,8 @@ struct xdp_umem {
struct xsk_queue *fq;
struct xsk_queue *cq;
struct xdp_umem_page *pages;
-   struct xdp_umem_props props;
+   u64 chunk_mask;
+   u64 size;
u32 headroom;
u32 chunk_size_nohr;
struct user_struct *user;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index bfe2dbe..2471614 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -314,8 +314,8 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct 
xdp_umem_reg *mr)
 
umem->pid = get_task_pid(current, PIDTYPE_PID);
umem->address = (unsigned long)addr;
-   umem->props.chunk_mask = ~((u64)chunk_size - 1);
-   umem->props.size = size;
+   umem->chunk_mask = ~((u64)chunk_size - 1);
+   umem->size = size;
umem->headroom = headroom;
umem->chunk_size_nohr = chunk_size - headroom;
umem->npgs = size / PAGE_SIZE;
diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h
deleted file mode 100644
index 40eab10..000
--- a/net/xdp/xdp_umem_props.h
+++ /dev/null
@@ -1,14 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/* XDP user-space packet buffer
- * Copyright(c) 2018 Intel Corporation.
- */
-
-#ifndef XDP_UMEM_PROPS_H_
-#define XDP_UMEM_PROPS_H_
-
-struct xdp_umem_props {
-   u64 chunk_mask;
-   u64 size;
-};
-
-#endif /* XDP_UMEM_PROPS_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4e937cd..acbe5b5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -458,8 +458,10 @@ static int xsk_bind(struct socket *sock, struct sockaddr 
*addr, int addr_len)
goto out_unlock;
} else {
/* This xsk has its own umem. */
-   xskq_set_umem(xs->umem->fq, &xs->umem->props);
-   xskq_set_umem(xs->umem->cq, &xs->umem->props);
+   xskq_set_umem(xs->umem->fq, xs->umem->size,
+ xs->umem->chunk_mask);
+   xskq_set_umem(xs->umem->cq, xs->umem->size,
+ xs->umem->chunk_mask);
 
err = xdp_umem_assign_dev(xs->umem, dev, qid, flags);
if (err)
@@ -469,8 +471,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr 
*addr, int addr_len)
xs->dev = dev;
xs->zc = xs->umem->zc;
xs->queue_id = qid;
-   xskq_set_umem(xs->rx, &xs->umem->props);
-   xskq_set_umem(xs->tx, &xs->umem->props);
+   xskq_set_umem(xs->rx, xs->umem->size, xs->umem->chunk_mask);
+   xskq_set_umem(xs->tx, xs->umem->size, xs->umem->chunk_mask);
xdp_add_sk_umem(xs->umem, xs);
 
 out_unlock:
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 6c32e92..2dc1384d 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -7,12 +7,13 @@
 
 #include "xsk_queue.h"
 
-void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props)
+void xskq_set_umem(struct xsk_queue *q, u64 size, u64 chunk_mask)
 {
if (!q)
return;
 
-   q->umem_props = *umem_props;
+   q->size = size;
+   q->chunk_mask = chunk_mask;
 }
 
 static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 8a64b15..82252cc 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -31,7 +31,8 @@ struct xdp_umem_ring {
 };
 
 struct xsk_queue {
-   struct xdp_umem_props umem_props;
+   u64 chunk_mask;
+   u64 size;
u32 ring_mask;
u32 nentries;
u32 prod_head;
@@ -78,7 +79,7 @@ static inline u32 xskq_nb_free(struct xsk_queue *q, u32 
producer, u32 dcnt)
 
 static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
 {
-   if (addr >= q->umem_props.size) {
+   if (addr >= q->size) {
q->invalid_descs++;
return false;
}
@@ -92,7 +93,7 @@ static inline u64 *xskq_validate_addr(struct xsk_queue *q, 
u64 *addr)
str

[PATCH bpf-next 1/3] i40e: fix possible compiler warning in xsk TX path

2018-08-30 Thread Magnus Karlsson
With certain gcc versions, it was possible to get the warning
"'tx_desc' may be used uninitialized in this function" for the
i40e_xmit_zc. This was not possible, however this commit simplifies
the code path so that this warning is no longer emitted.

Signed-off-by: Magnus Karlsson 
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 94947a8..41ca7e1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -668,9 +668,8 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int 
budget)
  **/
 static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 {
-   unsigned int total_packets = 0;
+   struct i40e_tx_desc *tx_desc = NULL;
struct i40e_tx_buffer *tx_bi;
-   struct i40e_tx_desc *tx_desc;
bool work_done = true;
dma_addr_t dma;
u32 len;
@@ -697,14 +696,13 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, 
unsigned int budget)
build_ctob(I40E_TX_DESC_CMD_ICRC
   | I40E_TX_DESC_CMD_EOP,
   0, len, 0);
-   total_packets++;
 
xdp_ring->next_to_use++;
if (xdp_ring->next_to_use == xdp_ring->count)
xdp_ring->next_to_use = 0;
}
 
-   if (total_packets > 0) {
+   if (tx_desc) {
/* Request an interrupt for the last frame and bump tail ptr. */
tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS <<
 I40E_TXD_QW1_CMD_SHIFT);
-- 
2.7.4



[PATCH bpf-next 3/3] i40e: adapt driver to new xdp_umem structure

2018-08-30 Thread Magnus Karlsson
The struct xdp_umem_props was removed in the xsk code and this commit
adapts the i40e af_xdp zero-copy driver code to the new xdp_umem
structure.

Signed-off-by: Magnus Karlsson 
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 41ca7e1..2ebfc78 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -442,7 +442,7 @@ static void i40e_reuse_rx_buffer_zc(struct i40e_ring 
*rx_ring,
struct i40e_rx_buffer *old_bi)
 {
struct i40e_rx_buffer *new_bi = &rx_ring->rx_bi[rx_ring->next_to_alloc];
-   unsigned long mask = (unsigned long)rx_ring->xsk_umem->props.chunk_mask;
+   unsigned long mask = (unsigned long)rx_ring->xsk_umem->chunk_mask;
u64 hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
u16 nta = rx_ring->next_to_alloc;
 
@@ -477,7 +477,7 @@ void i40e_zca_free(struct zero_copy_allocator *alloc, 
unsigned long handle)
 
rx_ring = container_of(alloc, struct i40e_ring, zca);
hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
-   mask = rx_ring->xsk_umem->props.chunk_mask;
+   mask = rx_ring->xsk_umem->chunk_mask;
 
nta = rx_ring->next_to_alloc;
bi = &rx_ring->rx_bi[nta];
-- 
2.7.4



[PATCH net 1/2] selftests: pmtu: maximum MTU for vti4 is 2^16-1-20

2018-08-30 Thread Sabrina Dubroca
Since commit 82612de1c98e ("ip_tunnel: restore binding to ifaces with a
large mtu"), the maximum MTU for vti4 is based on IP_MAX_MTU instead of
the mysterious constant 0xFFF8.  This makes this selftest fail.

Fixes: 82612de1c98e ("ip_tunnel: restore binding to ifaces with a large mtu")
Signed-off-by: Sabrina Dubroca 
Acked-by: Stefano Brivio 
---
 tools/testing/selftests/net/pmtu.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index f8cc38afffa2..0ecf2609b9a4 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -334,7 +334,7 @@ test_pmtu_vti4_link_add_mtu() {
fail=0
 
min=68
-   max=$((65528 - 20))
+   max=$((65535 - 20))
# Check invalid values first
for v in $((min - 1)) $((max + 1)); do
${ns_a} ip link add vti4_a mtu ${v} type vti local 
${veth4_a_addr} remote ${veth4_b_addr} key 10 2>/dev/null
-- 
2.18.0



[PATCH net 2/2] selftests: pmtu: detect correct binary to ping ipv6 addresses

2018-08-30 Thread Sabrina Dubroca
Some systems don't have the ping6 binary anymore, and use ping for
everything. Detect the absence of ping6 and try to use ping instead.

Fixes: d1f1b9cbf34c ("selftests: net: Introduce first PMTU test")
Signed-off-by: Sabrina Dubroca 
Acked-by: Stefano Brivio 
---
 tools/testing/selftests/net/pmtu.sh | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index 0ecf2609b9a4..cc2798a0a2d7 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -46,6 +46,9 @@
 # Kselftest framework requirement - SKIP code is 4.
 ksft_skip=4
 
+# Some systems don't have a ping6 binary anymore
+which ping6 > /dev/null 2>&1 && ping6=$(which ping6) || ping6=$(which ping)
+
 tests="
pmtu_vti6_exception vti6: PMTU exceptions
pmtu_vti4_exception vti4: PMTU exceptions
@@ -274,7 +277,7 @@ test_pmtu_vti6_exception() {
mtu "${ns_b}" veth_b 4000
mtu "${ns_a}" vti6_a 5000
mtu "${ns_b}" vti6_b 5000
-   ${ns_a} ping6 -q -i 0.1 -w 2 -s 6 ${vti6_b_addr} > /dev/null
+   ${ns_a} ${ping6} -q -i 0.1 -w 2 -s 6 ${vti6_b_addr} > /dev/null
 
# Check that exception was created
if [ "$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti6_b_addr})" = 
"" ]; then
-- 
2.18.0



[PATCH net-next] net: stmmac: Add CBS support in XGMAC2

2018-08-30 Thread Jose Abreu
XGMAC2 uses the same CBS mechanism as GMAC5, only registers offset
changes. Lets use the same TC callbacks and implement the .config_cbs
callback in XGMAC2 core.

Signed-off-by: Jose Abreu 
Cc: David S. Miller 
Cc: Joao Pinto 
Cc: Giuseppe Cavallaro 
Cc: Alexandre Torgue 
---
 drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h  | 12 
 drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c | 19 ++-
 drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c  | 19 +++
 drivers/net/ethernet/stmicro/stmmac/hwif.c  |  2 +-
 4 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h 
b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h
index 0a80fa25afe3..d6bb953685fa 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h
@@ -119,11 +119,23 @@
 #define XGMAC_MTL_TXQ_OPMODE(x)(0x1100 + (0x80 * (x)))
 #define XGMAC_TQS  GENMASK(25, 16)
 #define XGMAC_TQS_SHIFT16
+#define XGMAC_Q2TCMAP  GENMASK(10, 8)
+#define XGMAC_Q2TCMAP_SHIFT8
 #define XGMAC_TTC  GENMASK(6, 4)
 #define XGMAC_TTC_SHIFT4
 #define XGMAC_TXQENGENMASK(3, 2)
 #define XGMAC_TXQEN_SHIFT  2
 #define XGMAC_TSF  BIT(1)
+#define XGMAC_MTL_TCx_ETS_CONTROL(x)   (0x1110 + (0x80 * (x)))
+#define XGMAC_MTL_TCx_QUANTUM_WEIGHT(x)(0x1118 + (0x80 * (x)))
+#define XGMAC_MTL_TCx_SENDSLOPE(x) (0x111c + (0x80 * (x)))
+#define XGMAC_MTL_TCx_HICREDIT(x)  (0x1120 + (0x80 * (x)))
+#define XGMAC_MTL_TCx_LOCREDIT(x)  (0x1124 + (0x80 * (x)))
+#define XGMAC_CC   BIT(3)
+#define XGMAC_TSA  GENMASK(1, 0)
+#define XGMAC_SP   (0x0 << 0)
+#define XGMAC_CBS  (0x1 << 0)
+#define XGMAC_ETS  (0x2 << 0)
 #define XGMAC_MTL_RXQ_OPMODE(x)(0x1140 + (0x80 * (x)))
 #define XGMAC_RQS  GENMASK(25, 16)
 #define XGMAC_RQS_SHIFT16
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c 
b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c
index d182f82f7b58..64b8cb88ea45 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c
@@ -177,6 +177,23 @@ static void dwxgmac2_map_mtl_to_dma(struct mac_device_info 
*hw, u32 queue,
writel(value, ioaddr + reg);
 }
 
+static void dwxgmac2_config_cbs(struct mac_device_info *hw,
+   u32 send_slope, u32 idle_slope,
+   u32 high_credit, u32 low_credit, u32 queue)
+{
+   void __iomem *ioaddr = hw->pcsr;
+   u32 value;
+
+   writel(send_slope, ioaddr + XGMAC_MTL_TCx_SENDSLOPE(queue));
+   writel(idle_slope, ioaddr + XGMAC_MTL_TCx_QUANTUM_WEIGHT(queue));
+   writel(high_credit, ioaddr + XGMAC_MTL_TCx_HICREDIT(queue));
+   writel(low_credit, ioaddr + XGMAC_MTL_TCx_LOCREDIT(queue));
+
+   value = readl(ioaddr + XGMAC_MTL_TCx_ETS_CONTROL(queue));
+   value |= XGMAC_CC | XGMAC_CBS;
+   writel(value, ioaddr + XGMAC_MTL_TCx_ETS_CONTROL(queue));
+}
+
 static int dwxgmac2_host_irq_status(struct mac_device_info *hw,
struct stmmac_extra_stats *x)
 {
@@ -316,7 +333,7 @@ const struct stmmac_ops dwxgmac210_ops = {
.prog_mtl_tx_algorithms = dwxgmac2_prog_mtl_tx_algorithms,
.set_mtl_tx_queue_weight = NULL,
.map_mtl_to_dma = dwxgmac2_map_mtl_to_dma,
-   .config_cbs = NULL,
+   .config_cbs = dwxgmac2_config_cbs,
.dump_regs = NULL,
.host_irq_status = dwxgmac2_host_irq_status,
.host_mtl_irq_status = dwxgmac2_host_mtl_irq_status,
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c 
b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
index 20909036e002..6c5092e7771c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c
@@ -182,6 +182,9 @@ static void dwxgmac2_dma_tx_mode(void __iomem *ioaddr, int 
mode,
value |= 0x7 << XGMAC_TTC_SHIFT;
}
 
+   /* Use static TC to Queue mapping */
+   value |= (channel << XGMAC_Q2TCMAP_SHIFT) & XGMAC_Q2TCMAP;
+
value &= ~XGMAC_TXQEN;
if (qmode != MTL_QUEUE_AVB)
value |= 0x2 << XGMAC_TXQEN_SHIFT;
@@ -374,6 +377,21 @@ static void dwxgmac2_enable_tso(void __iomem *ioaddr, bool 
en, u32 chan)
writel(value, ioaddr + XGMAC_DMA_CH_TX_CONTROL(chan));
 }
 
+static void dwxgmac2_qmode(void __iomem *ioaddr, u32 channel, u8 qmode)
+{
+   u32 value = readl(ioaddr + XGMAC_MTL_TXQ_OPMODE(channel));
+
+   value &= ~XGMAC_TXQEN;
+   if (qmode != MTL_QUEUE_AVB) {
+   value |= 0x2 << XGMAC_TXQEN_SHIFT

[PATCH net-next] net/sched: fix type of htb statistics

2018-08-30 Thread Florent Fourcot
tokens and ctokens are defined as s64 in htb_class structure,
and clamped to 32bits value during netlink dumps:

cl->xstats.tokens = clamp_t(s64, PSCHED_NS2TICKS(cl->tokens),
INT_MIN, INT_MAX);

Defining it as u32 is working since userspace (tc) is printing it as
signed int, but a correct definition from the beginning is probably
better.

In the same time, 'giants' structure member is unused since years, so
update the comment to mark it unused.

Signed-off-by: Florent Fourcot 
---
 include/uapi/linux/pkt_sched.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 8975fd1a1421..e9b7244ac381 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -395,9 +395,9 @@ enum {
 struct tc_htb_xstats {
__u32 lends;
__u32 borrows;
-   __u32 giants;   /* too big packets (rate will not be accurate) */
-   __u32 tokens;
-   __u32 ctokens;
+   __u32 giants;   /* unused since 'Make HTB scheduler work with TSO.' */
+   __s32 tokens;
+   __s32 ctokens;
 };
 
 /* HFSC section */
-- 
2.11.0



[PATCH iproute2] tc/htb: remove unused variable

2018-08-30 Thread Florent Fourcot
Since introduction of htb module, this variable has never been used.

Signed-off-by: Florent Fourcot 
---
 tc/q_htb.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/tc/q_htb.c b/tc/q_htb.c
index b93d31d4..c8b2941d 100644
--- a/tc/q_htb.c
+++ b/tc/q_htb.c
@@ -109,7 +109,6 @@ static int htb_parse_opt(struct qdisc_util *qu, int argc,
 
 static int htb_parse_class_opt(struct qdisc_util *qu, int argc, char **argv, 
struct nlmsghdr *n, const char *dev)
 {
-   int ok = 0;
struct tc_htb_opt opt = {};
__u32 rtab[256], ctab[256];
unsigned buffer = 0, cbuffer = 0;
@@ -127,7 +126,6 @@ static int htb_parse_class_opt(struct qdisc_util *qu, int 
argc, char **argv, str
if (get_u32(&opt.prio, *argv, 10)) {
explain1("prio"); return -1;
}
-   ok++;
} else if (matches(*argv, "mtu") == 0) {
NEXT_ARG();
if (get_u32(&mtu, *argv, 10)) {
@@ -161,7 +159,6 @@ static int htb_parse_class_opt(struct qdisc_util *qu, int 
argc, char **argv, str
explain1("buffer");
return -1;
}
-   ok++;
} else if (matches(*argv, "cburst") == 0 ||
   strcmp(*argv, "cbuffer") == 0 ||
   strcmp(*argv, "cmaxburst") == 0) {
@@ -170,7 +167,6 @@ static int htb_parse_class_opt(struct qdisc_util *qu, int 
argc, char **argv, str
explain1("cbuffer");
return -1;
}
-   ok++;
} else if (strcmp(*argv, "ceil") == 0) {
NEXT_ARG();
if (ceil64) {
@@ -186,7 +182,6 @@ static int htb_parse_class_opt(struct qdisc_util *qu, int 
argc, char **argv, str
explain1("ceil");
return -1;
}
-   ok++;
} else if (strcmp(*argv, "rate") == 0) {
NEXT_ARG();
if (rate64) {
@@ -202,7 +197,6 @@ static int htb_parse_class_opt(struct qdisc_util *qu, int 
argc, char **argv, str
explain1("rate");
return -1;
}
-   ok++;
} else if (strcmp(*argv, "help") == 0) {
explain();
return -1;
@@ -214,9 +208,6 @@ static int htb_parse_class_opt(struct qdisc_util *qu, int 
argc, char **argv, str
argc--; argv++;
}
 
-   /*  if (!ok)
-   return 0;*/
-
if (!rate64) {
fprintf(stderr, "\"rate\" is required.\n");
return -1;
-- 
2.11.0



Re: [iproute PATCH] iprule: Fix for incorrect space between dst and prefix

2018-08-30 Thread Stephen Hemminger
On Wed, 29 Aug 2018 15:52:57 +0200
Phil Sutter  wrote:

> This was added by accident when introducing JSON support.
> 
> Fixes: 0dd4ccc56c0e3 ("iprule: add json support")
> Signed-off-by: Phil Sutter 


Applied, thanks.


Re: [Patch iproute2 v2] ss: add UNIX_DIAG_VFS and UNIX_DIAG_ICONS for unix sockets

2018-08-30 Thread Stephen Hemminger
On Wed, 29 Aug 2018 10:09:27 -0700
Cong Wang  wrote:

> UNIX_DIAG_VFS and UNIX_DIAG_ICONS are never used by ss,
> make them available in ss -e output.
> 
> Cc: Stephen Hemminger 
> Signed-off-by: Cong Wang 

Applied, thanks


Re: [PATCHv3 iproute2 0/2] clang + misc changes

2018-08-30 Thread Stephen Hemminger
On Wed, 22 Aug 2018 18:01:30 -0700
Mahesh Bandewar  wrote:

> From: Mahesh Bandewar 
> 
> The primary theme is to make clang compile the iproute2 package without
> warnings. Along with this there are two other misc patches in the series.
> 
> First patch uses the preferred_family when operating with maddr feature.
> Prior to this patch, it would always open an AF_INET socket irrespective
> of the family that is preferred via command-line. 
> 
> Second patch mostly adds format attributes to make the c-lang compiler
> happy and not throw the warning messages.
> 
> Mahesh Bandewar (2):
>   ipmaddr: use preferred_family when given
>   iproute: make clang happy with iproute2 package
> 
>  include/json_writer.h |  3 +--
>  ip/iplink_can.c   | 19 ---
>  ip/ipmaddr.c  | 13 -
>  lib/color.c   |  1 +
>  lib/json_print.c  |  1 +
>  lib/json_writer.c | 15 +--
>  misc/ss.c |  3 ++-
>  tc/m_ematch.c |  1 +
>  tc/m_ematch.h |  1 +
>  9 files changed, 32 insertions(+), 25 deletions(-)
> 
> -- 
> 2.18.0.1017.ga543ac7ca45-goog
> 

Applied, thanks


Re: [PATCH iproute2] tc/htb: remove unused variable

2018-08-30 Thread Stephen Hemminger
On Thu, 30 Aug 2018 16:38:54 +0200
Florent Fourcot  wrote:

> Since introduction of htb module, this variable has never been used.
> 
> Signed-off-by: Florent Fourcot 

Looks good. Applied



Re: [PATCH net] tcp: do not restart timewait timer on rst reception

2018-08-30 Thread Eric Dumazet



On 08/30/2018 05:24 AM, Florian Westphal wrote:
> RFC 1337 says:
>  ''Ignore RST segments in TIME-WAIT state.
>If the 2 minute MSL is enforced, this fix avoids all three hazards.''
> 
> So with net.ipv4.tcp_rfc1337=1, expected behaviour is to have TIME-WAIT sk
> expire rather than removing it instantly when a reset is received.
> 
> However, Linux will also re-start the TIME-WAIT timer.
> 
> This causes connect to fail when tying to re-use ports or very long
> delays (until syn retry interval exceeds MSL).
>


> Reported-by: Michal Tesar 
> Signed-off-by: Florian Westphal 
> ---

SGTM, thanks.

Signed-off-by: Eric Dumazet 



[PATCH v3 bpf-next 2/2] new sample bpf prog

2018-08-30 Thread Nikita V. Shirokov
sample program which shows TCP_SAVE_SYN/TCP_SAVED_SYN usage example:
bpf's program which is doing TOS/TCLASS reflection (server would reply
with a same TOS/TCLASS as client)

Signed-off-by: Nikita V. Shirokov 
---
 samples/bpf/Makefile   |  1 +
 samples/bpf/tcp_tos_reflect_kern.c | 87 ++
 2 files changed, 88 insertions(+)
 create mode 100644 samples/bpf/tcp_tos_reflect_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 36f9f41d094b..be0a961450bc 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -153,6 +153,7 @@ always += tcp_cong_kern.o
 always += tcp_iw_kern.o
 always += tcp_clamp_kern.o
 always += tcp_basertt_kern.o
+always += tcp_tos_reflect_kern.o
 always += xdp_redirect_kern.o
 always += xdp_redirect_map_kern.o
 always += xdp_redirect_cpu_kern.o
diff --git a/samples/bpf/tcp_tos_reflect_kern.c 
b/samples/bpf/tcp_tos_reflect_kern.c
new file mode 100644
index ..d51dab19eca6
--- /dev/null
+++ b/samples/bpf/tcp_tos_reflect_kern.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018 Facebook
+ *
+ * BPF program to automatically reflect TOS option from received syn packet
+ *
+ * Use load_sock_ops to load this BPF program.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define DEBUG 1
+
+#define bpf_printk(fmt, ...)   \
+({ \
+  char fmt[] = fmt;\
+  bpf_trace_printk(fmt, sizeof(fmt),   \
+   ##__VA_ARGS__); \
+})
+
+SEC("sockops")
+int bpf_basertt(struct bpf_sock_ops *skops)
+{
+   char header[sizeof(struct ipv6hdr)];
+   struct ipv6hdr *hdr6;
+   struct iphdr *hdr;
+   int hdr_size = 0;
+   int save_syn = 1;
+   int tos = 0;
+   int rv = 0;
+   int op;
+
+   op = (int) skops->op;
+
+#ifdef DEBUG
+   bpf_printk("BPF command: %d\n", op);
+#endif
+   switch (op) {
+   case BPF_SOCK_OPS_TCP_LISTEN_CB:
+   rv = bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN,
+  &save_syn, sizeof(save_syn));
+   break;
+   case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+   if (skops->family == AF_INET)
+   hdr_size = sizeof(struct iphdr);
+   else
+   hdr_size = sizeof(struct ipv6hdr);
+   rv = bpf_getsockopt(skops, SOL_TCP, TCP_SAVED_SYN,
+   header, hdr_size);
+   if (!rv) {
+   if (skops->family == AF_INET) {
+   hdr = (struct iphdr *) header;
+   tos = hdr->tos;
+   if (tos != 0)
+   bpf_setsockopt(skops, SOL_IP, IP_TOS,
+  &tos, sizeof(tos));
+   } else {
+   hdr6 = (struct ipv6hdr *) header;
+   tos = ((hdr6->priority) << 4 |
+  (hdr6->flow_lbl[0]) >>  4);
+   if (tos)
+   bpf_setsockopt(skops, SOL_IPV6,
+  IPV6_TCLASS,
+  &tos, sizeof(tos));
+   }
+   rv = 0;
+   }
+   break;
+   default:
+   rv = -1;
+   }
+#ifdef DEBUG
+   bpf_printk("Returning %d\n", rv);
+#endif
+   skops->reply = rv;
+   return 1;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.17.1



[PATCH v3 bpf-next 1/2] new options for bpf_(set|get)sockopt

2018-08-30 Thread Nikita V. Shirokov
adding support for two new bpf's get/set sockopts: TCP_SAVE_SYN (set)
and TCP_SAVED_SYN (get). this would allow for bpf program to build
logic based on data from ingress SYN packet

Signed-off-by: Nikita V. Shirokov 
---
 net/core/filter.c | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index c25eb36f1320..feb578506009 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4007,6 +4007,12 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
tp->snd_ssthresh = val;
}
break;
+   case TCP_SAVE_SYN:
+   if (val < 0 || val > 1)
+   ret = -EINVAL;
+   else
+   tp->save_syn = val;
+   break;
default:
ret = -EINVAL;
}
@@ -4032,21 +4038,32 @@ static const struct bpf_func_proto bpf_setsockopt_proto 
= {
 BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
   int, level, int, optname, char *, optval, int, optlen)
 {
+   struct inet_connection_sock *icsk;
struct sock *sk = bpf_sock->sk;
+   struct tcp_sock *tp;
 
if (!sk_fullsock(sk))
goto err_clear;
-
 #ifdef CONFIG_INET
if (level == SOL_TCP && sk->sk_prot->getsockopt == tcp_getsockopt) {
-   if (optname == TCP_CONGESTION) {
-   struct inet_connection_sock *icsk = inet_csk(sk);
+   switch (optname) {
+   case TCP_CONGESTION:
+   icsk = inet_csk(sk);
 
if (!icsk->icsk_ca_ops || optlen <= 1)
goto err_clear;
strncpy(optval, icsk->icsk_ca_ops->name, optlen);
optval[optlen - 1] = 0;
-   } else {
+   break;
+   case TCP_SAVED_SYN:
+   tp = tcp_sk(sk);
+
+   if (optlen <= 0 || !tp->saved_syn ||
+   optlen > tp->saved_syn[0])
+   goto err_clear;
+   memcpy(optval, tp->saved_syn + 1, optlen);
+   break;
+   default:
goto err_clear;
}
} else if (level == SOL_IP) {
-- 
2.17.1



[PATCH v3 bpf-next 0/2] bpf tcp save syn set/get sockoptions

2018-08-30 Thread Nikita V. Shirokov


adding supprot for two new bpf's tcp sockopts:
TCP_SAVE_SYN (set) and TCP_SAVED_SYN (get)
this would allow for tcp-bpf program to build some logic based on fields from
ingress syn packet (e.g. doing tcp's tos/tclass reflection (see sample prog))
and do it transparently from userspace program point of view

v2->v3:
 - make patch series public
v1->v2:
 - adding proper SPDX license

Nikita V. Shirokov (2):
  new options for bpf_(set|get)sockopt
  new sample bpf prog

 net/core/filter.c  | 25 +++--
 samples/bpf/Makefile   |  1 +
 samples/bpf/tcp_tos_reflect_kern.c | 87 ++
 3 files changed, 109 insertions(+), 4 deletions(-)
 create mode 100644 samples/bpf/tcp_tos_reflect_kern.c

-- 
2.17.1



Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31

2018-08-30 Thread Alexander Duyck
I'm dropping all the old comments since the conversation was flattened
and only has one level of marks for everything.

On Thu, Aug 30, 2018 at 7:43 AM Alex Vesker  wrote:



> To which devlink interfaces are you referring?

All of them. Not just the ones in this patch. If you are exposing an
interface to the user you should have documentation for it somewhere.
You should probably look at adding a patch to make certain you have
all the existing devlink interfaces in the driver documented.

I would like to see something added to the documentation folder that
explains what all the DEVLINK_PARAM_GENERIC interfaces are expected to
do, and maybe why I would use them. Then in addition I would like to
see per-driver documentation added for the DEVLINK_PARAM_DRIVER calls.
So for example I can't find any documentation in the kernel on what
enable_64b_cqe_eqe or enable_4k_uar do in mlx4 or why I would need
them, but you have them exposed as interfaces to userspace.

> There are 3 patches here that provide the crdump capability,
> these are the patches I would like to resubmit.
>
> net/mlx5: Add Vendor Specific Capability access gateway:
> This is needed to read from the VSC by only the driver to collect a dump

You should probably work with the linux-pci mailing list on this bit
since you are exposing a new capability and they can probably point
you in the direction of how they want to deal with any potential races
in terms of access to the device versus your capability which you are
adding support for dumping via devlink.

> net/mlx5: Add Crdump FW snapshot support
> This is code that collects the dump and registers a region called crdump
> net/mlx5: Use devlink region_snapshot parameter
> Here I use an already implemented global param that specifies whether
> snapshots are supported.
>
> The devlink region feature is well documented.

Where?

> can it be that you referring to devlink region called "crdump" which mlx5 
> exposes?

I don't care about the internals. I care about user available
documentation for the interface that is exposed. How do you expect the
user to use this functionality? That is what I want documented.



> Will it be sufficient to prevent setcpi access using "pci_cfg_access_lock -
> any userspace reads or writes to config space and concurrent lock requests 
> will sleep"
> otherwise do you have a different solution?

That sounds like a step in the right direction, but that is something
you should work with the linux-pci list on. My main concern is that I
don't want us being able to come at this interface from multiple
directions and screw things up.


Re: WARNING in handle_irq (3)

2018-08-30 Thread Dmitry Vyukov
On Thu, Aug 30, 2018 at 8:31 AM, syzbot
 wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:58c3f14f86c9 Merge tag 'riscv-for-linus-4.19-rc2' of git:/..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=10be176a40
> kernel config:  https://syzkaller.appspot.com/x/.config?x=531a917630d2a492
> dashboard link: https://syzkaller.appspot.com/bug?extid=a58b558e3e62d0604e5c
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.

+bpf maintainers

Looks suspiciously similar to:
https://groups.google.com/d/msg/syzkaller-bugs/4v7MtbIT1hY/A87hInzyAwAJ

Note this commit seems to already have "bpf, sockmap: fix
sock_hash_alloc and reject zero-sized keys ".

Tentative reproducer from the log is:

14:08:59 executing program 5:
socketpair(0x2, 0x0, 0x0, &(0x7f000140))
r0 = socket$inet6_tcp(0xa, 0x1, 0x0)
r1 = socket$inet6_tcp(0xa, 0x1, 0x0)
bind$inet6(r1, &(0x7fc0)={0xa, 0x4e22}, 0x1c)
listen(r1, 0x0)
sendto$inet6(r0, &(0x7f000140), 0x2d6, 0x2004,
&(0x7f80)={0xa, 0x10004e22, 0x0, @loopback}, 0x1c)
setsockopt$inet6_tcp_TCP_ULP(r0, 0x6, 0x1f, &(0x7f80)='tls\x00', 0x152)
r2 = bpf$MAP_CREATE(0x0, &(0x7f000280)={0xf, 0x4, 0x4, 0x70}, 0x2c)
bpf$MAP_UPDATE_ELEM(0x2, &(0x7f000180)={r2, &(0x7f00),
&(0x7f000140)}, 0x20)

Which does not create a 0-key map.



> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+a58b558e3e62d0604...@syzkaller.appspotmail.com
>
> TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending
> cookies.  Check SNMP counters.
> [ cut here ]
> do_IRQ(): syz-executor5 has overflown the kernel stack
> (cur:88018aec,sp:88018aeb0e18,irq stk
> top-bottom:8801db80-8801db008000,exception stk
> top-bottom:fe007080-fe011000,ip:lock_is_held_type+0x18b/0x210)
> WARNING: CPU: 0 PID: 13805 at arch/x86/kernel/irq_64.c:64
> stack_overflow_check arch/x86/kernel/irq_64.c:61 [inline]
> WARNING: CPU: 0 PID: 13805 at arch/x86/kernel/irq_64.c:64
> handle_irq+0x1fb/0x2e7 arch/x86/kernel/irq_64.c:73
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 0 PID: 13805 Comm: syz-executor5 Not tainted 4.19.0-rc1+ #215
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
>  panic+0x238/0x4e7 kernel/panic.c:184
>  __warn.cold.8+0x163/0x1ba kernel/panic.c:536
>  report_bug+0x252/0x2d0 lib/bug.c:186
>  fixup_bug arch/x86/kernel/traps.c:178 [inline]
>  do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
>  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:993
> RIP: 0010:stack_overflow_check arch/x86/kernel/irq_64.c:61 [inline]
> RIP: 0010:handle_irq+0x1fb/0x2e7 arch/x86/kernel/irq_64.c:73
> Code: 00 00 ff b6 80 00 00 00 48 c7 c7 80 ca 24 87 41 54 41 55 65 48 8b 04
> 25 40 ee 01 00 48 05 68 06 00 00 48 89 c6 e8 95 c4 1c 00 <0f> 0b 48 83 c4 18
> e9 3f ff ff ff 48 89 75 e0 e8 c1 fe 90 00 48 8b
> RSP: 0018:8801db007f58 EFLAGS: 00010082
> RAX:  RBX: 8801cee0ad80 RCX: 
> RDX: 0001 RSI: 8163ac01 RDI: 0001
> RBP: 8801db007fb0 R08: 8801c9b4a700 R09: ed003b603eca
> R10: ed003b603eca R11: 8801db01f657 R12: fe011000
> R13: fe007080 R14: 002a R15: 
>  do_IRQ+0x80/0x1a0 arch/x86/kernel/irq.c:246
>  common_interrupt+0xf/0xf arch/x86/entry/entry_64.S:643
>  
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
> syzbot.


Re: [PATCH net 1/2] selftests: pmtu: maximum MTU for vti4 is 2^16-1-20

2018-08-30 Thread Nicolas Dichtel
Le 30/08/2018 à 16:01, Sabrina Dubroca a écrit :
> Since commit 82612de1c98e ("ip_tunnel: restore binding to ifaces with a
> large mtu"), the maximum MTU for vti4 is based on IP_MAX_MTU instead of
> the mysterious constant 0xFFF8.  This makes this selftest fail.
> 
> Fixes: 82612de1c98e ("ip_tunnel: restore binding to ifaces with a large mtu")
> Signed-off-by: Sabrina Dubroca 
> Acked-by: Stefano Brivio 

Thanks for fixing this.

Acked-by: Nicolas Dichtel 


[PATCH net] ipv6: don't get lwtstate twice in ip6_rt_copy_init()

2018-08-30 Thread Alexey Kodanev
Commit 80f1a0f4e0cd ("net/ipv6: Put lwtstate when destroying fib6_info")
partially fixed the kmemleak [1], lwtstate can be copied from fib6_info,
with ip6_rt_copy_init(), and it should be done only once there.

rt->dst.lwtstate is set by ip6_rt_init_dst(), at the start of the function
ip6_rt_copy_init(), so there is no need to get it again at the end.

With this patch, lwtstate also isn't copied from RTF_REJECT routes.

[1]:
unreferenced object 0x880b6aaa14e0 (size 64):
  comm "ip", pid 10577, jiffies 4295149341 (age 1273.903s)
  hex dump (first 32 bytes):
01 00 04 00 04 00 00 00 10 00 00 00 00 00 00 00  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[<18664623>] lwtunnel_build_state+0x1bc/0x420
[] ip6_route_info_create+0x9f7/0x1fd0
[] ip6_route_add+0x14/0x70
[<8537b55c>] inet6_rtm_newroute+0xd9/0xe0
[<2acc50f5>] rtnetlink_rcv_msg+0x66f/0x8e0
[<8d9cd381>] netlink_rcv_skb+0x268/0x3b0
[<4c893c76>] netlink_unicast+0x417/0x5a0
[] netlink_sendmsg+0x70b/0xc30
[<890ff0aa>] sock_sendmsg+0xb1/0xf0
[] ___sys_sendmsg+0x659/0x950
[<1e7426c8>] __sys_sendmsg+0xde/0x170
[] do_syscall_64+0x9f/0x4a0
[<1be7b28b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<6d21f353>] 0x

Fixes: 6edb3c96a5f0 ("net/ipv6: Defer initialization of dst to data path")
Signed-off-by: Alexey Kodanev 
---
 net/ipv6/route.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 8e08a91..9f27ada 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -996,7 +996,6 @@ static void ip6_rt_copy_init(struct rt6_info *rt, struct 
fib6_info *ort)
rt->rt6i_src = ort->fib6_src;
 #endif
rt->rt6i_prefsrc = ort->fib6_prefsrc;
-   rt->dst.lwtstate = lwtstate_get(ort->fib6_nh.nh_lwtstate);
 }
 
 static struct fib6_node* fib6_backtrack(struct fib6_node *fn,
-- 
1.8.3.1



Re: [PATCH net] ipv6: don't get lwtstate twice in ip6_rt_copy_init()

2018-08-30 Thread David Ahern
On 8/30/18 10:11 AM, Alexey Kodanev wrote:
> Commit 80f1a0f4e0cd ("net/ipv6: Put lwtstate when destroying fib6_info")
> partially fixed the kmemleak [1], lwtstate can be copied from fib6_info,
> with ip6_rt_copy_init(), and it should be done only once there.
> 
> rt->dst.lwtstate is set by ip6_rt_init_dst(), at the start of the function
> ip6_rt_copy_init(), so there is no need to get it again at the end.
> 
> With this patch, lwtstate also isn't copied from RTF_REJECT routes.

Those should not have lwtstate set.

> 
> [1]:
> unreferenced object 0x880b6aaa14e0 (size 64):
>   comm "ip", pid 10577, jiffies 4295149341 (age 1273.903s)
>   hex dump (first 32 bytes):
> 01 00 04 00 04 00 00 00 10 00 00 00 00 00 00 00  
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>   backtrace:
> [<18664623>] lwtunnel_build_state+0x1bc/0x420
> [] ip6_route_info_create+0x9f7/0x1fd0
> [] ip6_route_add+0x14/0x70
> [<8537b55c>] inet6_rtm_newroute+0xd9/0xe0
> [<2acc50f5>] rtnetlink_rcv_msg+0x66f/0x8e0
> [<8d9cd381>] netlink_rcv_skb+0x268/0x3b0
> [<4c893c76>] netlink_unicast+0x417/0x5a0
> [] netlink_sendmsg+0x70b/0xc30
> [<890ff0aa>] sock_sendmsg+0xb1/0xf0
> [] ___sys_sendmsg+0x659/0x950
> [<1e7426c8>] __sys_sendmsg+0xde/0x170
> [] do_syscall_64+0x9f/0x4a0
> [<1be7b28b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [<6d21f353>] 0x

What test did you run to uncover this? Curious as to why my testing that
found the need for 80f1a0f4e0cd did not hit this.

> 
> Fixes: 6edb3c96a5f0 ("net/ipv6: Defer initialization of dst to data path")
> Signed-off-by: Alexey Kodanev 
> ---
>  net/ipv6/route.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 8e08a91..9f27ada 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -996,7 +996,6 @@ static void ip6_rt_copy_init(struct rt6_info *rt, struct 
> fib6_info *ort)
>   rt->rt6i_src = ort->fib6_src;
>  #endif
>   rt->rt6i_prefsrc = ort->fib6_prefsrc;
> - rt->dst.lwtstate = lwtstate_get(ort->fib6_nh.nh_lwtstate);
>  }
>  
>  static struct fib6_node* fib6_backtrack(struct fib6_node *fn,
> 

Thanks for the patch.

Reviewed-by: David Ahern 



Re: [PATCH net-next 1/2] netlink: ipv4 IGMP join notifications

2018-08-30 Thread Patrick Ruddy
Don't know what happened to the 0/2 cover for this series so here it
is:

This patch is an update to https://patchwork.ozlabs.org/patch/571127/.
The
previous patch was based on sending multicast MAC addresses in the
netlink messages to allow the programming of hardware. It was agreed to
rework this to use RTM_NEW/DELLINK messages which were more appropriate
for layer 2 addresses.
In the interim period it has become apparent that the applications
actually
needs to see the L3 multicast addresses which are joined for FORUS
processing so this patch has been reworked to send the L3 multicast
addresses using RTM_NEW/DELADDR.
These new multicast L3 netlink notifications should use the
IFA_MULTICAST
address type but this has been dropped in favour of IFA_ADDRESS as
during
testing it was noticed that some applications - notably getaddrinfo in
lib6c assume that there is an IFA_ADDRESS in a RTM_NEW/DELADDR and
blindly dereference it.
Finally the RTM_GETADDR for both address families has been modified to
include the multicast l3 addresses.

Patrick Ruddy (2):
  netlink: ipv4 IGMP join notifications
  netlink: ipv6 MLD join notifications

 include/linux/igmp.h |  2 +
 net/ipv4/devinet.c   | 39 +--
 net/ipv4/igmp.c  | 90 
 net/ipv6/addrconf.c  | 44 --
 net/ipv6/mcast.c | 66 
 5 files changed, 218 insertions(+), 23 deletions(-)

-- 
2.17.1

On Thu, 2018-08-30 at 10:35 +0100, Patrick Ruddy wrote:
> Some userspace applications need to know about IGMP joins from the kernel
> for 2 reasons
> 1. To allow the programming of multicast MAC filters in hardware
> 2. To form a multicast FORUS list for non link-local multicast
>groups to be sent to the kernel and from there to the interested
>party.
> (1) can be fulfilled but simply sending the hardware multicast MAC
> address to be programmed but (2) requires the L3 address to be sent
> since this cannot be constructed from the MAC address whereas the
> reverse translation is a standard library function.
> 
> This commit provides addition and deletion of multicast addresses
> using the RTM_NEWADDR and RTM_DELADDR messages. It also provides
> the RTM_GETADDR extension to allow multicast join state to be read
> from the kernel.
> 
> Signed-off-by: Patrick Ruddy 
> ---
>  include/linux/igmp.h |  2 +
>  net/ipv4/devinet.c   | 39 +--
>  net/ipv4/igmp.c  | 90 
>  3 files changed, 120 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/igmp.h b/include/linux/igmp.h
> index 119f53941c12..1fb417865e7d 100644
> --- a/include/linux/igmp.h
> +++ b/include/linux/igmp.h
> @@ -130,6 +130,8 @@ extern void ip_mc_unmap(struct in_device *);
>  extern void ip_mc_remap(struct in_device *);
>  extern void ip_mc_dec_group(struct in_device *in_dev, __be32 addr);
>  extern void ip_mc_inc_group(struct in_device *in_dev, __be32 addr);
> +extern int ip_mc_dump_ifaddr(struct sk_buff *skb, struct netlink_callback 
> *cb,
> +  struct net_device *dev);
>  int ip_mc_check_igmp(struct sk_buff *skb, struct sk_buff **skb_trimmed);
>  
>  #endif
> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
> index ea4bd8a52422..42f7dcc4fb5e 100644
> --- a/net/ipv4/devinet.c
> +++ b/net/ipv4/devinet.c
> @@ -57,6 +57,7 @@
>  #endif
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -1651,6 +1652,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
> netlink_callback *cb)
>   int h, s_h;
>   int idx, s_idx;
>   int ip_idx, s_ip_idx;
> + int multicast, mcast_idx;
>   struct net_device *dev;
>   struct in_device *in_dev;
>   struct in_ifaddr *ifa;
> @@ -1659,6 +1661,8 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
> netlink_callback *cb)
>   s_h = cb->args[0];
>   s_idx = idx = cb->args[1];
>   s_ip_idx = ip_idx = cb->args[2];
> + multicast = cb->args[3];
> + mcast_idx = cb->args[4];
>  
>   for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
>   idx = 0;
> @@ -1675,18 +1679,29 @@ static int inet_dump_ifaddr(struct sk_buff *skb, 
> struct netlink_callback *cb)
>   if (!in_dev)
>   goto cont;
>  
> - for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
> -  ifa = ifa->ifa_next, ip_idx++) {
> - if (ip_idx < s_ip_idx)
> - continue;
> - if (inet_fill_ifaddr(skb, ifa,
> -  NETLINK_CB(cb->skb).portid,
> -  cb->nlh->nlmsg_seq,
> -  RTM_NEWADDR, NLM_F_MULTI) < 0) {
> - rcu_read_unlock();
> - goto done;
> + if (!multicast) 

Re: [PATCH 2/4] r8169: Get and enable optional ether_clk clock

2018-08-30 Thread Stephen Boyd
Quoting Hans de Goede (2018-08-29 10:09:57)
> Hi,
> 
> On 27-08-18 21:14, Stephen Boyd wrote:
> > Quoting Hans de Goede (2018-08-27 11:53:19)
> >> On 27-08-18 20:47, Stephen Boyd wrote:
> >>> How would you know that a clk device driver hasn't probed yet and isn't
> >>> the driver that's actually providing the clk to this device on x86
> >>> systems? With DT systems we can figure that out by looking at the DT and
> >>> seeing if the device driver requesting the clk has the clocks property.
> >>> On x86 systems it's all clkdev which doesn't really lend itself to
> >>> solving this problem.
> >>
> >> Right on x86 the assumption is that the clk driver will be builtin and
> >> will probe before the consumer. In this case that is true as the
> >> pmc-atom-clk driver can only be builtin and its platform device is
> >> instantiated from the acpi_lpss code and acpi init happens before
> >> the PCI bus is scanned.
> > 
> > If we can go with this assumption then we can make the optional clk API
> > work even on clkdev based systems. Maybe if x86 had some way of
> > indicating that all builtin clks are registered?
> 
> Unfortunately there is no such thing I'm afraid.

Ugh!

> 
> > That might work but
> > it's not very clean. Or if we could check to see if we're running on an
> > ACPI based system in clkdev we could use that to assume that clk_get()
> > will only be called after all providers have registered their lookups.
> 
> Yes some check for x86 + ACPI (ARM also uses ACPI, but there we
> should no do this AFAICT) is probably best. That or not use the
> new optional clk API on x86, but that means that any cross platform
> driver cannot use it, which would be a pain.

Right. The optional clk API will be not so great until we can get ACPI
to move way from clkdev.

> 
> BTW does your Acked-by indicate you are ok with merging this series
> through the netdev tree as I suggested in the cover-letter? If so
> can I also add your Acked-by to the 3th patch ?
> 

Yep, I thought I did that but now I've really done it.



Re: [PATCH 1/2] dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle

2018-08-30 Thread Grygorii Strashko




On 08/29/2018 07:47 PM, Tony Lindgren wrote:

* Grygorii Strashko  [180830 00:12]:

Hi Tony,

On 08/29/2018 10:00 AM, Tony Lindgren wrote:

The current cpsw usage for cpsw-phy-sel is undocumented but is used for
all the boards using cpsw. And cpsw-phy-sel is not really a child of
the cpsw device, it lives in the system control module instead.

Let's document the existing usage, and improve it a bit where we prefer
to use a phandle instead of a child device for it. That way we can
properly describe the hardware in dts files for things like genpd.


I'm ok with this series, but I really don't like cpsw-phy-sel in general.


Yeah this binding predates any standards. This series
only fixes the nasty issue of cpsw claiming a module as a
child that's outside it's IO range.


It was introduced long time back and now I'm thinking about possibility to 
replace it with
one of current generic interfaces - for example mux-controller.
Each port will control up to 3 muxes (port mode, idmode and rmii_ext_clk) and
transform phy-mode => mux states.
What do you think?


Sure a mux-controller here makes sense.


Another option is to use phy, but it'd be complicated.


For the port muxes, how about a phy driver just using
a pinctrl driver?

In general, it seems cpsw is just an interconnect instance
(L4_FAST) with a control module (CPSW_WR) and a pile of
independent other modules. That's described nicely in
am437x TRM chapter "2.1.4 L4 Fast Peripheral Memory Map".
So from that point of view the binding reg entries right
now are all wrong :)


TRM not consistent - for am5 it's one MMIO region.



In the long run cpsw should be really treated as an
interconnect instance with it's control module providing
standard Linux framework services such as clock /
regulator / phy / pinctrl / iio whatever for the other
modules.

Just my 2c based on looking at the interconnect, I'm
not too familiar with cpsw otherwise.


It's not separate modules. this is composite module which have only one 
fck/ick and most of blocks can't even function without each other.

Above might be the case for Keystone 2, but not omap CPSW.
Keystone 2 - has packet processor, security accelerator, queue manager 
in addition to its basic switch block.


--
regards,
-grygorii


Re: [PATCH v2 iproute2-next 0/3] support delivering packets in

2018-08-30 Thread David Ahern
On 8/26/18 8:42 PM, Yousuk Seung wrote:
> This series adds support for the new "slot" netem parameter for
> slotting. Slotting is an approximation of shared media that gather up
> packets within a varying delay window before delivering them nearly at
> once.
> 
> Dave Taht (2):
>   tc: support conversions to or from 64 bit nanosecond-based time
>   q_netem: support delivering packets in delayed time slots
> 
> Yousuk Seung (1):
>   q_netem: slotting with non-uniform distribution
> 
>  include/utils.h |  12 +
>  lib/utils.c | 104 +++
>  man/man8/tc-netem.8 |  40 ++-
>  tc/q_netem.c| 115 +++-
>  tc/tc_cbq.c |   1 +
>  tc/tc_core.c|   1 +
>  tc/tc_core.h|   2 -
>  tc/tc_estimator.c   |   1 +
>  tc/tc_util.c|  46 --
>  tc/tc_util.h|   3 --
>  10 files changed, 272 insertions(+), 53 deletions(-)
> 

applied to iproute2-next after fixing up a whitespace issue and 2
checkpatch errors in patch 2.



[PATCH net] ibmvnic: Include missing return code checks in reset function

2018-08-30 Thread Thomas Falcon
Check the return codes of these functions and halt reset
in case of failure. The driver will remain in a dormant state
until the next reset event, when device initialization will be
re-attempted.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index ffe7acb..d834308 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1841,11 +1841,17 @@ static int do_reset(struct ibmvnic_adapter *adapter,
adapter->map_id = 1;
release_rx_pools(adapter);
release_tx_pools(adapter);
-   init_rx_pools(netdev);
-   init_tx_pools(netdev);
+   rc = init_rx_pools(netdev);
+   if (rc)
+   return rc;
+   rc = init_tx_pools(netdev);
+   if (rc)
+   return rc;
 
release_napi(adapter);
-   init_napi(adapter);
+   rc = init_napi(adapter);
+   if (rc)
+   return rc;
} else {
rc = reset_tx_pools(adapter);
if (rc)
-- 
1.8.3.1



[bpf-next 0/3] Introduce eBPF flow dissector

2018-08-30 Thread Petar Penkov
From: Petar Penkov 

This patch series hardens the RX stack by allowing flow dissection in BPF,
as previously discussed [1]. Because of the rigorous checks of the BPF
verifier, this provides significant security guarantees. In particular, the
BPF flow dissector cannot get inside of an infinite loop, as with
CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
read outside of packet bounds, because all memory accesses are checked.
Also, with BPF the administrator can decide which protocols to support,
reducing potential attack surface. Rarely encountered protocols can be
excluded from dissection and the program can be updated without kernel
recompile or reboot if a bug is discovered.

Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
This includes a new BPF program and attach type.

Patch 2 adds a flow dissector program in BPF. This parses most protocols in
__skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports,
and address types).

Patch 3 adds a selftest that attaches the BPF program to the flow dissector
and sends traffic with different levels of encapsulation.

Performance Evaluation:
The in-kernel implementation was compared against the demo program from
patch 2 using the test in patch 3 with IPv4/UDP traffic over 10 seconds.
$perf record -a -C 4 taskset -c 4 ./test_flow_dissector -i 4 -f 8 \
-t 10

In-kernel Dissector:
__skb_flow_dissect overhead: 2.12%
Total Packets: 3,272,597 (from output of ./test_flow_dissector)

BPF Dissector:
__skb_flow_dissect overhead: 1.63% 
Total Packets: 3,232,356 (from output of ./test_flow_dissector)

No-op Dissector:
__skb_flow_dissect overhead: 1.52% 
Total Packets: 3,330,635 (from output of ./test_flow_dissector)

Changes since RFC:

1/ (Patch 1) Flow dissector hook is no longer global. Instead, it is
per-netns

2/ (Patch 1) Defined struct bpf_flow_keys to be used in BPF flow dissector
programs instead of exposing the internal flow keys layout. Added a
function to translate from bpf_flow_keys to the internal layout after BPF
dissection is complete. The pointer to this struct is stored in
qdisc_skb_cb rather than inside of the 20 byte control block which
simplifies verification and allows access to all 20 bytes of the cb.

3/ (Patch 2) Removed GUE parsing as it relied on a hardcoded port

4/ (Patch 2) MPLS parsing now stops at the first label which is consistent
with the in-kernel flow dissector

5/ (Patch 2) Refactored to use direct packet access and to write out to
struct bpf_flow_keys

[1] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

Petar Penkov (3):
  flow_dissector: implements flow dissector BPF hook
  flow_dissector: implements eBPF parser
  selftests/bpf: test bpf flow dissection

 include/linux/bpf.h   |   1 +
 include/linux/bpf_types.h |   1 +
 include/linux/skbuff.h|   7 +
 include/net/net_namespace.h   |   3 +
 include/net/sch_generic.h |  12 +-
 include/uapi/linux/bpf.h  |  25 +
 kernel/bpf/syscall.c  |   8 +
 kernel/bpf/verifier.c |  33 +
 net/core/filter.c |  67 ++
 net/core/flow_dissector.c | 136 +++
 tools/bpf/bpftool/prog.c  |   1 +
 tools/include/uapi/linux/bpf.h|  25 +
 tools/lib/bpf/libbpf.c|   2 +
 tools/testing/selftests/bpf/.gitignore|   2 +
 tools/testing/selftests/bpf/Makefile  |   8 +-
 tools/testing/selftests/bpf/bpf_flow.c| 390 +
 tools/testing/selftests/bpf/config|   1 +
 .../selftests/bpf/flow_dissector_load.c   | 140 
 .../selftests/bpf/test_flow_dissector.c   | 782 ++
 .../selftests/bpf/test_flow_dissector.sh  | 115 +++
 tools/testing/selftests/bpf/with_addr.sh  |  54 ++
 tools/testing/selftests/bpf/with_tunnels.sh   |  36 +
 22 files changed, 1843 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_flow.c
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c
 create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c
 create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh
 create mode 100755 tools/testing/selftests/bpf/with_addr.sh
 create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh

-- 
2.19.0.rc0.228.g281dcd1b4d0-goog



[PATCH v2 1/2] IB/ipoib: Use dev_port to expose network interface port numbers

2018-08-30 Thread Arseny Maslennikov
Some InfiniBand network devices have multiple ports on the same PCI
function. This initializes the `dev_port' sysfs field of those
network interfaces with their port number.

Prior to this the kernel erroneously used the `dev_id' sysfs
field of those network interfaces to convey the port number to userspace.

The use of `dev_id' was considered correct until Linux 3.15,
when another field, `dev_port', was defined for this particular
purpose and `dev_id' was reserved for distinguishing stacked ifaces
(e.g: VLANs) with the same hardware address as their parent device.

Similar fixes to net/mlx4_en and many other drivers, which started
exporting this information through `dev_id' before 3.15, were accepted
into the kernel 4 years ago.
See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs').

Signed-off-by: Arseny Maslennikov 
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index e3d28f9ad9c0..ba16a63ee303 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1880,7 +1880,7 @@ static int ipoib_parent_init(struct net_device *ndev)
   sizeof(union ib_gid));
 
SET_NETDEV_DEV(priv->dev, priv->ca->dev.parent);
-   priv->dev->dev_id = priv->port - 1;
+   priv->dev->dev_port = priv->port - 1;
 
return 0;
 }
-- 
2.18.0



[PATCH v2 2/2] Documentation/ABI: document /sys/class/net/*/dev_port

2018-08-30 Thread Arseny Maslennikov
The sysfs field was introduced 4 years ago along with fixes to various
drivers that erroneously used `dev_id' for that purpose, but it was not
properly documented anywhere.
See commit v3.14-rc3-739-g3f85944fe207.

Signed-off-by: Arseny Maslennikov 
---
 Documentation/ABI/testing/sysfs-class-net | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-net 
b/Documentation/ABI/testing/sysfs-class-net
index 2f1788111cd9..1593d8997ade 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -91,6 +91,16 @@ Description:
stacked (e.g: VLAN interfaces) but still have the same MAC
address as their parent device.
 
+What:  /sys/class/net//dev_port
+Date:  February 2014
+KernelVersion: 3.15
+Contact:   netdev@vger.kernel.org
+Description:
+   Indicates the port number of this network device, formatted
+   as a decimal value. Some NICs have multiple independent ports
+   on the same PCI bus, device and function. This field allows
+   userspace to distinguish the respective interfaces.
+
 What:  /sys/class/net//dormant
 Date:  March 2006
 KernelVersion: 2.6.17
-- 
2.18.0



[PATCH v2 0/2] IB/ipoib: Use dev_port to disambiguate port numbers

2018-08-30 Thread Arseny Maslennikov
Pre-3.15 userspace had trouble distinguishing different ports
of a NIC on a single PCI bus/device/function. To solve this,
a sysfs field `dev_port' was introduced quite a while ago
(commit v3.14-rc3-739-g3f85944fe207), and some relevant device
drivers were fixed to use it, but not in case of IPoIB.

The convention for some reason never got documented in the kernel, but
was immediately adopted by userspace (notably udev[1][2], biosdevname[3])

3/3 documents the sysfs field — that's why I'm CC-ing netdev.

This series was tested on and applies to 4.19-rc1.

[1] https://lists.freedesktop.org/archives/systemd-devel/2014-June/020788.html
[2] https://lists.freedesktop.org/archives/systemd-devel/2014-July/020804.html
[3] 
https://github.com/CloudAutomationNTools/biosdevname/blob/c795d51dd93a5309652f0d635f12a3ecfabfaa72/src/eths.c#L38

v1->v2: replace a line instead of inserting and then removing.

Arseny Maslennikov (2):
  IB/ipoib: Use dev_port to expose network interface port numbers
  Documentation/ABI: document /sys/class/net/*/dev_port

 Documentation/ABI/testing/sysfs-class-net | 10 ++
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  2 +-
 2 files changed, 11 insertions(+), 1 deletion(-)

-- 
2.18.0



[bpf-next 2/3] flow_dissector: implements eBPF parser

2018-08-30 Thread Petar Penkov
From: Petar Penkov 

This eBPF program extracts basic/control/ip address/ports keys from
incoming packets. It supports recursive parsing for IP encapsulation,
and VLAN, along with IPv4/IPv6 and extension headers.  This program is
meant to show how flow dissection and key extraction can be done in
eBPF.

Link: http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/bpf/Makefile   |   2 +-
 tools/testing/selftests/bpf/bpf_flow.c | 390 +
 2 files changed, 391 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_flow.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index fff7fb1285fc..e65f50f9185e 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-   test_skb_cgroup_id_kern.o
+   test_skb_cgroup_id_kern.o bpf_flow.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/bpf_flow.c 
b/tools/testing/selftests/bpf/bpf_flow.c
new file mode 100644
index ..6d8388735d1d
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_flow.c
@@ -0,0 +1,390 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+#define PROG(F) SEC(#F) int bpf_func_##F
+
+/* These are the identifiers of the BPF programs that will be used in tail
+ * calls. Name is limited to 16 characters, with the terminating character and
+ * bpf_func_ above, we have only 6 to work with, anything after will be 
cropped.
+ */
+enum {
+   IP,
+   IPV6,
+   IPV6OP, /* Destination/Hop-by-Hop Options IPv6 Extension header */
+   IPV6FR, /* Fragmentation IPv6 Extension Header */
+   MPLS,
+   VLAN,
+};
+
+#define IP_MF  0x2000
+#define IP_OFFSET  0x1FFF
+#define IP6_MF 0x0001
+#define IP6_OFFSET 0xFFF8
+
+struct vlan_hdr {
+   __be16 h_vlan_TCI;
+   __be16 h_vlan_encapsulated_proto;
+};
+
+struct gre_hdr {
+   __be16 flags;
+   __be16 proto;
+};
+
+struct frag_hdr {
+   __u8 nexthdr;
+   __u8 reserved;
+   __be16 frag_off;
+   __be32 identification;
+};
+
+struct bpf_map_def SEC("maps") jmp_table = {
+   .type = BPF_MAP_TYPE_PROG_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(__u32),
+   .max_entries = 8
+};
+
+struct bpf_dissect_cb {
+   __u16 nhoff;
+};
+
+static __always_inline void *bpf_flow_dissect_get_header(struct __sk_buff *skb,
+__u16 hdr_size)
+{
+   struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+   void *data_end = (__u8 *)(long)skb->data_end;
+   void *data = (__u8 *)(long)skb->data;
+   __u8 *hdr;
+
+   /* Verifies this variable offset does not overflow */
+   if (cb->nhoff > (USHRT_MAX - hdr_size))
+   return NULL;
+
+   hdr = data + cb->nhoff;
+   if (hdr + hdr_size > data_end)
+   return NULL;
+
+   return hdr;
+}
+
+/* Dispatches on ETHERTYPE */
+static __always_inline int parse_eth_proto(struct __sk_buff *skb, __be16 proto)
+{
+   struct bpf_flow_keys *keys = (struct bpf_flow_keys 
*)(long)(skb->flow_keys);
+   keys->n_proto = proto;
+
+   switch (proto) {
+   case bpf_htons(ETH_P_IP):
+   bpf_tail_call(skb, &jmp_table, IP);
+   break;
+   case bpf_htons(ETH_P_IPV6):
+   bpf_tail_call(skb, &jmp_table, IPV6);
+   break;
+   case bpf_htons(ETH_P_MPLS_MC):
+   case bpf_htons(ETH_P_MPLS_UC):
+   bpf_tail_call(skb, &jmp_table, MPLS);
+   break;
+   case bpf_htons(ETH_P_8021Q):
+   case bpf_htons(ETH_P_8021AD):
+   bpf_tail_call(skb, &jmp_table, VLAN);
+   break;
+   default:
+   /* Protocol not supported */
+   return BPF_DROP;
+   }
+
+   return BPF_DROP;
+}
+
+static __always_inline void write_ports(struct __sk_buff *skb)
+{
+   struct bpf_flow_keys *keys = (struct bpf_flow_keys 
*)(long)(skb->flow_keys);
+   struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+   void *data = (__u8 *)(long)skb->data;
+   __be16 *ports = data + cb->nhoff; /* TCP/UDP start with the ports */
+
+   keys->sport = ports[0];

[bpf-next 1/3] flow_dissector: implements flow dissector BPF hook

2018-08-30 Thread Petar Penkov
From: Petar Penkov 

Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is per-network namespace.

Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 include/linux/bpf.h|   1 +
 include/linux/bpf_types.h  |   1 +
 include/linux/skbuff.h |   7 ++
 include/net/net_namespace.h|   3 +
 include/net/sch_generic.h  |  12 ++-
 include/uapi/linux/bpf.h   |  25 ++
 kernel/bpf/syscall.c   |   8 ++
 kernel/bpf/verifier.c  |  33 
 net/core/filter.c  |  67 
 net/core/flow_dissector.c  | 136 +
 tools/bpf/bpftool/prog.c   |   1 +
 tools/include/uapi/linux/bpf.h |  25 ++
 tools/lib/bpf/libbpf.c |   2 +
 13 files changed, 318 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 523481a3471b..988a00797bcd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -212,6 +212,7 @@ enum bpf_reg_type {
PTR_TO_PACKET_META,  /* skb->data - meta_len */
PTR_TO_PACKET,   /* reg points to skb->data */
PTR_TO_PACKET_END,   /* skb->data + headlen */
+   PTR_TO_FLOW_KEYS,/* reg points to bpf_flow_keys */
 };
 
 /* The information passed from prog-specific *_is_valid_access
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c090e7c0..22083712dd18 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
 #endif
+BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 17a13e4785fc..ce0e863f02a2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -243,6 +243,8 @@ struct scatterlist;
 struct pipe_inode_info;
 struct iov_iter;
 struct napi_struct;
+struct bpf_prog;
+union bpf_attr;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector 
*flow_dissector,
 const struct flow_dissector_key *key,
 unsigned int key_count);
 
+int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
+  struct bpf_prog *prog);
+
+int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
+
 bool __skb_flow_dissect(const struct sk_buff *skb,
struct flow_dissector *flow_dissector,
void *target_container,
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 9b5fdc50519a..99d4148e0f90 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -43,6 +43,7 @@ struct ctl_table_header;
 struct net_generic;
 struct uevent_sock;
 struct netns_ipvs;
+struct bpf_prog;
 
 
 #define NETDEV_HASHBITS8
@@ -145,6 +146,8 @@ struct net {
 #endif
struct net_generic __rcu*gen;
 
+   struct bpf_prog __rcu   *flow_dissector_prog;
+
/* Note : following structs are cache line aligned */
 #ifdef CONFIG_XFRM
struct netns_xfrm   xfrm;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index a6d00093f35e..1b81ba85fd2d 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -19,6 +19,7 @@ struct Qdisc_ops;
 struct qdisc_walker;
 struct tcf_walker;
 struct module;
+struct bpf_flow_keys;
 
 typedef int tc_setup_cb_t(enum tc_setup_type type,
  void *type_data, void *cb_priv);
@@ -307,9 +308,14 @@ struct tcf_proto {
 };
 
 struct qdisc_skb_cb {
-   unsigned intpkt_len;
-   u16 slave_dev_queue_mapping;
-   u16 tc_classid;
+   union {
+   struct {
+   unsigned intpkt_len;
+   u16 slave_dev_queue_mapping;
+   u16 tc_classid;
+   };
+   struct bpf_flow_keys *flow_keys;
+   };
 #define QDISC_CB_PRIV_LEN 20
unsigned char   data[QDISC_CB_PRIV_LEN];
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4eba27..3064706fcaaa 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_LWT_SEG6LOCAL,
BPF_PROG_TYPE_LIRC_MODE2,
BPF_PROG_TYPE_SK_REUSEPORT,
+   BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
BPF_CGROUP_UDP4_SENDMSG,
BPF_CGROUP_UDP6_SEN

[bpf-next 3/3] selftests/bpf: test bpf flow dissection

2018-08-30 Thread Petar Penkov
From: Petar Penkov 

Adds a test that sends different types of packets over multiple
tunnels and verifies that valid packets are dissected correctly.  To do
so, a tc-flower rule is added to drop packets on UDP src port 9, and
packets are sent from ports 8, 9, and 10. Only the packets on port 9
should be dropped. Because tc-flower relies on the flow dissector to
match flows, correct classification demonstrates correct dissection.

Also add support logic to load the BPF program and to inject the test
packets.

Signed-off-by: Petar Penkov 
Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/bpf/.gitignore|   2 +
 tools/testing/selftests/bpf/Makefile  |   6 +-
 tools/testing/selftests/bpf/config|   1 +
 .../selftests/bpf/flow_dissector_load.c   | 140 
 .../selftests/bpf/test_flow_dissector.c   | 782 ++
 .../selftests/bpf/test_flow_dissector.sh  | 115 +++
 tools/testing/selftests/bpf/with_addr.sh  |  54 ++
 tools/testing/selftests/bpf/with_tunnels.sh   |  36 +
 8 files changed, 1134 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c
 create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c
 create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh
 create mode 100755 tools/testing/selftests/bpf/with_addr.sh
 create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 49938d72cf63..e61a85ac4b79 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -19,3 +19,5 @@ test_btf
 test_sockmap
 test_lirc_mode2_user
 get_cgroup_id_user
+test_flow_dissector
+flow_dissector_load
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index e65f50f9185e..fd3851d5c079 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -47,10 +47,12 @@ TEST_PROGS := test_kmod.sh \
test_tunnel.sh \
test_lwt_seg6local.sh \
test_lirc_mode2.sh \
-   test_skb_cgroup_id.sh
+   test_skb_cgroup_id.sh \
+   test_flow_dissector.sh
 
 # Compile but not part of 'make run_tests'
-TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr 
test_skb_cgroup_id_user
+TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr 
test_skb_cgroup_id_user \
+   flow_dissector_load test_flow_dissector
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/bpf/config 
b/tools/testing/selftests/bpf/config
index b4994a94968b..3655508f95fd 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -18,3 +18,4 @@ CONFIG_CRYPTO_HMAC=m
 CONFIG_CRYPTO_SHA256=m
 CONFIG_VXLAN=y
 CONFIG_GENEVE=y
+CONFIG_NET_CLS_FLOWER=m
diff --git a/tools/testing/selftests/bpf/flow_dissector_load.c 
b/tools/testing/selftests/bpf/flow_dissector_load.c
new file mode 100644
index ..d3273b5b3173
--- /dev/null
+++ b/tools/testing/selftests/bpf/flow_dissector_load.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+const char *cfg_pin_path = "/sys/fs/bpf/flow_dissector";
+const char *cfg_map_name = "jmp_table";
+bool cfg_attach = true;
+char *cfg_section_name;
+char *cfg_path_name;
+
+static void load_and_attach_program(void)
+{
+   struct bpf_program *prog, *main_prog;
+   struct bpf_map *prog_array;
+   int i, fd, prog_fd, ret;
+   struct bpf_object *obj;
+   int prog_array_fd;
+
+   ret = bpf_prog_load(cfg_path_name, BPF_PROG_TYPE_FLOW_DISSECTOR, &obj,
+   &prog_fd);
+   if (ret)
+   error(1, 0, "bpf_prog_load %s", cfg_path_name);
+
+   main_prog = bpf_object__find_program_by_title(obj, cfg_section_name);
+   if (!main_prog)
+   error(1, 0, "bpf_object__find_program_by_title %s",
+ cfg_section_name);
+
+   prog_fd = bpf_program__fd(main_prog);
+   if (prog_fd < 0)
+   error(1, 0, "bpf_program__fd");
+
+   prog_array = bpf_object__find_map_by_name(obj, cfg_map_name);
+   if (!prog_array)
+   error(1, 0, "bpf_object__find_map_by_name %s", cfg_map_name);
+
+   prog_array_fd = bpf_map__fd(prog_array);
+   if (prog_array_fd < 0)
+   error(1, 0, "bpf_map__fd %s", cfg_map_name);
+
+   i = 0;
+   bpf_object__for_each_program(prog, obj) {
+   fd = bpf_program__fd(prog);
+   if (fd < 0)
+   error(1, 0, "bpf_program__fd");
+
+   if (fd != prog_fd) {
+   printf("%d: %s\n", i, bpf_program__title(prog, false));
+   bpf_map_update_elem(prog_array_fd, &i, &fd, BPF_ANY);
+   ++i;
+   }
+   }
+
+   ret = bpf_prog_attach(prog_fd, 0 

Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()

2018-08-30 Thread Wolfgang Walter
Hello,

kernels > 4.12 do not work on one of our main routers. They crash as soon
as ipsec-tunnels are configured and ipsec-traffic actually flows.
 
Just configuring ipsec (that is starting strongswan) does not trigger the
oops.
 
I finally found time to bisect that. It bisected down to

b838d5e1c5b6e57b10ec8af2268824041e3ea911
ipv4: mark DST_NOGC and remove the operation of dst_free()

Now we have other machines which run just fine with the very same kernels
doing ipsec. They differ insofar as they have much less cores, do not use
the ixgbe driver, do not have 10G and terminate only a few tunnels instead
of hundreds.

I already tested distribution kernels > 4.12 from debian, they also crash.

All kernels I created in the bisection run fine if I didn't use ipsec.
The bad ones all oopsed/crashed exactly as vanilla 4.14 described above.


Here is the bisect-log:

# bad: [bebc6082da0a9f5d47a1ea2edc099bf671058bd4] Linux 4.14
# good: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9
git bisect start 'v4.14' 'v4.9'
# good: [d82dd0e34d0347be201fd274dc84cd645dccc064] raid1: prefer disk without 
bad blocks
git bisect good d82dd0e34d0347be201fd274dc84cd645dccc064
# bad: [9967468c0a109644e4a1f5b39b39bf86fe7507a7] Merge branch 'akpm' (patches 
from Andrew)
git bisect bad 9967468c0a109644e4a1f5b39b39bf86fe7507a7
# bad: [17d9aa66b08de445645bd0688fc1635bed77a57b] Merge tag 
'iwlwifi-next-for-kalle-2017-06-30' of 
git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next
git bisect bad 17d9aa66b08de445645bd0688fc1635bed77a57b
# good: [de4d195308ad589626571dbe5789cebf9695a204] Merge branch 
'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good de4d195308ad589626571dbe5789cebf9695a204
# good: [9376906c17fa975bf6a7ea9dd124be697bcda289] Merge branch 
'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 9376906c17fa975bf6a7ea9dd124be697bcda289
# good: [40e86a3619a1e84ad73c716c943f65fc38eb1e28] iwlwifi: mvm: use 
scnprintf() instead of snprintf()
git bisect good 40e86a3619a1e84ad73c716c943f65fc38eb1e28
# bad: [c66f2091c9248ddf42504c74cd327ae8619b04a4] net/mlx5e: Prevent PFC call 
for non ethernet ports
git bisect bad c66f2091c9248ddf42504c74cd327ae8619b04a4
# good: [a090bd4ff8387c409732a8e059fbf264ea0bdd56] Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good a090bd4ff8387c409732a8e059fbf264ea0bdd56
# good: [1947030645b6012aeee98da764d6dd47071a6aad] Merge branch 
'dsa-prefix-Global-macros'
git bisect good 1947030645b6012aeee98da764d6dd47071a6aad
# good: [69137ea60c9dad58773a1918de6c1b00b088520c] pktgen: Specify num packets 
per thread
git bisect good 69137ea60c9dad58773a1918de6c1b00b088520c
# good: [d24406c85d123df773bc4df88ad5da2233896919] udp: call dst_hold_safe() in 
udp_sk_rx_set_dst()
git bisect good d24406c85d123df773bc4df88ad5da2233896919
# bad: [5b7c9a8ff828287af5aebe93e707271bf1a82cc3] net: remove dst gc related 
code
git bisect bad 5b7c9a8ff828287af5aebe93e707271bf1a82cc3
# bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and 
remove the operation of dst_free()
git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
# good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new 
function dst_dev_put()
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
# good: [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() 
properly
git bisect good 95c47f9cf5e028d1ae77dc6c767c1edc8a18025b
# good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() 
properly
git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
# first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark 
DST_NOGC and remove the operation of dst_free()


In my first email I wrote >= 4.12, but I think 4.12 works. I bisected between
4.9 and 4.14 as we actually run 4.9 on the machine with the problem and 4.14
on most other routers.

I also tested 4.18.5 and it still shows this bug.


Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts


Re: [PATCH net-next] packet: add sockopt to ignore outgoing packets

2018-08-30 Thread Willem de Bruijn
On Thu, Aug 30, 2018 at 6:12 AM Vincent Whitchurch
 wrote:
>
> Currently, the only way to ignore outgoing packets on a packet socket is
> via the BPF filter.  With MSG_ZEROCOPY, packets that are looped into
> AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
> if the filter run from packet_rcv() would reject them.  So the presence
> of a packet socket on the interface takes away the benefits of
> MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
> packets.  (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
> cloned, but the cost for that is much lower.)
>
> Add a socket option to allow AF_PACKET sockets to ignore outgoing
> packets to solve this.  Note that the *BSDs already have something
> similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.
>
> The first intended user is lldpd.

Clear description of the use case, thanks. I don't see a simple alternative
to introducing a new socket option, either (a new ETH_P_xx protocol
wildcard different from ETH_P_ALL, perhaps).

> Signed-off-by: Vincent Whitchurch 
> ---
>  include/linux/netdevice.h  |  1 +
>  include/uapi/linux/if_packet.h |  1 +
>  net/core/dev.c |  3 +++
>  net/packet/af_packet.c | 15 +++
>  4 files changed, 20 insertions(+)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ca5ab98053c8..8ef14d9edc58 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2317,6 +2317,7 @@ static inline struct sk_buff 
> *call_gro_receive_sk(gro_receive_sk_t cb,
>
>  struct packet_type {
> __be16  type;   /* This is really htons(ether_type). 
> */
> +   boolignore_outgoing;
> struct net_device   *dev;   /* NULL is wildcarded here   
> */
> int (*func) (struct sk_buff *,
>  struct net_device *,
> diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> index 67b61d91d89b..467b654bd4c7 100644
> --- a/include/uapi/linux/if_packet.h
> +++ b/include/uapi/linux/if_packet.h
> @@ -57,6 +57,7 @@ struct sockaddr_ll {
>  #define PACKET_QDISC_BYPASS20
>  #define PACKET_ROLLOVER_STATS  21
>  #define PACKET_FANOUT_DATA 22
> +#define PACKET_IGNORE_OUTGOING 23


>
>  #define PACKET_FANOUT_HASH 0
>  #define PACKET_FANOUT_LB   1
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 325fc5088370..0addb4f0abfe 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1947,6 +1947,9 @@ static inline bool skb_loop_sk(struct packet_type 
> *ptype, struct sk_buff *skb)
> if (!ptype->af_packet_priv || !skb->sk)
> return false;
>
> +   if (ptype->ignore_outgoing)
> +   return true;
> +

This probably does not belong in skb_loop_sk, but in
dev_queue_xmit_nit directly.

> if (ptype->id_match)
> return ptype->id_match(ptype, skb->sk);
> else if ((struct sock *)ptype->af_packet_priv == skb->sk)


[RFC PATCH net-next 0/4] net: batched receive in GRO path

2018-08-30 Thread Edward Cree
This series listifies part of GRO processing, in a manner which allows those
 packets which are not GROed (i.e. for which dev_gro_receive returns
 GRO_NORMAL) to be passed on to the listified regular receive path.
I have not listified dev_gro_receive() itself, or the per-protocol GRO
 callback, since GRO's need to hold packets on lists under napi->gro_hash
 makes keeping the packets on other lists awkward, and since the GRO control
 block state of held skbs can refer only to one 'new' skb at a time.
 Nonetheless the batching of the calling code yields some performance gains
 in the GRO case as well.

Herewith the performance figures obtained in a NetPerf TCP stream test (with
 four streams, and irqs bound to a single core):
net-next: 7.166 Gbit/s (sigma 0.435)
after #2: 7.715 Gbit/s (sigma 0.145) = datum + 7.7%
after #4: 7.890 Gbit/s (sigma 0.217) = datum + 10.1%
(Note that the 'net-next' results were distinctly bimodal, with two results
 of about 8 Gbit/s and the remaining ten around 7 Gbit/s.  I don't have a
 good explanation for this.)
And with GRO disabled through ethtool -K (thus simulating traffic which is
 not GRO-able but, being TCP, is still passed to the GRO entry point):
net-next: 4.756 Gbit/s (sigma 0.240)
after #4: 5.355 Gbit/s (sigma 0.232) = datum + 12.6%

Edward Cree (4):
  net: introduce list entry point for GRO
  sfc: use batched receive for GRO
  net: make listified RX functions return number of good packets
  net/core: handle GRO_NORMAL skbs as a list in napi_gro_receive_list

 drivers/net/ethernet/sfc/efx.c|  11 +++-
 drivers/net/ethernet/sfc/net_driver.h |   1 +
 drivers/net/ethernet/sfc/rx.c |  16 +-
 include/linux/netdevice.h |   6 +-
 include/net/ip.h  |   4 +-
 include/net/ipv6.h|   4 +-
 net/core/dev.c| 104 ++
 net/ipv4/ip_input.c   |  39 -
 net/ipv6/ip6_input.c  |  37 +++-
 9 files changed, 157 insertions(+), 65 deletions(-)



[RFC PATCH net-next 1/4] net: introduce list entry point for GRO

2018-08-30 Thread Edward Cree
Also export napi_frags_skb() so that drivers using the napi_gro_frags()
 interface can prepare their SKBs properly for submitting on such a list.

Signed-off-by: Edward Cree 
---
 include/linux/netdevice.h |  2 ++
 net/core/dev.c| 28 +++-
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ca5ab98053c8..a4f55142d3eb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3521,8 +3521,10 @@ int netif_receive_skb(struct sk_buff *skb);
 int netif_receive_skb_core(struct sk_buff *skb);
 void netif_receive_skb_list(struct list_head *head);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
+int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
+struct sk_buff *napi_frags_skb(struct napi_struct *napi);
 gro_result_t napi_gro_frags(struct napi_struct *napi);
 struct packet_offload *gro_find_receive_by_type(__be16 type);
 struct packet_offload *gro_find_complete_by_type(__be16 type);
diff --git a/net/core/dev.c b/net/core/dev.c
index 325fc5088370..d1e55bef84eb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5596,6 +5596,31 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, 
struct sk_buff *skb)
 }
 EXPORT_SYMBOL(napi_gro_receive);
 
+/* Returns the number of SKBs on the list successfully received */
+int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head)
+{
+   struct sk_buff *skb, *next;
+   gro_result_t result;
+   int kept = 0;
+
+   list_for_each_entry(skb, head, list) {
+   skb_mark_napi_id(skb, napi);
+   trace_napi_gro_receive_entry(skb);
+   skb_gro_reset_offset(skb);
+   }
+
+   list_for_each_entry_safe(skb, next, head, list) {
+   list_del(&skb->list);
+   skb->next = NULL;
+   result = dev_gro_receive(napi, skb);
+   result = napi_skb_finish(result, skb);
+   if (result != GRO_DROP)
+   kept++;
+   }
+   return kept;
+}
+EXPORT_SYMBOL(napi_gro_receive_list);
+
 static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
 {
if (unlikely(skb->pfmemalloc)) {
@@ -5667,7 +5692,7 @@ static gro_result_t napi_frags_finish(struct napi_struct 
*napi,
  * Drivers could call both napi_gro_frags() and napi_gro_receive()
  * We copy ethernet header into skb->data to have a common layout.
  */
-static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
+struct sk_buff *napi_frags_skb(struct napi_struct *napi)
 {
struct sk_buff *skb = napi->skb;
const struct ethhdr *eth;
@@ -5703,6 +5728,7 @@ static struct sk_buff *napi_frags_skb(struct napi_struct 
*napi)
 
return skb;
 }
+EXPORT_SYMBOL(napi_frags_skb);
 
 gro_result_t napi_gro_frags(struct napi_struct *napi)
 {



[RFC PATCH net-next 2/4] sfc: use batched receive for GRO

2018-08-30 Thread Edward Cree
Signed-off-by: Edward Cree 
---
 drivers/net/ethernet/sfc/efx.c| 11 +--
 drivers/net/ethernet/sfc/net_driver.h |  1 +
 drivers/net/ethernet/sfc/rx.c | 16 +---
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 330233286e78..dba13a28014c 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -263,9 +263,9 @@ static int efx_check_disabled(struct efx_nic *efx)
  */
 static int efx_process_channel(struct efx_channel *channel, int budget)
 {
+   struct list_head rx_list, gro_list;
struct efx_tx_queue *tx_queue;
-   struct list_head rx_list;
-   int spent;
+   int spent, gro_count;
 
if (unlikely(!channel->enabled))
return 0;
@@ -275,6 +275,10 @@ static int efx_process_channel(struct efx_channel 
*channel, int budget)
INIT_LIST_HEAD(&rx_list);
channel->rx_list = &rx_list;
 
+   EFX_WARN_ON_PARANOID(channel->gro_list != NULL);
+   INIT_LIST_HEAD(&gro_list);
+   channel->gro_list = &gro_list;
+
efx_for_each_channel_tx_queue(tx_queue, channel) {
tx_queue->pkts_compl = 0;
tx_queue->bytes_compl = 0;
@@ -300,6 +304,9 @@ static int efx_process_channel(struct efx_channel *channel, 
int budget)
/* Receive any packets we queued up */
netif_receive_skb_list(channel->rx_list);
channel->rx_list = NULL;
+   gro_count = napi_gro_receive_list(&channel->napi_str, 
channel->gro_list);
+   channel->irq_mod_score += gro_count * 2;
+   channel->gro_list = NULL;
 
return spent;
 }
diff --git a/drivers/net/ethernet/sfc/net_driver.h 
b/drivers/net/ethernet/sfc/net_driver.h
index 961b92979640..72addac7a84a 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -502,6 +502,7 @@ struct efx_channel {
unsigned int rx_pkt_index;
 
struct list_head *rx_list;
+   struct list_head *gro_list;
 
struct efx_rx_queue rx_queue;
struct efx_tx_queue tx_queue[EFX_TXQ_TYPES];
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 396ff01298cd..0534a54048c6 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -453,9 +453,19 @@ efx_rx_packet_gro(struct efx_channel *channel, struct 
efx_rx_buffer *rx_buf,
 
skb_record_rx_queue(skb, channel->rx_queue.core_index);
 
-   gro_result = napi_gro_frags(napi);
-   if (gro_result != GRO_DROP)
-   channel->irq_mod_score += 2;
+   /* Pass the packet up */
+   if (channel->gro_list != NULL) {
+   /* Clear napi->skb and prepare skb for GRO */
+   skb = napi_frags_skb(napi);
+   if (skb)
+   /* Add to list, will pass up later */
+   list_add_tail(&skb->list, channel->gro_list);
+   } else {
+   /* No list, so pass it up now */
+   gro_result = napi_gro_frags(napi);
+   if (gro_result != GRO_DROP)
+   channel->irq_mod_score += 2;
+   }
 }
 
 /* Allocate and construct an SKB around page fragments */



[RFC PATCH net-next 3/4] net: make listified RX functions return number of good packets

2018-08-30 Thread Edward Cree
Signed-off-by: Edward Cree 
---
 include/linux/netdevice.h |  4 +--
 include/net/ip.h  |  4 +--
 include/net/ipv6.h|  4 +--
 net/core/dev.c| 63 +--
 net/ipv4/ip_input.c   | 39 ++---
 net/ipv6/ip6_input.c  | 37 +---
 6 files changed, 92 insertions(+), 59 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a4f55142d3eb..ba5f985688eb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2322,7 +2322,7 @@ struct packet_type {
 struct net_device *,
 struct packet_type *,
 struct net_device *);
-   void(*list_func) (struct list_head *,
+   int (*list_func) (struct list_head *,
  struct packet_type *,
  struct net_device *);
bool(*id_match)(struct packet_type *ptype,
@@ -3519,7 +3519,7 @@ int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
 int netif_receive_skb_core(struct sk_buff *skb);
-void netif_receive_skb_list(struct list_head *head);
+int netif_receive_skb_list(struct list_head *head);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
diff --git a/include/net/ip.h b/include/net/ip.h
index e44b1a44f67a..aab1f7eea1e1 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -152,8 +152,8 @@ int ip_build_and_send_pkt(struct sk_buff *skb, const struct 
sock *sk,
  struct ip_options_rcu *opt);
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
   struct net_device *orig_dev);
-void ip_list_rcv(struct list_head *head, struct packet_type *pt,
-struct net_device *orig_dev);
+int ip_list_rcv(struct list_head *head, struct packet_type *pt,
+   struct net_device *orig_dev);
 int ip_local_deliver(struct sk_buff *skb);
 int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index ff33f498c137..f15651eabfe0 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -914,8 +914,8 @@ static inline __be32 flowi6_get_flowlabel(const struct 
flowi6 *fl6)
 
 int ipv6_rcv(struct sk_buff *skb, struct net_device *dev,
 struct packet_type *pt, struct net_device *orig_dev);
-void ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
-  struct net_device *orig_dev);
+int ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
+ struct net_device *orig_dev);
 
 int ip6_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index d1e55bef84eb..ac9741273c62 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4920,24 +4920,27 @@ int netif_receive_skb_core(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb_core);
 
-static inline void __netif_receive_skb_list_ptype(struct list_head *head,
- struct packet_type *pt_prev,
- struct net_device *orig_dev)
+static inline int __netif_receive_skb_list_ptype(struct list_head *head,
+struct packet_type *pt_prev,
+struct net_device *orig_dev)
 {
struct sk_buff *skb, *next;
+   int kept = 0;
 
if (!pt_prev)
-   return;
+   return 0;
if (list_empty(head))
-   return;
+   return 0;
if (pt_prev->list_func != NULL)
-   pt_prev->list_func(head, pt_prev, orig_dev);
+   kept = pt_prev->list_func(head, pt_prev, orig_dev);
else
list_for_each_entry_safe(skb, next, head, list)
-   pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+   if (pt_prev->func(skb, skb->dev, pt_prev, orig_dev) == 
NET_RX_SUCCESS)
+   kept++;
+   return kept;
 }
 
-static void __netif_receive_skb_list_core(struct list_head *head, bool 
pfmemalloc)
+static int __netif_receive_skb_list_core(struct list_head *head, bool 
pfmemalloc)
 {
/* Fast-path assumptions:
 * - There is no RX handler.
@@ -4954,6 +4957,7 @@ static void __netif_receive_skb_list_core(struct 
list_head *head, bool pfmemallo
struct net_device *od_curr = NULL;
struct list_head sublist;
struct sk_buff *skb, *next;
+   int kept = 0, ret;

[RFC PATCH net-next 4/4] net/core: handle GRO_NORMAL skbs as a list in napi_gro_receive_list

2018-08-30 Thread Edward Cree
Allows GRO-using drivers to get the benefits of batching for non-GROable
 traffic.

Signed-off-by: Edward Cree 
---
 net/core/dev.c | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index ac9741273c62..f9391f76feb7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5615,6 +5615,7 @@ EXPORT_SYMBOL(napi_gro_receive);
 int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head)
 {
struct sk_buff *skb, *next;
+   struct list_head sublist;
gro_result_t result;
int kept = 0;
 
@@ -5624,14 +5625,26 @@ int napi_gro_receive_list(struct napi_struct *napi, 
struct list_head *head)
skb_gro_reset_offset(skb);
}
 
+   INIT_LIST_HEAD(&sublist);
list_for_each_entry_safe(skb, next, head, list) {
list_del(&skb->list);
skb->next = NULL;
result = dev_gro_receive(napi, skb);
-   result = napi_skb_finish(result, skb);
-   if (result != GRO_DROP)
-   kept++;
+   if (result == GRO_NORMAL) {
+   list_add_tail(&skb->list, &sublist);
+   continue;
+   } else {
+   if (!list_empty(&sublist)) {
+   /* Handle the GRO_NORMAL skbs to prevent OoO */
+   kept += 
netif_receive_skb_list_internal(&sublist);
+   INIT_LIST_HEAD(&sublist);
+   }
+   result = napi_skb_finish(result, skb);
+   if (result != GRO_DROP)
+   kept++;
+   }
}
+   kept += netif_receive_skb_list_internal(&sublist);
return kept;
 }
 EXPORT_SYMBOL(napi_gro_receive_list);


Re: [PATCH v2 1/2] IB/ipoib: Use dev_port to expose network interface port numbers

2018-08-30 Thread Doug Ledford
On Thu, 2018-08-30 at 21:22 +0300, Arseny Maslennikov wrote:
> Some InfiniBand network devices have multiple ports on the same PCI
> function. This initializes the `dev_port' sysfs field of those
> network interfaces with their port number.
> 
> Prior to this the kernel erroneously used the `dev_id' sysfs
> field of those network interfaces to convey the port number to userspace.
> 
> The use of `dev_id' was considered correct until Linux 3.15,
> when another field, `dev_port', was defined for this particular
> purpose and `dev_id' was reserved for distinguishing stacked ifaces
> (e.g: VLANs) with the same hardware address as their parent device.
> 
> Similar fixes to net/mlx4_en and many other drivers, which started
> exporting this information through `dev_id' before 3.15, were accepted
> into the kernel 4 years ago.
> See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs').
> 
> Signed-off-by: Arseny Maslennikov 
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
> b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index e3d28f9ad9c0..ba16a63ee303 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -1880,7 +1880,7 @@ static int ipoib_parent_init(struct net_device *ndev)
>  sizeof(union ib_gid));
>  
>   SET_NETDEV_DEV(priv->dev, priv->ca->dev.parent);
> - priv->dev->dev_id = priv->port - 1;
> + priv->dev->dev_port = priv->port - 1;

I don't know that we can't do this.  At least not yet.  Expose the new
item to make us compliant with the new docs, and deprecate the old sysfs
item, but we can't just yank the old item.  Existing tools/scripts might
(probably) rely on it (existing tools already special case IPoIB
interfaces and we'll need to make sure they don't special case this
element too).

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

signature.asc
Description: This is a digitally signed message part


[PATCH net-next 0/2] netlink: multicast join notifications

2018-08-30 Thread Patrick Ruddy
This patch is an update to https://patchwork.ozlabs.org/patch/571127/. The
previous patch was based on sending multicast MAC addresses in the
netlink messages to allow the programming of hardware. It was agreed to
rework this to use RTM_NEW/DELLINK messages which were more appropriate
for layer 2 addresses.
In the interim period it has become apparent that the applications actually
needs to see the L3 multicast addresses which are joined for FORUS
processing so this patch has been reworked to send the L3 multicast
addresses using RTM_NEW/DELADDR.
These new multicast L3 netlink notifications should use the IFA_MULTICAST
address type but this has been dropped in favour of IFA_ADDRESS as during
testing it was noticed that some applications - notably getaddrinfo in
lib6c assume that there is an IFA_ADDRESS in a RTM_NEW/DELADDR and
blindly dereference it.
Finally the RTM_GETADDR for both address families has been modified to
include the multicast l3 addresses.

Patrick Ruddy (2):
  netlink: ipv4 IGMP join notifications
  netlink: ipv6 MLD join notifications

 include/linux/igmp.h |  2 +
 net/ipv4/devinet.c   | 39 +--
 net/ipv4/igmp.c  | 90 
 net/ipv6/addrconf.c  | 44 --
 net/ipv6/mcast.c | 66 
 5 files changed, 218 insertions(+), 23 deletions(-)

-- 
2.17.1



[net-next v2 00/15][pull request] 40GbE Intel Wired LAN Driver Updates 2018-08-30

2018-08-30 Thread Jeff Kirsher
This series contains updates to i40e, i40evf and virtchnl.

Jake implements helper functions to use an array to handle the queue
stats which reduces the boiler plate code as well as keep the complexity
localized to a few functions.

Paweł adds the ability to change a VF's MAC address from the host side
without having to reload the VF driver on the guest side.

Paul adds a check to ensure that the number of queues that the PF sends
to the VF is equal to or less than the maximum number of queues the VF
can support.

Mitch fixes an issue caught by GCC 8, where we need to not include the
terminating null in the length of the string for strncpy().

Lihong fixes a VF issue to ensure that it does not enter into
promiscuous mode when macvlan is added to the VF.  Fixed a potential
crash after a VF is removed, since the workqueue sync for the adminq
task was not being cancelled.

Harshitha fixes the type for field_flags in the virtchnl_filter struct.

Martyna removes an unnecessary check in a conditional if statement.

Björn fixes an issue reported by Jesper Dangaard Brouer, where the
driver was reporting incorrect statistics when XDP was enabled.

Jan fixes the potential reporting of incorrect speed settings.

Patryk fixed an issue where the flag
I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING was getting set when any offload is
set via ethtool.  Resolved by only setting this flag when VLAN offload
is enabled.  Also ensure we hold the rtnl lock when we are clearing the
interrupt scheme.  Added a check when deleting the MAC address from the
VF to ensure that the MAC address was not set by the PF and if it was,
do not delete it. 

v2: updated patch 2 in the series based on community feedback from David
Miller to inline a function

The following are changes since commit f0259b6ac4a3d27f6b7e938d6fce367cea377063:
  Merge tag 'mac80211-next-for-davem-2018-08-29' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Björn Töpel (1):
  i40e: report correct statistics when XDP is enabled

Harshitha Ramamurthy (1):
  virtchnl: use u8 type for a field in the virtchnl_filter struct

Jacob Keller (3):
  i40e: convert queue stats to i40e_stats array
  i40e: move ethtool stats boiler plate code to i40e_ethtool_stats.h
  i40evf: update ethtool stats code and use helper functions

Jan Sokolowski (1):
  i40e: Check and correct speed values for link on open

Lihong Yang (2):
  i40evf: set IFF_UNICAST_FLT flag for the VF
  i40evf: cancel workqueue sync for adminq when a VF is removed

Martyna Szapar (1):
  i40e: static analysis report from community

Mitch Williams (1):
  i40e: use correct length for strncpy

Patryk Małek (3):
  i40evf: Don't enable vlan stripping when rx offload is turned on
  i40e: hold the rtnl lock on clearing interrupt scheme
  i40e: Prevent deleting MAC address from VF when set by PF

Paul M Stillwell Jr (1):
  i40evf: Validate the number of queues a PF sends

Paweł Jabłoński (1):
  i40evf: Change a VF mac without reloading the VF driver

 .../net/ethernet/intel/i40e/i40e_ethtool.c| 238 --
 .../ethernet/intel/i40e/i40e_ethtool_stats.h  | 221 
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  61 +++--
 drivers/net/ethernet/intel/i40e/i40e_ptp.c|   3 +-
 .../ethernet/intel/i40e/i40e_virtchnl_pf.c|  18 +-
 .../intel/i40evf/i40e_ethtool_stats.h | 221 
 .../ethernet/intel/i40evf/i40evf_ethtool.c| 137 +-
 .../net/ethernet/intel/i40evf/i40evf_main.c   |  15 +-
 .../ethernet/intel/i40evf/i40evf_virtchnl.c   |  43 +++-
 include/linux/avf/virtchnl.h  |   2 +-
 10 files changed, 678 insertions(+), 281 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h
 create mode 100644 drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h

-- 
2.17.1



[net-next v2 03/15] i40evf: update ethtool stats code and use helper functions

2018-08-30 Thread Jeff Kirsher
From: Jacob Keller 

Fix a bug in the way we handled VF queues, by always showing stats for
the maximum number of queues, even if they aren't allocated. It is not
safe to change the number of strings reported to ethtool, as grabbing
statistics occurs over multiple ethtool ops for which the rtnl_lock()
cannot be held the entire time.

Avoid this by always reporting queue stats for the maximum number of
queues in the netdevice. Share some of the helper functionality for
adding stats with the PF code in i40e_ethtool_stats.h

This should reduce the chance of potential future bugs, and make adding
new statistics easier.

Note for the queue stats, unlike the PF driver we do not keep an array
of queue pointers, but an array of queues, so care must be taken to
avoid accessing queue memory that hasn't yet been allocated.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../intel/i40evf/i40e_ethtool_stats.h | 221 ++
 .../ethernet/intel/i40evf/i40evf_ethtool.c| 137 ++-
 2 files changed, 298 insertions(+), 60 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h

diff --git a/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h 
b/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h
new file mode 100644
index ..60b595dd8c39
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h
@@ -0,0 +1,221 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2013 - 2018 Intel Corporation. */
+
+/* ethtool statistics helpers */
+
+/**
+ * struct i40e_stats - definition for an ethtool statistic
+ * @stat_string: statistic name to display in ethtool -S output
+ * @sizeof_stat: the sizeof() the stat, must be no greater than sizeof(u64)
+ * @stat_offset: offsetof() the stat from a base pointer
+ *
+ * This structure defines a statistic to be added to the ethtool stats buffer.
+ * It defines a statistic as offset from a common base pointer. Stats should
+ * be defined in constant arrays using the I40E_STAT macro, with every element
+ * of the array using the same _type for calculating the sizeof_stat and
+ * stat_offset.
+ *
+ * The @sizeof_stat is expected to be sizeof(u8), sizeof(u16), sizeof(u32) or
+ * sizeof(u64). Other sizes are not expected and will produce a WARN_ONCE from
+ * the i40e_add_ethtool_stat() helper function.
+ *
+ * The @stat_string is interpreted as a format string, allowing formatted
+ * values to be inserted while looping over multiple structures for a given
+ * statistics array. Thus, every statistic string in an array should have the
+ * same type and number of format specifiers, to be formatted by variadic
+ * arguments to the i40e_add_stat_string() helper function.
+ **/
+struct i40e_stats {
+   char stat_string[ETH_GSTRING_LEN];
+   int sizeof_stat;
+   int stat_offset;
+};
+
+/* Helper macro to define an i40e_stat structure with proper size and type.
+ * Use this when defining constant statistics arrays. Note that @_type expects
+ * only a type name and is used multiple times.
+ */
+#define I40E_STAT(_type, _name, _stat) { \
+   .stat_string = _name, \
+   .sizeof_stat = FIELD_SIZEOF(_type, _stat), \
+   .stat_offset = offsetof(_type, _stat) \
+}
+
+/* Helper macro for defining some statistics directly copied from the netdev
+ * stats structure.
+ */
+#define I40E_NETDEV_STAT(_net_stat) \
+   I40E_STAT(struct rtnl_link_stats64, #_net_stat, _net_stat)
+
+/* Helper macro for defining some statistics related to queues */
+#define I40E_QUEUE_STAT(_name, _stat) \
+   I40E_STAT(struct i40e_ring, _name, _stat)
+
+/* Stats associated with a Tx or Rx ring */
+static const struct i40e_stats i40e_gstrings_queue_stats[] = {
+   I40E_QUEUE_STAT("%s-%u.packets", stats.packets),
+   I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes),
+};
+
+/**
+ * i40evf_add_one_ethtool_stat - copy the stat into the supplied buffer
+ * @data: location to store the stat value
+ * @pointer: basis for where to copy from
+ * @stat: the stat definition
+ *
+ * Copies the stat data defined by the pointer and stat structure pair into
+ * the memory supplied as data. Used to implement i40e_add_ethtool_stats and
+ * i40evf_add_queue_stats. If the pointer is null, data will be zero'd.
+ */
+static inline void
+i40evf_add_one_ethtool_stat(u64 *data, void *pointer,
+   const struct i40e_stats *stat)
+{
+   char *p;
+
+   if (!pointer) {
+   /* ensure that the ethtool data buffer is zero'd for any stats
+* which don't have a valid pointer.
+*/
+   *data = 0;
+   return;
+   }
+
+   p = (char *)pointer + stat->stat_offset;
+   switch (stat->sizeof_stat) {
+   case sizeof(u64):
+   *data = *((u64 *)p);
+   break;
+   case sizeof(u32):
+   *data = *((u32 *)p);
+   break;
+   case sizeof(u16):
+

[net-next v2 09/15] i40e: static analysis report from community

2018-08-30 Thread Jeff Kirsher
From: Martyna Szapar 

Static analysis tools report a problem from original driver submission.
Removing unnecessary check in condition.

Signed-off-by: Martyna Szapar 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index ac685ad4d877..62f2b5bce6bb 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -13033,7 +13033,7 @@ struct i40e_veb *i40e_veb_setup(struct i40e_pf *pf, u16 
flags,
for (vsi_idx = 0; vsi_idx < pf->num_alloc_vsi; vsi_idx++)
if (pf->vsi[vsi_idx] && pf->vsi[vsi_idx]->seid == vsi_seid)
break;
-   if (vsi_idx >= pf->num_alloc_vsi && vsi_seid != 0) {
+   if (vsi_idx == pf->num_alloc_vsi && vsi_seid != 0) {
dev_info(&pf->pdev->dev, "vsi seid %d not found\n",
 vsi_seid);
return NULL;
-- 
2.17.1



[net-next v2 12/15] i40evf: Don't enable vlan stripping when rx offload is turned on

2018-08-30 Thread Jeff Kirsher
From: Patryk Małek 

With current implementation of i40evf_set_features when user sets
any offload via ethtool we set I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING
as a required aq which triggers driver to call
i40evf_enable_vlan_stripping. This shouldn't take place.
This patches fixes it by setting the flag only when VLAN offload
is turned on.

Signed-off-by: Patryk Małek 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 07f187b5bcac..c7048cf484fb 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -3120,18 +3120,19 @@ static int i40evf_set_features(struct net_device 
*netdev,
 {
struct i40evf_adapter *adapter = netdev_priv(netdev);
 
-   /* Don't allow changing VLAN_RX flag when VLAN is set for VF
-* and return an error in this case
+   /* Don't allow changing VLAN_RX flag when adapter is not capable
+* of VLAN offload
 */
-   if (VLAN_ALLOWED(adapter)) {
+   if (!VLAN_ALLOWED(adapter)) {
+   if ((netdev->features ^ features) & NETIF_F_HW_VLAN_CTAG_RX)
+   return -EINVAL;
+   } else if ((netdev->features ^ features) & NETIF_F_HW_VLAN_CTAG_RX) {
if (features & NETIF_F_HW_VLAN_CTAG_RX)
adapter->aq_required |=
I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING;
else
adapter->aq_required |=
I40EVF_FLAG_AQ_DISABLE_VLAN_STRIPPING;
-   } else if ((netdev->features ^ features) & NETIF_F_HW_VLAN_CTAG_RX) {
-   return -EINVAL;
}
 
return 0;
-- 
2.17.1



[net-next v2 14/15] i40evf: cancel workqueue sync for adminq when a VF is removed

2018-08-30 Thread Jeff Kirsher
From: Lihong Yang 

If a VF is being removed, there is no need to continue with the
workqueue sync for the adminq task, thus cancel it. Without this call,
when VFs are created and removed right away, there might be a chance for
the driver to crash with events stuck in the adminq.

Signed-off-by: Lihong Yang 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index c7048cf484fb..174d1da2857b 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -3910,6 +3910,8 @@ static void i40evf_remove(struct pci_dev *pdev)
if (adapter->watchdog_timer.function)
del_timer_sync(&adapter->watchdog_timer);
 
+   cancel_work_sync(&adapter->adminq_task);
+
i40evf_free_rss(adapter);
 
if (hw->aq.asq.count)
-- 
2.17.1



[net-next v2 05/15] i40evf: Validate the number of queues a PF sends

2018-08-30 Thread Jeff Kirsher
From: Paul M Stillwell Jr 

A PF can send any number of queues to the VF and the VF may not
be able to support that many. Check to see that the number of
queues is less than or equal to the max number of queues the
VF can have.

Signed-off-by: Paul M Stillwell Jr 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../ethernet/intel/i40evf/i40evf_virtchnl.c   | 32 +++
 1 file changed, 32 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
index 79b7be83b5c3..6579dabab78c 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
@@ -153,6 +153,32 @@ int i40evf_send_vf_config_msg(struct i40evf_adapter 
*adapter)
  NULL, 0);
 }
 
+/**
+ * i40evf_validate_num_queues
+ * @adapter: adapter structure
+ *
+ * Validate that the number of queues the PF has sent in
+ * VIRTCHNL_OP_GET_VF_RESOURCES is not larger than the VF can handle.
+ **/
+static void i40evf_validate_num_queues(struct i40evf_adapter *adapter)
+{
+   if (adapter->vf_res->num_queue_pairs > I40EVF_MAX_REQ_QUEUES) {
+   struct virtchnl_vsi_resource *vsi_res;
+   int i;
+
+   dev_info(&adapter->pdev->dev, "Received %d queues, but can only 
have a max of %d\n",
+adapter->vf_res->num_queue_pairs,
+I40EVF_MAX_REQ_QUEUES);
+   dev_info(&adapter->pdev->dev, "Fixing by reducing queues to 
%d\n",
+I40EVF_MAX_REQ_QUEUES);
+   adapter->vf_res->num_queue_pairs = I40EVF_MAX_REQ_QUEUES;
+   for (i = 0; i < adapter->vf_res->num_vsis; i++) {
+   vsi_res = &adapter->vf_res->vsi_res[i];
+   vsi_res->num_queue_pairs = I40EVF_MAX_REQ_QUEUES;
+   }
+   }
+}
+
 /**
  * i40evf_get_vf_config
  * @adapter: private adapter structure
@@ -195,6 +221,11 @@ int i40evf_get_vf_config(struct i40evf_adapter *adapter)
err = (i40e_status)le32_to_cpu(event.desc.cookie_low);
memcpy(adapter->vf_res, event.msg_buf, min(event.msg_len, len));
 
+   /* some PFs send more queues than we should have so validate that
+* we aren't getting too many queues
+*/
+   if (!err)
+   i40evf_validate_num_queues(adapter);
i40e_vf_parse_hw_config(hw, adapter->vf_res);
 out_alloc:
kfree(event.msg_buf);
@@ -1329,6 +1360,7 @@ void i40evf_virtchnl_completion(struct i40evf_adapter 
*adapter,
  I40E_MAX_VF_VSI *
  sizeof(struct virtchnl_vsi_resource);
memcpy(adapter->vf_res, msg, min(msglen, len));
+   i40evf_validate_num_queues(adapter);
i40e_vf_parse_hw_config(&adapter->hw, adapter->vf_res);
if (is_zero_ether_addr(adapter->hw.mac.addr)) {
/* restore current mac address */
-- 
2.17.1



[net-next v2 02/15] i40e: move ethtool stats boiler plate code to i40e_ethtool_stats.h

2018-08-30 Thread Jeff Kirsher
From: Jacob Keller 

Move the boiler plate structures and helper functions we recently
added into their own header file, so that the complete collection is
located together.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
v2: Based on feedback from David Miller, inline the
__i40e_add_stat_strings function

 .../net/ethernet/intel/i40e/i40e_ethtool.c| 182 +--
 .../ethernet/intel/i40e/i40e_ethtool_stats.h  | 221 ++
 2 files changed, 222 insertions(+), 181 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 235012b3bd42..d7d3974beca2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -6,25 +6,8 @@
 #include "i40e.h"
 #include "i40e_diag.h"
 
-struct i40e_stats {
-   /* The stat_string is expected to be a format string formatted using
-* vsnprintf by i40e_add_stat_strings. Every member of a stats array
-* should use the same format specifiers as they will be formatted
-* using the same variadic arguments.
-*/
-   char stat_string[ETH_GSTRING_LEN];
-   int sizeof_stat;
-   int stat_offset;
-};
-
-#define I40E_STAT(_type, _name, _stat) { \
-   .stat_string = _name, \
-   .sizeof_stat = FIELD_SIZEOF(_type, _stat), \
-   .stat_offset = offsetof(_type, _stat) \
-}
+#include "i40e_ethtool_stats.h"
 
-#define I40E_NETDEV_STAT(_net_stat) \
-   I40E_STAT(struct rtnl_link_stats64, #_net_stat, _net_stat)
 #define I40E_PF_STAT(_name, _stat) \
I40E_STAT(struct i40e_pf, _name, _stat)
 #define I40E_VSI_STAT(_name, _stat) \
@@ -173,11 +156,6 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] = 
{
I40E_PFC_STAT("port.rx_priority_%u_xon_2_xoff", priority_xon_2_xoff),
 };
 
-static const struct i40e_stats i40e_gstrings_queue_stats[] = {
-   I40E_QUEUE_STAT("%s-%u.packets", stats.packets),
-   I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes),
-};
-
 #define I40E_NETDEV_STATS_LEN  ARRAY_SIZE(i40e_gstrings_net_stats)
 
 #define I40E_MISC_STATS_LENARRAY_SIZE(i40e_gstrings_misc_stats)
@@ -1749,128 +1727,6 @@ static int i40e_get_sset_count(struct net_device 
*netdev, int sset)
}
 }
 
-/**
- * i40e_add_one_ethtool_stat - copy the stat into the supplied buffer
- * @data: location to store the stat value
- * @pointer: basis for where to copy from
- * @stat: the stat definition
- *
- * Copies the stat data defined by the pointer and stat structure pair into
- * the memory supplied as data. Used to implement i40e_add_ethtool_stats and
- * i40e_add_queue_stats. If the pointer is null, data will be zero'd.
- */
-static inline void
-i40e_add_one_ethtool_stat(u64 *data, void *pointer,
- const struct i40e_stats *stat)
-{
-   char *p;
-
-   if (!pointer) {
-   /* ensure that the ethtool data buffer is zero'd for any stats
-* which don't have a valid pointer.
-*/
-   *data = 0;
-   return;
-   }
-
-   p = (char *)pointer + stat->stat_offset;
-   switch (stat->sizeof_stat) {
-   case sizeof(u64):
-   *data = *((u64 *)p);
-   break;
-   case sizeof(u32):
-   *data = *((u32 *)p);
-   break;
-   case sizeof(u16):
-   *data = *((u16 *)p);
-   break;
-   case sizeof(u8):
-   *data = *((u8 *)p);
-   break;
-   default:
-   WARN_ONCE(1, "unexpected stat size for %s",
- stat->stat_string);
-   *data = 0;
-   }
-}
-
-/**
- * __i40e_add_ethtool_stats - copy stats into the ethtool supplied buffer
- * @data: ethtool stats buffer
- * @pointer: location to copy stats from
- * @stats: array of stats to copy
- * @size: the size of the stats definition
- *
- * Copy the stats defined by the stats array using the pointer as a base into
- * the data buffer supplied by ethtool. Updates the data pointer to point to
- * the next empty location for successive calls to __i40e_add_ethtool_stats.
- * If pointer is null, set the data values to zero and update the pointer to
- * skip these stats.
- **/
-static inline void
-__i40e_add_ethtool_stats(u64 **data, void *pointer,
-const struct i40e_stats stats[],
-const unsigned int size)
-{
-   unsigned int i;
-
-   for (i = 0; i < size; i++)
-   i40e_add_one_ethtool_stat((*data)++, pointer, &stats[i]);
-}
-
-/**
- * i40e_add_ethtool_stats - copy stats into ethtool supplied buffer
- * @data: ethtool stats buffer
- * @pointer: location where stats are stored
- * @stats: static const array of stat definitions
- *
- * Macro to ease the use of __i40e_add_ethtool_stats by taking a static
- * constant stat

[net-next v2 06/15] i40e: use correct length for strncpy

2018-08-30 Thread Jeff Kirsher
From: Mitch Williams 

Caught by GCC 8. When we provide a length for strncpy, we should not
include the terminating null. So we must tell it one less than the size
of the destination buffer.

Signed-off-by: Mitch Williams 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ptp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ptp.c 
b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
index 35f2866b38c6..1199f0502d6d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ptp.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
@@ -694,7 +694,8 @@ static long i40e_ptp_create_clock(struct i40e_pf *pf)
if (!IS_ERR_OR_NULL(pf->ptp_clock))
return 0;
 
-   strncpy(pf->ptp_caps.name, i40e_driver_name, sizeof(pf->ptp_caps.name));
+   strncpy(pf->ptp_caps.name, i40e_driver_name,
+   sizeof(pf->ptp_caps.name) - 1);
pf->ptp_caps.owner = THIS_MODULE;
pf->ptp_caps.max_adj = 9;
pf->ptp_caps.n_ext_ts = 0;
-- 
2.17.1



[net-next v2 10/15] i40e: report correct statistics when XDP is enabled

2018-08-30 Thread Jeff Kirsher
From: Björn Töpel 

When XDP is enabled, the driver will report incorrect
statistics. Received frames will reported as transmitted frames.

This commits fixes the i40e implementation of ndo_get_stats64 (struct
net_device_ops), so that iproute2 will report correct statistics
(e.g. when running "ip -stats link show dev eth0") even when XDP is
enabled.

Reported-by: Jesper Dangaard Brouer 
Fixes: 74608d17fe29 ("i40e: add support for XDP_TX action")
Signed-off-by: Björn Töpel 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 24 +++--
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 62f2b5bce6bb..6cb4076e8fba 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -420,9 +420,9 @@ static void i40e_get_netdev_stats_struct(struct net_device 
*netdev,
  struct rtnl_link_stats64 *stats)
 {
struct i40e_netdev_priv *np = netdev_priv(netdev);
-   struct i40e_ring *tx_ring, *rx_ring;
struct i40e_vsi *vsi = np->vsi;
struct rtnl_link_stats64 *vsi_stats = i40e_get_vsi_stats_struct(vsi);
+   struct i40e_ring *ring;
int i;
 
if (test_bit(__I40E_VSI_DOWN, vsi->state))
@@ -436,24 +436,26 @@ static void i40e_get_netdev_stats_struct(struct 
net_device *netdev,
u64 bytes, packets;
unsigned int start;
 
-   tx_ring = READ_ONCE(vsi->tx_rings[i]);
-   if (!tx_ring)
+   ring = READ_ONCE(vsi->tx_rings[i]);
+   if (!ring)
continue;
-   i40e_get_netdev_stats_struct_tx(tx_ring, stats);
+   i40e_get_netdev_stats_struct_tx(ring, stats);
 
-   rx_ring = &tx_ring[1];
+   if (i40e_enabled_xdp_vsi(vsi)) {
+   ring++;
+   i40e_get_netdev_stats_struct_tx(ring, stats);
+   }
 
+   ring++;
do {
-   start = u64_stats_fetch_begin_irq(&rx_ring->syncp);
-   packets = rx_ring->stats.packets;
-   bytes   = rx_ring->stats.bytes;
-   } while (u64_stats_fetch_retry_irq(&rx_ring->syncp, start));
+   start   = u64_stats_fetch_begin_irq(&ring->syncp);
+   packets = ring->stats.packets;
+   bytes   = ring->stats.bytes;
+   } while (u64_stats_fetch_retry_irq(&ring->syncp, start));
 
stats->rx_packets += packets;
stats->rx_bytes   += bytes;
 
-   if (i40e_enabled_xdp_vsi(vsi))
-   i40e_get_netdev_stats_struct_tx(&rx_ring[1], stats);
}
rcu_read_unlock();
 
-- 
2.17.1



[net-next v2 08/15] virtchnl: use u8 type for a field in the virtchnl_filter struct

2018-08-30 Thread Jeff Kirsher
From: Harshitha Ramamurthy 

The virtchnl_filter struct has a field called field_flags. A previous
commit mistakenly had the type to be a __u8. What we want is for the
field to be an unsigned 8 bit value, so let's just use the existing
kernel type u8 for that.

Signed-off-by: Harshitha Ramamurthy 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 include/linux/avf/virtchnl.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/avf/virtchnl.h b/include/linux/avf/virtchnl.h
index 212b3822d180..b41f7bc958ef 100644
--- a/include/linux/avf/virtchnl.h
+++ b/include/linux/avf/virtchnl.h
@@ -573,7 +573,7 @@ struct virtchnl_filter {
enumvirtchnl_flow_type flow_type;
enumvirtchnl_action action;
u32 action_meta;
-   __u8field_flags;
+   u8  field_flags;
 };
 
 VIRTCHNL_CHECK_STRUCT_LEN(272, virtchnl_filter);
-- 
2.17.1



[net-next v2 01/15] i40e: convert queue stats to i40e_stats array

2018-08-30 Thread Jeff Kirsher
From: Jacob Keller 

Use an i40e_stats array to handle the queue stats, instead of coding
similar functionality separately. Because of how the queue stats are
accessed on some kernels, we can't easily use i40e_add_ethtool_stats.

Instead, implement a separate helper, i40e_add_queue_stats, which we'll
use instead. This helper will correctly implement the
u64_stats_fetch_begin_irq logic and allow retries until successful. We
share the most complex code by re-using i40e_add_one_ethtool_stat.

This logic additionally easily supports skipping disabled rings by using
a ternary operator before calling the u64_stats_fetch_begin_irq()
function, so that we correctly zero-out the stats values without having
to perform two separate sections of code.

This significantly reduces the boiler plate code in
i40e_get_ethtool_stats, and helps keep the complex logic contained to as
few functions as possible.

With this patch, we've finally converted all the statistics to use the
helpers and the i40e_stats function.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../net/ethernet/intel/i40e/i40e_ethtool.c| 148 +++---
 1 file changed, 89 insertions(+), 59 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 5ff6caa83948..235012b3bd42 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -33,6 +33,8 @@ struct i40e_stats {
I40E_STAT(struct i40e_veb, _name, _stat)
 #define I40E_PFC_STAT(_name, _stat) \
I40E_STAT(struct i40e_pfc_stats, _name, _stat)
+#define I40E_QUEUE_STAT(_name, _stat) \
+   I40E_STAT(struct i40e_ring, _name, _stat)
 
 static const struct i40e_stats i40e_gstrings_net_stats[] = {
I40E_NETDEV_STAT(rx_packets),
@@ -171,20 +173,16 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] 
= {
I40E_PFC_STAT("port.rx_priority_%u_xon_2_xoff", priority_xon_2_xoff),
 };
 
-/* We use num_tx_queues here as a proxy for the maximum number of queues
- * available because we always allocate queues symmetrically.
- */
-#define I40E_MAX_NUM_QUEUES(n) ((n)->num_tx_queues)
-#define I40E_QUEUE_STATS_LEN(n)  \
-  (I40E_MAX_NUM_QUEUES(n)   \
-   * 2 /* Tx and Rx together */ \
-   * (sizeof(struct i40e_queue_stats) / sizeof(u64)))
-#define I40E_GLOBAL_STATS_LEN  ARRAY_SIZE(i40e_gstrings_stats)
+static const struct i40e_stats i40e_gstrings_queue_stats[] = {
+   I40E_QUEUE_STAT("%s-%u.packets", stats.packets),
+   I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes),
+};
+
 #define I40E_NETDEV_STATS_LEN  ARRAY_SIZE(i40e_gstrings_net_stats)
+
 #define I40E_MISC_STATS_LENARRAY_SIZE(i40e_gstrings_misc_stats)
-#define I40E_VSI_STATS_LEN(n)  (I40E_NETDEV_STATS_LEN + \
-I40E_MISC_STATS_LEN + \
-I40E_QUEUE_STATS_LEN((n)))
+
+#define I40E_VSI_STATS_LEN (I40E_NETDEV_STATS_LEN + I40E_MISC_STATS_LEN)
 
 #define I40E_PFC_STATS_LEN (ARRAY_SIZE(i40e_gstrings_pfc_stats) * \
 I40E_MAX_USER_PRIORITY)
@@ -193,10 +191,15 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] 
= {
 (ARRAY_SIZE(i40e_gstrings_veb_tc_stats) * \
  I40E_MAX_TRAFFIC_CLASS))
 
-#define I40E_PF_STATS_LEN(n)   (I40E_GLOBAL_STATS_LEN + \
+#define I40E_GLOBAL_STATS_LEN  ARRAY_SIZE(i40e_gstrings_stats)
+
+#define I40E_PF_STATS_LEN  (I40E_GLOBAL_STATS_LEN + \
 I40E_PFC_STATS_LEN + \
 I40E_VEB_STATS_LEN + \
-I40E_VSI_STATS_LEN((n)))
+I40E_VSI_STATS_LEN)
+
+/* Length of stats for a single queue */
+#define I40E_QUEUE_STATS_LEN   ARRAY_SIZE(i40e_gstrings_queue_stats)
 
 enum i40e_ethtool_test_id {
I40E_ETH_TEST_REG = 0,
@@ -1701,11 +1704,30 @@ static int i40e_get_stats_count(struct net_device 
*netdev)
struct i40e_netdev_priv *np = netdev_priv(netdev);
struct i40e_vsi *vsi = np->vsi;
struct i40e_pf *pf = vsi->back;
+   int stats_len;
 
if (vsi == pf->vsi[pf->lan_vsi] && pf->hw.partition_id == 1)
-   return I40E_PF_STATS_LEN(netdev);
+   stats_len = I40E_PF_STATS_LEN;
else
-   return I40E_VSI_STATS_LEN(netdev);
+   stats_len = I40E_VSI_STATS_LEN;
+
+   /* The number of stats reported for a given net_device must remain
+* constant throughout the life of that device.
+*
+* This is because the API for obtaining the size, strings, and stats
+* is spread out over three separate ethtool ioctls. There is no safe
+* way to lock the number of stats across these calls, so we must
+ 

[net-next v2 07/15] i40evf: set IFF_UNICAST_FLT flag for the VF

2018-08-30 Thread Jeff Kirsher
From: Lihong Yang 

Set IFF_UNICAST_FLT flag for the VF to prevent it from entering
promiscuous mode when macvlan is added to the VF.

Signed-off-by: Lihong Yang 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 5906c1c1d19d..07f187b5bcac 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -3358,6 +3358,8 @@ int i40evf_process_config(struct i40evf_adapter *adapter)
if (vfres->vf_cap_flags & VIRTCHNL_VF_OFFLOAD_VLAN)
netdev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
 
+   netdev->priv_flags |= IFF_UNICAST_FLT;
+
/* Do not turn on offloads when they are requested to be turned off.
 * TSO needs minimum 576 bytes to work correctly.
 */
-- 
2.17.1



[net-next v2 11/15] i40e: Check and correct speed values for link on open

2018-08-30 Thread Jeff Kirsher
From: Jan Sokolowski 

If our card has been put in an unstable state due to
other drivers interacting with it, speed settings
might be incorrect. If incorrect, forcefully reset them
on open to known default values.

Signed-off-by: Jan Sokolowski 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 27 ++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6cb4076e8fba..c30feb27e1c0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6570,6 +6570,24 @@ static i40e_status i40e_force_link_state(struct i40e_pf 
*pf, bool is_up)
struct i40e_hw *hw = &pf->hw;
i40e_status err;
u64 mask;
+   u8 speed;
+
+   /* Card might've been put in an unstable state by other drivers
+* and applications, which causes incorrect speed values being
+* set on startup. In order to clear speed registers, we call
+* get_phy_capabilities twice, once to get initial state of
+* available speeds, and once to get current PHY config.
+*/
+   err = i40e_aq_get_phy_capabilities(hw, false, true, &abilities,
+  NULL);
+   if (err) {
+   dev_err(&pf->pdev->dev,
+   "failed to get phy cap., ret =  %s last_status =  %s\n",
+   i40e_stat_str(hw, err),
+   i40e_aq_str(hw, hw->aq.asq_last_status));
+   return err;
+   }
+   speed = abilities.link_speed;
 
/* Get the current phy config */
err = i40e_aq_get_phy_capabilities(hw, false, false, &abilities,
@@ -6583,9 +6601,9 @@ static i40e_status i40e_force_link_state(struct i40e_pf 
*pf, bool is_up)
}
 
/* If link needs to go up, but was not forced to go down,
-* no need for a flap
+* and its speed values are OK, no need for a flap
 */
-   if (is_up && abilities.phy_type != 0)
+   if (is_up && abilities.phy_type != 0 && abilities.link_speed != 0)
return I40E_SUCCESS;
 
/* To force link we need to set bits for all supported PHY types,
@@ -6597,7 +6615,10 @@ static i40e_status i40e_force_link_state(struct i40e_pf 
*pf, bool is_up)
config.phy_type_ext = is_up ? (u8)((mask >> 32) & 0xff) : 0;
/* Copy the old settings, except of phy_type */
config.abilities = abilities.abilities;
-   config.link_speed = abilities.link_speed;
+   if (abilities.link_speed != 0)
+   config.link_speed = abilities.link_speed;
+   else
+   config.link_speed = speed;
config.eee_capability = abilities.eee_capability;
config.eeer = abilities.eeer_val;
config.low_power_ctrl = abilities.d3_lpan;
-- 
2.17.1



[net-next v2 13/15] i40e: hold the rtnl lock on clearing interrupt scheme

2018-08-30 Thread Jeff Kirsher
From: Patryk Małek 

Hold the rtnl lock when we're clearing interrupt scheme
in i40e_shutdown and in i40e_remove.

Signed-off-by: Patryk Małek 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c30feb27e1c0..112245f32d7d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -14182,6 +14182,7 @@ static void i40e_remove(struct pci_dev *pdev)
mutex_destroy(&hw->aq.asq_mutex);
 
/* Clear all dynamic memory lists of rings, q_vectors, and VSIs */
+   rtnl_lock();
i40e_clear_interrupt_scheme(pf);
for (i = 0; i < pf->num_alloc_vsi; i++) {
if (pf->vsi[i]) {
@@ -14190,6 +14191,7 @@ static void i40e_remove(struct pci_dev *pdev)
pf->vsi[i] = NULL;
}
}
+   rtnl_unlock();
 
for (i = 0; i < I40E_MAX_VEB; i++) {
kfree(pf->veb[i]);
@@ -14401,7 +14403,13 @@ static void i40e_shutdown(struct pci_dev *pdev)
wr32(hw, I40E_PFPM_WUFC,
 (pf->wol_en ? I40E_PFPM_WUFC_MAG_MASK : 0));
 
+   /* Since we're going to destroy queues during the
+* i40e_clear_interrupt_scheme() we should hold the RTNL lock for this
+* whole section
+*/
+   rtnl_lock();
i40e_clear_interrupt_scheme(pf);
+   rtnl_unlock();
 
if (system_state == SYSTEM_POWER_OFF) {
pci_wake_from_d3(pdev, pf->wol_en);
-- 
2.17.1



[net-next v2 15/15] i40e: Prevent deleting MAC address from VF when set by PF

2018-08-30 Thread Jeff Kirsher
From: Patryk Małek 

To prevent VF from deleting MAC address that was assigned by the
PF we need to check for that scenario when we try to delete a MAC
address from a VF.

Signed-off-by: Patryk Małek 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index f56c645588f3..3e707c7e6782 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -2569,6 +2569,16 @@ static int i40e_vc_del_mac_addr_msg(struct i40e_vf *vf, 
u8 *msg, u16 msglen)
ret = I40E_ERR_INVALID_MAC_ADDR;
goto error_param;
}
+
+   if (vf->pf_set_mac &&
+   ether_addr_equal(al->list[i].addr,
+vf->default_lan_addr.addr)) {
+   dev_err(&pf->pdev->dev,
+   "MAC addr %pM has been set by PF, cannot delete 
it for VF %d, reset VF to change MAC addr\n",
+   vf->default_lan_addr.addr, vf->vf_id);
+   ret = I40E_ERR_PARAM;
+   goto error_param;
+   }
}
vsi = pf->vsi[vf->lan_vsi_idx];
 
-- 
2.17.1



[net-next v2 04/15] i40evf: Change a VF mac without reloading the VF driver

2018-08-30 Thread Jeff Kirsher
From: Paweł Jabłoński 

Add possibility to change a VF mac address from host side
without reloading the VF driver on the guest side. Without
this patch it is not possible to change the VF mac because
executing i40evf_virtchnl_completion function with
VIRTCHNL_OP_GET_VF_RESOURCES opcode resets the VF mac
address to previous value.

Signed-off-by: Paweł Jabłoński 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c  |  8 +---
 drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c | 11 +--
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index c6d24eaede18..f56c645588f3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -2458,7 +2458,7 @@ static inline int i40e_check_vf_permission(struct i40e_vf 
*vf,
!is_multicast_ether_addr(addr) && vf->pf_set_mac &&
!ether_addr_equal(addr, vf->default_lan_addr.addr)) {
dev_err(&pf->pdev->dev,
-   "VF attempting to override administratively set 
MAC address, reload the VF driver to resume normal operation\n");
+   "VF attempting to override administratively set 
MAC address, bring down and up the VF interface to resume normal operation\n");
return -EPERM;
}
}
@@ -3873,9 +3873,11 @@ int i40e_ndo_set_vf_mac(struct net_device *netdev, int 
vf_id, u8 *mac)
 mac, vf_id);
}
 
-   /* Force the VF driver stop so it has to reload with new MAC address */
+   /* Force the VF interface down so it has to bring up with new MAC
+* address
+*/
i40e_vc_disable_vf(vf);
-   dev_info(&pf->pdev->dev, "Reload the VF driver to make this change 
effective.\n");
+   dev_info(&pf->pdev->dev, "Bring down and up the VF interface to make 
this change effective.\n");
 
 error_param:
return ret;
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
index 565677de5ba3..79b7be83b5c3 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
@@ -1330,8 +1330,15 @@ void i40evf_virtchnl_completion(struct i40evf_adapter 
*adapter,
  sizeof(struct virtchnl_vsi_resource);
memcpy(adapter->vf_res, msg, min(msglen, len));
i40e_vf_parse_hw_config(&adapter->hw, adapter->vf_res);
-   /* restore current mac address */
-   ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr);
+   if (is_zero_ether_addr(adapter->hw.mac.addr)) {
+   /* restore current mac address */
+   ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr);
+   } else {
+   /* refresh current mac address if changed */
+   ether_addr_copy(netdev->dev_addr, adapter->hw.mac.addr);
+   ether_addr_copy(netdev->perm_addr,
+   adapter->hw.mac.addr);
+   }
i40evf_process_config(adapter);
}
break;
-- 
2.17.1



[PATCH net] net/ipv6: Only update MTU metric if it set

2018-08-30 Thread dsahern
From: David Ahern 

Jan reported a regression after an update to 4.18.5. In this case ipv6
default route is setup by systemd-networkd based on data from an RA. The
RA contains an MTU of 1492 which is used when the route is first inserted
but then systemd-networkd pushes down updates to the default route
without the mtu set.

Prior to the change to fib6_info, metrics such as MTU were held in the
dst_entry and rt6i_pmtu in rt6_info contained an update to the mtu if
any. ip6_mtu would look at rt6i_pmtu first and use it if set. If not,
the value from the metrics is used if it is set and finally falling
back to the idev value.

After the fib6_info change metrics are contained in the fib6_info struct
and there is no equivalent to rt6i_pmtu. To maintain consistency with
the old behavior the new code should only reset the MTU in the metrics
if the route update has it set.

Fixes: d4ead6b34b67 ("net/ipv6: move metrics from dst to rt6_info")
Reported-by: Jan Janssen 
Signed-off-by: David Ahern 
---
 net/ipv6/ip6_fib.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index d212738e9d10..f43d278e0040 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -987,7 +987,10 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct 
fib6_info *rt,
fib6_clean_expires(iter);
else
fib6_set_expires(iter, rt->expires);
-   fib6_metric_set(iter, RTAX_MTU, rt->fib6_pmtu);
+
+   if (rt->fib6_pmtu)
+   fib6_metric_set(iter, RTAX_MTU,
+   rt->fib6_pmtu);
return -EEXIST;
}
/* If we have the same destination and the same metric,
-- 
2.15.2 (Apple Git-101.1)



Re: [PATCH net] net/ipv6: Only update MTU metric if it set

2018-08-30 Thread Jan Janssen
On Donnerstag, 30. August 2018 23:15:43 CEST dsah...@kernel.org wrote:
> From: David Ahern 
> 
> Jan reported a regression after an update to 4.18.5. In this case ipv6
> default route is setup by systemd-networkd based on data from an RA. The
> RA contains an MTU of 1492 which is used when the route is first inserted
> but then systemd-networkd pushes down updates to the default route
> without the mtu set.
> 
> Prior to the change to fib6_info, metrics such as MTU were held in the
> dst_entry and rt6i_pmtu in rt6_info contained an update to the mtu if
> any. ip6_mtu would look at rt6i_pmtu first and use it if set. If not,
> the value from the metrics is used if it is set and finally falling
> back to the idev value.
> 
> After the fib6_info change metrics are contained in the fib6_info struct
> and there is no equivalent to rt6i_pmtu. To maintain consistency with
> the old behavior the new code should only reset the MTU in the metrics
> if the route update has it set.
> 
> Fixes: d4ead6b34b67 ("net/ipv6: move metrics from dst to rt6_info")
> Reported-by: Jan Janssen 
> Signed-off-by: David Ahern 
> ---
>  net/ipv6/ip6_fib.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index d212738e9d10..f43d278e0040 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -987,7 +987,10 @@ static int fib6_add_rt2node(struct fib6_node *fn,
> struct fib6_info *rt, fib6_clean_expires(iter);
>   else
>   fib6_set_expires(iter, rt->expires);
> - fib6_metric_set(iter, RTAX_MTU, rt->fib6_pmtu);
> +
> + if (rt->fib6_pmtu)
> + fib6_metric_set(iter, RTAX_MTU,
> + rt->fib6_pmtu);
>   return -EEXIST;
>   }
>   /* If we have the same destination and the same metric,

I just tested this and can confirm that it fixes my problem.

You mentioned that systemd-networkd is doing something silly here, so I was 
wondering if you could make a bug report to fix the underlying issue? I don't 
mind doing it myself but you're the expert and can explain things better.

Thank you for fixing this.




Re: [PATCH 1/2] dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle

2018-08-30 Thread Tony Lindgren
* Grygorii Strashko  [180830 17:08]:
> On 08/29/2018 07:47 PM, Tony Lindgren wrote:
> > In general, it seems cpsw is just an interconnect instance
> > (L4_FAST) with a control module (CPSW_WR) and a pile of
> > independent other modules. That's described nicely in
> > am437x TRM chapter "2.1.4 L4 Fast Peripheral Memory Map".
> > So from that point of view the binding reg entries right
> > now are all wrong :)
> 
> TRM not consistent - for am5 it's one MMIO region.

Well that same information is there in 57xx TRM in chapter
"Table 26-1454. GMAC_SW Instance Summary". But yeah, all
the cpsw internal devices are stuffed into a single
interconnect target module.

> > In the long run cpsw should be really treated as an
> > interconnect instance with it's control module providing
> > standard Linux framework services such as clock /
> > regulator / phy / pinctrl / iio whatever for the other
> > modules.
> > 
> > Just my 2c based on looking at the interconnect, I'm
> > not too familiar with cpsw otherwise.
> 
> It's not separate modules. this is composite module which have only one
> fck/ick and most of blocks can't even function without each other.
> Above might be the case for Keystone 2, but not omap CPSW.
> Keystone 2 - has packet processor, security accelerator, queue manager in
> addition to its basic switch block.

Yeah there's just one fck/ick as it's all in a single
interconnect module. But you might want to look at the
CPSW_WR device registers and see what gate clocks and other
Linux generic subsystem services CPSW_WR could provide for
the other cpsw internal devices. It might just make your
life easier maintaining all these variants ;)

Regards,

Tony




Re: [PATCH net] net/ipv6: Only update MTU metric if it set

2018-08-30 Thread David Ahern
On 8/30/18 3:49 PM, Jan Janssen wrote:
> You mentioned that systemd-networkd is doing something silly here, so I was 
> wondering if you could make a bug report to fix the underlying issue? I don't 
> mind doing it myself but you're the expert and can explain things better.

It is silly in the sense that systemd-networkd installs a default route
with an MTU and then pushes down updates (route replace) without the MTU.


[bpf PATCH] bpf: avoid kcm psock use and tcp_bpf from colliding

2018-08-30 Thread John Fastabend
Currently we check sk_user_data is non NULL to determine if the sk
exists in a map. However, this is not sufficient to ensure the psock
is not in use by another (non-ULP TCP) user, such as kcm. To avoid
this when adding a sock to a map also verify it is of the correct ULP
type.

Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index ce63e58..1c05794 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1808,6 +1808,11 @@ static int sock_map_delete_elem(struct bpf_map *map, 
void *key)
return 0;
 }
 
+static bool psock_is_smap_sk(struct sock *sk)
+{
+   return inet_csk(sk)->icsk_ulp_ops == &bpf_tcp_ulp_ops;
+}
+
 /* Locking notes: Concurrent updates, deletes, and lookups are allowed and are
  * done inside rcu critical sections. This ensures on updates that the psock
  * will not be released via smap_release_sock() until concurrent 
updates/deletes
@@ -1892,6 +1897,10 @@ static int __sock_map_ctx_update_elem(struct bpf_map 
*map,
 * doesn't update user data.
 */
if (psock) {
+   if (!psock_is_smap_sk(sock)) {
+   err = -EBUSY;
+   goto out_progs;
+   }
if (READ_ONCE(psock->bpf_parse) && parse) {
err = -EBUSY;
goto out_progs;



[RFC PATCH] bpf: test case for kcm + sockmap

2018-08-30 Thread John Fastabend
This finds the issue with kcm and sockmap where when a kcm socket is
added to a sockmap we fail to notice it is a kcm sock and not a tcp
bpf ULP sock. This results in a refcount_inc_not_zero() being called
on the psock which is assigned to a smap_psock incorrectly. On my
system it just happens to fail there and returns EAGAIN.

With the fix, just submitted for bpf tree, we get the correct EBUSY
error.

https://patchwork.ozlabs.org/project/netdev/list/?series=63393

RFC for now and will submit properly when bpf fix makes its way into
bpf-next. But, wanted to get the test on the list for folks to look
at.

---

John Fastabend (1):
  bpf: test_maps add a test to catch kcm + sockmap


 tools/testing/selftests/bpf/Makefile|2 -
 tools/testing/selftests/bpf/test_maps.c |   64 ++-
 2 files changed, 63 insertions(+), 3 deletions(-)

--
Signature


[RFC PATCH] bpf: test_maps add a test to catch kcm + sockmap

2018-08-30 Thread John Fastabend
RFC for now, once the fix is in bpf-next will submit there.

Adding a socket to both sockmap and kcm is not supported due to
collision on sk_user_data usage. At some point we will fix this
but until then lets ensure we don't corrupt things.

Now if selftests is run without KCM support we will issue a
warning and continue with the tests.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/Makefile|2 -
 tools/testing/selftests/bpf/test_maps.c |   64 ++-
 2 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index fff7fb1..16f5dcc 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-   test_skb_cgroup_id_kern.o
+   test_skb_cgroup_id_kern.o sockmap_kcm.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 6f54f84..473b690 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -479,14 +480,16 @@ static void test_devmap(int task, void *data)
 #define SOCKMAP_PARSE_PROG "./sockmap_parse_prog.o"
 #define SOCKMAP_VERDICT_PROG "./sockmap_verdict_prog.o"
 #define SOCKMAP_TCP_MSG_PROG "./sockmap_tcp_msg_prog.o"
+#define KCM_PROG "./sockmap_kcm.o"
 static void test_sockmap(int tasks, void *data)
 {
struct bpf_map *bpf_map_rx, *bpf_map_tx, *bpf_map_msg, *bpf_map_break;
-   int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break;
+   int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break, kcm;
int ports[] = {50200, 50201, 50202, 50204};
int err, i, fd, udp, sfd[6] = {0xdeadbeef};
u8 buf[20] = {0x0, 0x5, 0x3, 0x2, 0x1, 0x0};
-   int parse_prog, verdict_prog, msg_prog;
+   int parse_prog, verdict_prog, msg_prog, kcm_prog;
+   struct kcm_attach attach_info;
struct sockaddr_in addr;
int one = 1, s, sc, rc;
struct bpf_object *obj;
@@ -740,6 +743,62 @@ static void test_sockmap(int tasks, void *data)
goto out_sockmap;
}
 
+   /* Test adding a KCM socket into map */
+#define AF_KCM 41
+   kcm = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED);
+   if (kcm == -1) {
+   printf("Warning, KCM+Sockmap could not be tested.\n");
+   goto skip_kcm;
+   }
+
+   err = bpf_prog_load(KCM_PROG,
+   BPF_PROG_TYPE_SOCKET_FILTER,
+   &obj, &kcm_prog);
+   if (err) {
+   printf("Failed to load SK_SKB parse prog\n");
+   goto out_sockmap;
+   }
+
+   i = 2;
+   memset(&attach_info, 0, sizeof(attach_info));
+   attach_info.fd = sfd[i];
+   attach_info.bpf_fd = kcm_prog;
+   err = ioctl(kcm, SIOCKCMATTACH, &attach_info);
+   if (!err) {
+   perror("Failed KCM attached to sockmap fd: ");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_delete_elem(fd, &i);
+   if (err) {
+   printf("Failed delete sockmap from empty map %i %i\n", err, 
errno);
+   goto out_sockmap;
+   }
+
+   err = ioctl(kcm, SIOCKCMATTACH, &attach_info);
+   if (err) {
+   perror("Failed KCM attach");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
+   if (!err) {
+   printf("Failed sockmap attached KCM sock!\n");
+   goto out_sockmap;
+   }
+   err = ioctl(kcm, SIOCKCMUNATTACH, &attach_info);
+   if (err) {
+   printf("Failed detattch KCM sock!\n");
+   goto out_sockmap;
+   }
+
+   err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY);
+   if (err) {
+   printf("Failed post-kcm update sockmap '%i:%i'\n",
+  i, sfd[i]);
+   goto out_sockmap;
+   }
+
/* Test map update elem afterwards fd lives in fd and map_fd */
for (i = 0; i < 6; i++) {
err = bpf_map_update_elem(map_fd_rx, &i, &sfd[i], BPF_ANY);
@@ -772,6 +831,7 @@ static void test_sockmap(int tasks, void *data)
}
}
 
+skip_kcm:
/* Put sfd[2] (sending fd below) into msg map to test sendmsg bpf */
i = 0;
err = bpf_map_update_elem(map_fd_msg, &i, &sfd[2], BPF_ANY);



Re: [bpf PATCH] bpf: avoid kcm psock use and tcp_bpf from colliding

2018-08-30 Thread John Fastabend
On 08/30/2018 03:33 PM, John Fastabend wrote:
> Currently we check sk_user_data is non NULL to determine if the sk
> exists in a map. However, this is not sufficient to ensure the psock
> is not in use by another (non-ULP TCP) user, such as kcm. To avoid
> this when adding a sock to a map also verify it is of the correct ULP
> type.
> 
> Signed-off-by: John Fastabend 
> ---

I'll send a v2 of this we have one more spot we need to check
the psock type in the error path.

Thanks,
John


[PATCH bpf-next] xsk: remove unnecessary assignment

2018-08-30 Thread Prashant Bhole
Since xdp_umem_query() was added one assignment of bpf.command was
missed from cleanup. Removing the assignment statement.

Fixes: 84c6b86875e01a0 ("xsk: don't allow umem replace at stack level")
Signed-off-by: Prashant Bhole 
---
 net/xdp/xdp_umem.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index bfe2dbea480b..d179732617dc 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -76,8 +76,6 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct 
net_device *dev,
if (!dev->netdev_ops->ndo_bpf || !dev->netdev_ops->ndo_xsk_async_xmit)
return force_zc ? -EOPNOTSUPP : 0; /* fail or fallback */
 
-   bpf.command = XDP_QUERY_XSK_UMEM;
-
rtnl_lock();
err = xdp_umem_query(dev, queue_id);
if (err) {
-- 
2.17.1




[PATCH bpf-next] samples/bpf: xdpsock, minor fixes

2018-08-30 Thread Prashant Bhole
- xsks_map size was fixed to 4, changed it MAX_SOCKS
- Remove redundant definition of MAX_SOCKS in xdpsock_user.c
- In dump_stats(), add NULL check for xsks[i]

Signed-off-by: Prashant Bhole 
---
 samples/bpf/xdpsock_kern.c | 2 +-
 samples/bpf/xdpsock_user.c | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
index d8806c41362e..b8ccd0802b3f 100644
--- a/samples/bpf/xdpsock_kern.c
+++ b/samples/bpf/xdpsock_kern.c
@@ -16,7 +16,7 @@ struct bpf_map_def SEC("maps") xsks_map = {
.type = BPF_MAP_TYPE_XSKMAP,
.key_size = sizeof(int),
.value_size = sizeof(int),
-   .max_entries = 4,
+   .max_entries = MAX_SOCKS,
 };
 
 struct bpf_map_def SEC("maps") rr_map = {
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index b3906111bedb..57ecadc58403 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -118,7 +118,6 @@ struct xdpsock {
unsigned long prev_tx_npkts;
 };
 
-#define MAX_SOCKS 4
 static int num_socks;
 struct xdpsock *xsks[MAX_SOCKS];
 
@@ -596,7 +595,7 @@ static void dump_stats(void)
 
prev_time = now;
 
-   for (i = 0; i < num_socks; i++) {
+   for (i = 0; i < num_socks && xsks[i]; i++) {
char *fmt = "%-15s %'-11.0f %'-11lu\n";
double rx_pps, tx_pps;
 
-- 
2.17.1




Re: [PATCH bpf-next 0/3] xsk: misc code cleanup

2018-08-30 Thread Alexei Starovoitov
On Thu, Aug 30, 2018 at 03:56:40PM +0200, Magnus Karlsson wrote:
> This patch set cleans up two code style issues with the xsk zero-copy
> code. The resulting code is smaller and simpler.
> 
> Patch 1: Removes a potential compiler warning reported by the Intel
>  0-DAY kernel test infrastructure.
> Patches 2-3: Removes the xdp_umem_props structure. At some point, it
>  was used to break a dependency, but the members are these
>  days much better off in the xdp_umem since the dependency
>  does not exist anymore.
> 
> I based this patch set on bpf-next commit 234dbe3dc1db ("Merge branch
> 'verifier-liveness-simplification'")

Applied, Thanks



Re: [PATCH v3 bpf-next 0/2] bpf tcp save syn set/get sockoptions

2018-08-30 Thread Alexei Starovoitov
On Thu, Aug 30, 2018 at 07:51:52AM -0700, Nikita V. Shirokov wrote:
> 
> adding supprot for two new bpf's tcp sockopts:
> TCP_SAVE_SYN (set) and TCP_SAVED_SYN (get)
> this would allow for tcp-bpf program to build some logic based on fields from
> ingress syn packet (e.g. doing tcp's tos/tclass reflection (see sample prog))
> and do it transparently from userspace program point of view

Applied, Thanks
but please convert the sample code into selftest.



[PATCH net-next] net: remove duplicated include from net_failover.c

2018-08-30 Thread YueHaibing
Remove duplicated include.

Signed-off-by: YueHaibing 
---
 drivers/net/net_failover.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
index 7ae1856..192ae1c 100644
--- a/drivers/net/net_failover.c
+++ b/drivers/net/net_failover.c
@@ -19,7 +19,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 



[PATCH net-next] failover: remove set but not used variable 'primary_dev'

2018-08-30 Thread YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning:

drivers/net/net_failover.c: In function 'net_failover_slave_unregister':
drivers/net/net_failover.c:598:35: warning:
 variable 'primary_dev' set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing 
---
 drivers/net/net_failover.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
index 7ae1856..e103c94e 100644
--- a/drivers/net/net_failover.c
+++ b/drivers/net/net_failover.c
@@ -595,12 +595,11 @@ static int net_failover_slave_pre_unregister(struct 
net_device *slave_dev,
 static int net_failover_slave_unregister(struct net_device *slave_dev,
 struct net_device *failover_dev)
 {
-   struct net_device *standby_dev, *primary_dev;
+   struct net_device *standby_dev;
struct net_failover_info *nfo_info;
bool slave_is_standby;
 
nfo_info = netdev_priv(failover_dev);
-   primary_dev = rtnl_dereference(nfo_info->primary_dev);
standby_dev = rtnl_dereference(nfo_info->standby_dev);
 
vlan_vids_del_by_dev(slave_dev, failover_dev);



  1   2   >