Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread Paweł Staszewski




On 01.11.2018 at 22:18, Paweł Staszewski wrote:



On 01.11.2018 at 21:37, Saeed Mahameed wrote:

On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:

On 01.11.2018 at 10:50, Saeed Mahameed wrote:

On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:

Hi

So maybe someone will be interested in how the Linux kernel handles
normal traffic (not pktgen :) )


Server HW configuration:

CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)


Server software:

FRR - as routing daemon

enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to local
numa node)

enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to local
numa node)


Maximum traffic that server can handle:

Bandwidth

    bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
    input: /proc/net/dev type: rate
    \ iface                Rx              Tx           Total
    =========================================================
    enp175s0f1:      28.51 Gb/s      37.24 Gb/s      65.74 Gb/s
    enp175s0f0:      38.07 Gb/s      28.44 Gb/s      66.51 Gb/s
    ---------------------------------------------------------
    total:           66.58 Gb/s      65.67 Gb/s     132.25 Gb/s


Packets per second:

    bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
    input: /proc/net/dev type: rate
    - iface                Rx                 Tx                Total
    =================================================================
    enp175s0f1:     5248589.00 P/s     3486617.75 P/s     8735207.00 P/s
    enp175s0f0:     3557944.25 P/s     5232516.00 P/s     8790460.00 P/s
    -----------------------------------------------------------------
    total:          8806533.00 P/s     8719134.00 P/s    17525668.00 P/s


After reaching those limits, the NICs on the upstream side (which carry
more RX traffic) start to drop packets.


I just don't understand why the server can't handle more bandwidth
(~40 Gbit/s is the limit where all CPUs hit 100% utilization) - while pps
on the RX side keep increasing.


Where do you see 40 Gb/s ? You showed that both ports on the same NIC
(same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25
Gb/s, which aligns with your pcie link limit. What am I missing ?

Hmm, yes, that was my concern also - I can't find information anywhere
on whether that bandwidth figure is uni- or bidirectional. So if
126 Gbit/s for x16 8GT/s is unidirectional, then bidirectional would be
126/2 ≈ 63 Gbit/s each way - which would fit the total bandwidth on
both ports.

I think it is bidirectional.
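As a quick sanity check on the numbers being debated here (an illustrative calculation, not part of the original thread), the per-direction PCIe Gen3 x16 figure follows from the standard link parameters:

```python
# Back-of-the-envelope check of the PCIe Gen3 x16 bandwidth figure
# discussed above. The constants are standard PCIe Gen3 parameters.
GT_PER_S = 8          # raw transfer rate per lane (GT/s)
LANES = 16
ENCODING = 128 / 130  # Gen3 uses 128b/130b line encoding

per_direction_gbit = GT_PER_S * LANES * ENCODING
print(f"{per_direction_gbit:.2f} Gbit/s per direction")  # 126.03 Gbit/s per direction
```

PCIe is nominally full duplex, so in principle this ~126 Gbit/s figure applies to each direction independently; whether the practical limit in this setup behaves bidirectionally is exactly what the thread is debating.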


This could maybe also explain why CPU load rises rapidly from
120 Gbit/s total to 132 Gbit/s (the bwm-ng counters come from /proc/net,
so there can be some error in reading them when offloading (gro/gso/tso)
is enabled on the NICs).


I was thinking that maybe I had reached some PCIe x16 limit - but x16
8GT/s is 126 Gbit/s - and also when testing with pktgen I can reach more
bandwidth and pps (like 4x more compared to normal internet traffic).


Are you forwarding when using pktgen as well, or are you just testing
the RX side pps ?

Yes, pktgen was tested on a single port, RX only.
I can also check forwarding to rule out PCIe limits.


So this explains why you get more RX pps: since TX is idle, the PCIe
link is free to do only RX.

[...]



ethtool -S enp175s0f1
NIC statistics:
    rx_packets: 173730800927
    rx_bytes: 99827422751332
    tx_packets: 142532009512
    tx_bytes: 184633045911222
    tx_tso_packets: 25989113891
    tx_tso_bytes: 132933363384458
    tx_tso_inner_packets: 0
    tx_tso_inner_bytes: 0
    tx_added_vlan_packets: 74630239613
    tx_nop: 2029817748
    rx_lro_packets: 0
    rx_lro_bytes: 0
    rx_ecn_mark: 0
    rx_removed_vlan_packets: 173730800927
    rx_csum_unnecessary: 0
    rx_csum_none: 434357
    rx_csum_complete: 173730366570
    rx_csum_unnecessary_inner: 0
    rx_xdp_drop: 0
    rx_xdp_redirect: 0
    rx_xdp_tx_xmit: 0
    rx_xdp_tx_full: 0
    rx_xdp_tx_err: 0
    rx_xdp_tx_cqe: 0
    tx_csum_none: 38260960853
    tx_csum_partial: 36369278774
    tx_csum_partial_inner: 0
    tx_queue_stopped: 1
    tx_queue_dropped: 0
    tx_xmit_more: 748638099
    tx_recover: 0
    tx_cqes: 73881645031
    tx_queue_wake: 1
    tx_udp_seg_rem: 0
    tx_cqe_err: 0
    tx_xdp_xmit: 0
    tx_xdp_full: 0
    tx_xdp_err: 0
    tx_xdp_cqes: 0
    rx_wqe_err: 0
    rx_mpwqe_filler_cqes: 0
    rx_mpwqe_filler_strides: 0
    rx_buff_alloc_err: 0
    rx_cqe_compress_blks: 0
    rx_cqe_compress_pkts: 0

If this is a PCIe bottleneck, it might be useful to enable CQE
compression (to reduce PCIe completion descriptor transactions).
You should see the above rx_cqe_compress_pkts counter increasing when
it is enabled.

$ ethtool --set-priv-flags enp175s0f1 rx_cqe_compress on
$ ethtool --show-priv-flags
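To see whether compression actually takes effect, one could compare the counters before and after enabling the flag. A hypothetical helper (the parsing logic and the sample values below are mine, not from the thread) that computes the share of RX packets delivered via compressed CQEs from `ethtool -S`-style output:

```python
# Hypothetical helper: parse `ethtool -S`-style output and report the
# fraction of RX packets that arrived via compressed CQEs.
def cqe_compress_ratio(ethtool_output: str) -> float:
    stats = {}
    for line in ethtool_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            stats[key.strip()] = int(value.strip())
        except ValueError:
            pass  # skip non-numeric lines such as the "NIC statistics:" header
    compressed = stats.get("rx_cqe_compress_pkts", 0)
    total = stats.get("rx_packets", 0)
    return compressed / total if total else 0.0

# Made-up sample for illustration only.
sample = """NIC statistics:
    rx_packets: 1000
    rx_cqe_compress_pkts: 250
"""
print(cqe_compress_ratio(sample))  # 0.25
```

Running such a helper on two snapshots taken a few seconds apart would show whether rx_cqe_compress_pkts is growing along with rx_packets.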

Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-11-01 Thread Stephen Hemminger
On Thu, 1 Nov 2018 15:18:03 -0600
David Ahern  wrote:

> On 11/1/18 3:06 PM, Jakub Kicinski wrote:
> > On Wed, 31 Oct 2018 20:48:05 -0600, David Ahern wrote:  
> >>>   spacing with a special character in the format string, that is:
> >>>
> >>>   "%S.%Qr.%Qs  %Al:%Pl %Ar:%Pr  %p\n"
> >>>
> >>>   would mean "align everything to the right, distribute remaining
> >>>   whitespace between %S, %Qr and %Qs". But it looks rather complicated
> >>>   at a glance.
> >>> 
> >>
> >> My concern here is that once this goes in for 1 command, the others in
> >> iproute2 need to follow suit - meaning same syntax style for all
> >> commands. Given that I'd prefer we get a reasonable consensus on syntax
> >> that will work across commands -- ss, ip, tc. If it is as simple as
> >> column names with a fixed order, that is fine but just give proper
> >> consideration given the impact.  
> > 
> > FWIW I just started piping iproute2 commands to jq.  Example:
> > 
> > tc -s -j qdisc show dev em1 | \
> > jq -r '.[] |  
> > [.kind,.parent,.handle,.offloaded,.bytes,.packets,.drops,.overlimits,.requeues,.backlog,.qlen,.marked]
> >  | @tsv'
> > 
> > JSONification would probably be quite an undertaking for ss :(
> >   
> 
> Right, that is used in some of the scripts under
> tools/testing/selftests. I would put that in the 'heavyweight solution'
> category.
> 
> A number of key commands offer the capability to control the output via
> command line argument (e.g., ps, perf script). Given the amount of data
> iproute2 commands throw at a user by default, it would be a good
> usability feature to allow a user to customize the output without having
> to pipe it into other commands.

I would rather see ss grow JSON support than have the output
formatting of every iproute2 command grow its own new format management.


The jq tool looks cool, and I can see how someone could easily have
a bunch of mini-scripts to do what they want.
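For readers without jq, the field-picking idea quoted earlier can be sketched in Python against `tc -s -j qdisc show` JSON (the sample record below is made up for illustration; real output has more fields):

```python
import json

# Made-up sample resembling one entry of `tc -s -j qdisc show dev em1`.
sample = json.loads("""[
  {"kind": "mq", "parent": "root", "handle": "0:",
   "bytes": 1234, "packets": 10, "drops": 0}
]""")

# Same idea as the quoted jq filter: pick fields, emit tab-separated rows.
fields = ["kind", "parent", "handle", "bytes", "packets", "drops"]
for qdisc in sample:
    print("\t".join(str(qdisc.get(f, "")) for f in fields))
```

This is essentially what `| @tsv` does in the jq pipeline, which is why a per-command column-selection syntax is being weighed against just emitting JSON everywhere.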


Re: [PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset

2018-11-01 Thread David Ahern
On 11/1/18 7:42 AM, Alexey Kodanev wrote:
> On 11/01/2018 04:11 PM, Alexey Kodanev wrote:
>> On 10/31/2018 08:35 PM, David Ahern wrote:
>>> On 10/31/18 10:55 AM, David Ahern wrote:
 I think the simplest fix for 4.20 is to break the loop if ret is non-0 -
 restore the previous behavior. 
>>>
>>> that is the only recourse. It has to bail if ret is non-0. Do you want
>>> to send a patch with that fix?
>>>
>>
>> I see, and inet6_dump_fib() cleans up the fib6_walker if ret is zero. Will send 
>> the fix.
> 
> Can it happen that inet6_dump_fib() returns skb->len (0) in the below cases?
> 
> * if (arg.filter.flags & RTM_F_CLONED)
>   return skb->len;
> 
> ...
> 
>   w = (void *)cb->args[2];
>   if (!w) {
>   ...
>   w = kzalloc(...)
> ...
> 
> * if (arg.filter.table_id) {
> ...
>   if (!tb) {
>   if (arg.filter.dump_all_families)
>   return skb->len;
> 
> 
> Would it be safer to add "res = skb->len; goto out;" instead of "return 
> skb->len;"
> so that it can call fib6_dump_end() for "res <= 0"? Or use cb->data instead of
> cb->args?
> 

Since res is initialized to 0, both of those can just be 'goto out;'
The break in dump_all is still needed though.


Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets

2018-11-01 Thread Willem de Bruijn
On Wed, Oct 31, 2018 at 7:30 PM Christoph Paasch  wrote:
>
> Implementations of Quic might want to create a separate socket for each
> Quic-connection by creating a connected UDP-socket.
>
> To achieve that on the server-side, a "master-socket" needs to wait for
> incoming new connections and then creates a new socket that will be a
> connected UDP-socket. To create that latter one, the server needs to
> first bind() and then connect(). However, after the bind() the server
> might already receive traffic on that new socket that is unrelated to the
> Quic-connection at hand.

This can also be achieved with SO_REUSEPORT_BPF and a filter
that only selects the listener socket(s) in the group. The connect
call should call udp_lib_rehash and take the connected socket out
of the reuseport listener group. Though admittedly that is more
elaborate than setting a boolean socket option.

> The ideas for the implementation came up after a discussion with Ian
> and Jana re: their implementation of a QUIC server.

That might have preceded SO_TXTIME? AFAIK traffic shaping was the
only real reason to prefer connected sockets.
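The bind()-then-connect() sequence described above can be sketched with plain UDP sockets (a minimal, Linux-specific illustration using SO_REUSEPORT and loopback addresses; no actual QUIC involved):

```python
import socket

# "Master" socket that receives the first packets of new flows.
master = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
master.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
master.bind(("127.0.0.1", 0))
port = master.getsockname()[1]

# A peer sends the first datagram of a new "connection".
peer = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
peer.bind(("127.0.0.1", 0))
peer.sendto(b"hello", ("127.0.0.1", port))
data, peer_addr = master.recvfrom(2048)

# Per-connection socket: bind() to the same local port, then connect()
# to the peer. Between bind() and connect() this socket could receive
# unrelated traffic -- the race the patch set aims to close.
conn = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
conn.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
conn.bind(("127.0.0.1", port))
conn.connect(peer_addr)
print(conn.getpeername() == peer_addr)  # True
```

After connect(), the kernel filters datagrams to `conn` by the 4-tuple; the proposal under discussion is about closing the window between the bind() and the connect() above.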


[PATCH v2 net] net: dsa: microchip: initialize mutex before use

2018-11-01 Thread Tristram.Ha
From: Tristram Ha 

Initialize mutex before use.  Avoid kernel complaint when
CONFIG_DEBUG_LOCK_ALLOC is enabled.

Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
Signed-off-by: Tristram Ha 
Reviewed-by: Pavel Machek 
Reviewed-by: Andrew Lunn 
---
v2
- Add endorsements

v1
- Remove comment

 drivers/net/dsa/microchip/ksz_common.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/microchip/ksz_common.c 
b/drivers/net/dsa/microchip/ksz_common.c
index 54e0ca6..86b6464 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -1117,11 +1117,6 @@ static int ksz_switch_init(struct ksz_device *dev)
 {
int i;
 
-   mutex_init(&dev->reg_mutex);
-   mutex_init(&dev->stats_mutex);
-   mutex_init(&dev->alu_mutex);
-   mutex_init(&dev->vlan_mutex);
-
dev->ds->ops = &ksz_switch_ops;
 
for (i = 0; i < ARRAY_SIZE(ksz_switch_chips); i++) {
@@ -1206,6 +1201,11 @@ int ksz_switch_register(struct ksz_device *dev)
if (dev->pdata)
dev->chip_id = dev->pdata->chip_id;
 
+   mutex_init(&dev->reg_mutex);
+   mutex_init(&dev->stats_mutex);
+   mutex_init(&dev->alu_mutex);
+   mutex_init(&dev->vlan_mutex);
+
if (ksz_switch_detect(dev))
return -EINVAL;
 
-- 
1.9.1



Re: [PATCH v2 net] net: dsa: microchip: initialize mutex before use

2018-11-01 Thread Florian Fainelli
On 11/1/18 3:08 PM, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> Initialize mutex before use.  Avoid kernel complaint when
> CONFIG_DEBUG_LOCK_ALLOC is enabled.
> 
> Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
> Signed-off-by: Tristram Ha 
> Reviewed-by: Pavel Machek 
> Reviewed-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 

> ---
> v2
> - Add endorsements

FWIW, David uses patchwork, which automatically collects those tags into
the patch whenever we reply with one of the recognized/supported tags. Thanks!
-- 
Florian


Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets

2018-11-01 Thread Christoph Paasch
On 01/11/18 - 17:51:39, Willem de Bruijn wrote:
> On Wed, Oct 31, 2018 at 7:30 PM Christoph Paasch  wrote:
> >
> > Implementations of Quic might want to create a separate socket for each
> > Quic-connection by creating a connected UDP-socket.
> >
> > To achieve that on the server-side, a "master-socket" needs to wait for
> > incoming new connections and then creates a new socket that will be a
> > connected UDP-socket. To create that latter one, the server needs to
> > first bind() and then connect(). However, after the bind() the server
> > might already receive traffic on that new socket that is unrelated to the
> > Quic-connection at hand.
> 
> This can also be achieved with SO_REUSEPORT_BPF and a filter
> that only selects the listener socket(s) in the group. The connect
> call should call udp_lib_rehash and take the connected socket out
> of the reuseport listener group. Though admittedly that is more
> elaborate than setting a boolean socket option.

Yeah, that indeed would be quite a bit more elaborate ;-)

A simple socket-option is much easier.


Cheers,
Christoph

> 
> > The ideas for the implementation came up after a discussion with Ian
> > and Jana re: their implementation of a QUIC server.
> 
> That might have preceded SO_TXTIME? AFAIK traffic shaping was the
> only real reason to prefer connected sockets.


[PATCH iproute2] Include bsd/string.h only in include/utils.h

2018-11-01 Thread Luca Boccassi
This is simpler and cleaner, and avoids having to include the header
from every file where the functions are used. The prototypes of the
internal implementation are in this header, so utils.h will have to be
included anyway for those.

Fixes: 508f3c231efb ("Use libbsd for strlcpy if available")

Signed-off-by: Luca Boccassi 
---
 genl/ctrl.c   | 3 ---
 include/utils.h   | 4 
 ip/iplink.c   | 3 ---
 ip/ipnetns.c  | 3 ---
 ip/iproute_lwtunnel.c | 3 ---
 ip/ipvrf.c| 3 ---
 ip/ipxfrm.c   | 3 ---
 ip/tunnel.c   | 3 ---
 ip/xfrm_state.c   | 3 ---
 lib/bpf.c | 3 ---
 lib/fs.c  | 3 ---
 lib/inet_proto.c  | 3 ---
 misc/ss.c | 3 ---
 tc/em_ipset.c | 3 ---
 tc/m_pedit.c  | 3 ---
 15 files changed, 4 insertions(+), 42 deletions(-)

diff --git a/genl/ctrl.c b/genl/ctrl.c
index fef6aaa9..616a 100644
--- a/genl/ctrl.c
+++ b/genl/ctrl.c
@@ -18,9 +18,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 
 #include "utils.h"
 #include "genl_utils.h"
diff --git a/include/utils.h b/include/utils.h
index 685d2c1d..bf6dea23 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -9,6 +9,10 @@
 #include 
 #include 
 
+#ifdef HAVE_LIBBSD
+#include 
+#endif
+
 #include "libnetlink.h"
 #include "ll_map.h"
 #include "rtm_map.h"
diff --git a/ip/iplink.c b/ip/iplink.c
index 067f5409..b5519201 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -24,9 +24,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 #include 
 #include 
 #include 
diff --git a/ip/ipnetns.c b/ip/ipnetns.c
index da019d76..0eac18cf 100644
--- a/ip/ipnetns.c
+++ b/ip/ipnetns.c
@@ -8,9 +8,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 #include 
 #include 
 #include 
diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c
index 2285bc1d..8f497015 100644
--- a/ip/iproute_lwtunnel.c
+++ b/ip/iproute_lwtunnel.c
@@ -16,9 +16,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 #include 
 #include 
 #include 
diff --git a/ip/ipvrf.c b/ip/ipvrf.c
index 8572b4f2..8a6b7f97 100644
--- a/ip/ipvrf.c
+++ b/ip/ipvrf.c
@@ -21,9 +21,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 #include 
 #include 
 #include 
diff --git a/ip/ipxfrm.c b/ip/ipxfrm.c
index b02f30a6..17ab4abe 100644
--- a/ip/ipxfrm.c
+++ b/ip/ipxfrm.c
@@ -28,9 +28,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 #include 
 #include 
 #include 
diff --git a/ip/tunnel.c b/ip/tunnel.c
index 73abb2e2..d0d55f37 100644
--- a/ip/tunnel.c
+++ b/ip/tunnel.c
@@ -24,9 +24,6 @@
 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 #include 
 #include 
 #include 
diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
index 18e0c6fa..e8c01746 100644
--- a/ip/xfrm_state.c
+++ b/ip/xfrm_state.c
@@ -27,9 +27,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 #include 
 #include "utils.h"
 #include "xfrm.h"
diff --git a/lib/bpf.c b/lib/bpf.c
index 35d7c45a..45f279fa 100644
--- a/lib/bpf.c
+++ b/lib/bpf.c
@@ -15,9 +15,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 #include 
 #include 
 #include 
diff --git a/lib/fs.c b/lib/fs.c
index af36bea0..86efd4ed 100644
--- a/lib/fs.c
+++ b/lib/fs.c
@@ -20,9 +20,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 #include 
 #include 
 
diff --git a/lib/inet_proto.c b/lib/inet_proto.c
index b379d8f8..0836a4c9 100644
--- a/lib/inet_proto.c
+++ b/lib/inet_proto.c
@@ -18,9 +18,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 
 #include "rt_names.h"
 #include "utils.h"
diff --git a/misc/ss.c b/misc/ss.c
index c472fbd9..4d12fb5d 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -19,9 +19,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 #include 
 #include 
 #include 
diff --git a/tc/em_ipset.c b/tc/em_ipset.c
index 550b2101..48b287f5 100644
--- a/tc/em_ipset.c
+++ b/tc/em_ipset.c
@@ -20,9 +20,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 #include 
 #include 
 
diff --git a/tc/m_pedit.c b/tc/m_pedit.c
index baacc80d..2aeb56d9 100644
--- a/tc/m_pedit.c
+++ b/tc/m_pedit.c
@@ -23,9 +23,6 @@
 #include 
 #include 
 #include 
-#ifdef HAVE_LIBBSD
-#include 
-#endif
 #include 
 #include "utils.h"
 #include "tc_util.h"
-- 
2.19.1



[PATCH net 0/2] net: timeout fixes for GENET and SYSTEMPORT

2018-11-01 Thread Florian Fainelli
Hi David,

This patch series fixes occasional transmit timeout around the time
the system goes into suspend. GENET and SYSTEMPORT have nearly the same
logic in that regard and were both affected in the same way.

Please queue up for stable, thanks!

Doug Berger (1):
  net: bcmgenet: protect stop from timeout

Florian Fainelli (1):
  net: systemport: Protect stop from timeout

 drivers/net/ethernet/broadcom/bcmsysport.c | 15 +++
 drivers/net/ethernet/broadcom/genet/bcmgenet.c | 13 +++--
 2 files changed, 14 insertions(+), 14 deletions(-)

-- 
2.17.1



[PATCH net 2/2] net: systemport: Protect stop from timeout

2018-11-01 Thread Florian Fainelli
A timing hazard exists when the network interface is stopped that
allows a watchdog timeout to be processed by a separate core in
parallel. This creates the potential for the timeout handler to
wake the queues while the driver is shutting down, or access
registers after their clocks have been removed.

The more common case is that the watchdog timeout will produce a
warning message which doesn't lead to a crash. The chances of this
are greatly increased by the fact that bcm_sysport_netif_stop stops
the transmit queues, which can easily precipitate a watchdog timeout
because of stale trans_start data in the queues.

This commit corrects the behavior by ensuring that the watchdog
timeout is disabled before entering bcm_sysport_netif_stop. There
are currently only two users of the bcm_sysport_netif_stop function:
close and suspend.

The close case already handles the issue by exiting the RUNNING
state before invoking the driver close service.

The suspend case now performs the netif_device_detach to exit the
PRESENT state before the call to bcm_sysport_netif_stop rather than
after it.

These behaviors prevent any future scheduling of the driver timeout
service during the window. The netif_tx_stop_all_queues function
in bcm_sysport_netif_stop is replaced with netif_tx_disable to ensure
synchronization with any transmit or timeout threads that may
already be executing on other cores.

For symmetry, the netif_device_attach call upon resume is moved to
after the call to bcm_sysport_netif_start. Since it wakes the transmit
queues it is not necessary to invoke netif_tx_start_all_queues from
bcm_sysport_netif_start so it is moved into the driver open service.

Fixes: 40755a0fce17 ("net: systemport: add suspend and resume support")
Fixes: 80105befdb4b ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC 
driver")
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/broadcom/bcmsysport.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 4122553e224b..0e2d99c737e3 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -1902,9 +1902,6 @@ static void bcm_sysport_netif_start(struct net_device 
*dev)
intrl2_1_mask_clear(priv, 0x);
else
intrl2_0_mask_clear(priv, INTRL2_0_TDMA_MBDONE_MASK);
-
-   /* Last call before we start the real business */
-   netif_tx_start_all_queues(dev);
 }
 
 static void rbuf_init(struct bcm_sysport_priv *priv)
@@ -2048,6 +2045,8 @@ static int bcm_sysport_open(struct net_device *dev)
 
bcm_sysport_netif_start(dev);
 
+   netif_tx_start_all_queues(dev);
+
return 0;
 
 out_clear_rx_int:
@@ -2071,7 +2070,7 @@ static void bcm_sysport_netif_stop(struct net_device *dev)
struct bcm_sysport_priv *priv = netdev_priv(dev);
 
/* stop all software from updating hardware */
-   netif_tx_stop_all_queues(dev);
+   netif_tx_disable(dev);
napi_disable(&priv->napi);
cancel_work_sync(&priv->dim.dim.work);
phy_stop(dev->phydev);
@@ -2658,12 +2657,12 @@ static int __maybe_unused bcm_sysport_suspend(struct 
device *d)
if (!netif_running(dev))
return 0;
 
+   netif_device_detach(dev);
+
bcm_sysport_netif_stop(dev);
 
phy_suspend(dev->phydev);
 
-   netif_device_detach(dev);
-
/* Disable UniMAC RX */
umac_enable_set(priv, CMD_RX_EN, 0);
 
@@ -2746,8 +2745,6 @@ static int __maybe_unused bcm_sysport_resume(struct 
device *d)
goto out_free_rx_ring;
}
 
-   netif_device_attach(dev);
-
/* RX pipe enable */
topctrl_writel(priv, 0, RX_FLUSH_CNTL);
 
@@ -2788,6 +2785,8 @@ static int __maybe_unused bcm_sysport_resume(struct 
device *d)
 
bcm_sysport_netif_start(dev);
 
+   netif_device_attach(dev);
+
return 0;
 
 out_free_rx_ring:
-- 
2.17.1



Re: [PATCH bpf 0/3] show more accurate bpf program address

2018-11-01 Thread Alexei Starovoitov
On Thu, Nov 01, 2018 at 12:00:55AM -0700, Song Liu wrote:
> This set improves the bpf program addresses shown in /proc/kallsyms and in
> bpf_prog_info. First, the real program address is shown instead of the page
> address. Second, when there is no subprogram, bpf_prog_info->jited_ksyms
> returns the main prog address.

For the set:
Acked-by: Alexei Starovoitov 

I think we have to make the whole thing consistent like this set does
and send it to stable.
The only other alternative is to invent new cmd and prog_info fields to return
proper jited_ksyms and keep this one buggy forever.
My preference is to fix it.



Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-11-01 Thread Song Liu
Hi Arnaldo,

On Wed, Oct 17, 2018 at 5:11 AM Arnaldo Carvalho de Melo
 wrote:
>
> Adding Alexey, Jiri and Namhyung as they worked/are working on
> multithreading 'perf record'.

I have read Alexey's work on enabling aio for perf-record
(https://lkml.org/lkml/2018/10/15/169).
But I feel it is not really related to this work. Did I miss anything here?

For VIP events, I think we need more mmaps. Currently, the default setting
uses 1 mmap per CPU. To capture VIP events, I think we need another mmap
per CPU. The VIP events will be sent to the special mmap. Then, user space
will process these events before the end of perf-record.

Does this approach make sense?

Thanks!
Song


Re: [PATCH net] qmi_wwan: Support dynamic config on Quectel EP06

2018-11-01 Thread David Miller
From: Kristian Evensen 
Date: Thu, 1 Nov 2018 20:37:55 +0100

> On Thu, Nov 1, 2018 at 8:30 PM Kristian Evensen
>  wrote:
>>
>> On Thu, Nov 1, 2018 at 6:40 PM Kristian Evensen
>>  wrote:
>> >
>> > Hi,
>> >
>> > On Sat, Sep 8, 2018 at 1:50 PM Kristian Evensen
>> >  wrote:
>> > > Quectel EP06 (and EM06/EG06) supports dynamic configuration of USB
>> > > interfaces, without the device changing VID/PID or configuration number.
>> > > When the configuration is updated and interfaces are added/removed, the
>> > > interface numbers change. This means that the current code for matching
>> > > EP06 does not work.
>> >
>> > Would it be possible to have this patch added to stable? I checked
>> > both the 4.14-tree and the stable queue, but could not find the
>> > patch/upstream commit.
>>
>> Please ignore this request. I discovered patch does not apply to 4.14,
>> so I will do a manual backport and send to stable.
> 
> Blah, it is clearly not my day today. I discovered that my 4.14 build
> directory was dirty and that the patch applies fine on top of
> 4.14.78. Sorry about the extra noise and please do not ignore my
> request for stable :)

I am only doing v4.19 and v4.18 -stable submissions at this point.

Please contact the -stable release maintainer directly for requests
pertaining to older releases.

Thank you.


Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread Aaron Lu
On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote:
> On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:
> > On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
> > wrote:
> > ... ...
> > > Section copied out:
> > > 
> > >   mlx5e_poll_tx_cq
> > >   |
> > >    --16.34%--napi_consume_skb
> > >      |
> > >      |--12.65%--__free_pages_ok
> > >      |  |
> > >      |   --11.86%--free_one_page
> > >      |     |
> > >      |     |--10.10%--queued_spin_lock_slowpath
> > >      |     |
> > >      |      --0.65%--_raw_spin_lock
> > 
> > This callchain looks like it is freeing higher-order pages, not order
> > 0:
> > __free_pages_ok is only called for pages whose order is bigger than
> > 0.
> 
> mlx5 rx uses only order 0 pages, so I don't know where these high-order
> tx SKBs are coming from.

Perhaps here:
__netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
__napi_alloc_frag() will all call page_frag_alloc(), which will use
__page_frag_cache_refill() to get an order 3 page if possible, or fall
back to an order 0 page if order 3 page is not available.

I'm not sure if your workload will use the above code path though.
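For scale, the difference between the page orders mentioned above (an illustrative calculation, assuming the common 4 KiB base page size):

```python
PAGE_SIZE = 4096  # assumed base page size (typical on x86-64)

def order_bytes(order: int) -> int:
    # An order-n allocation is 2**n physically contiguous base pages.
    return PAGE_SIZE << order

print(order_bytes(0))  # 4096  -- order-0 pages, what mlx5 RX allocates
print(order_bytes(3))  # 32768 -- order-3 pages, what the frag cache refill prefers
```

Order-3 (and higher) frees go through __free_pages_ok and its zone-lock path, which matches the free_one_page/queued_spin_lock_slowpath samples in the callchain above.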




[PATCH bpf 1/3] bpf: show real jited prog address in /proc/kallsyms

2018-11-01 Thread Song Liu
Currently, /proc/kallsyms shows the page address of a jited bpf program. This
is not ideal for detailed profiling (finding hot instructions from stack
traces). This patch replaces the page address with the real prog start
address.

Signed-off-by: Song Liu 
---
 kernel/bpf/core.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 6377225b2082..1a796e0799ec 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -553,7 +553,6 @@ bool is_bpf_text_address(unsigned long addr)
 int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
char *sym)
 {
-   unsigned long symbol_start, symbol_end;
struct bpf_prog_aux *aux;
unsigned int it = 0;
int ret = -ERANGE;
@@ -566,10 +565,9 @@ int bpf_get_kallsym(unsigned int symnum, unsigned long 
*value, char *type,
if (it++ != symnum)
continue;
 
-   bpf_get_prog_addr_region(aux->prog, &symbol_start, &symbol_end);
bpf_get_prog_name(aux->prog, sym);
 
-   *value = symbol_start;
+   *value = (unsigned long)aux->prog->bpf_func;
*type  = BPF_SYM_ELF_TYPE;
 
ret = 0;
-- 
2.17.1



[PATCH bpf 2/3] bpf: show real jited address in bpf_prog_info->jited_ksyms

2018-11-01 Thread Song Liu
Currently, jited_ksyms in bpf_prog_info shows page addresses of a jited
bpf program. This is not ideal for detailed profiling (finding hot
instructions from stack traces). This patch replaces the page address
with the real prog start address.

Signed-off-by: Song Liu 
---
 kernel/bpf/syscall.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ccb93277aae2..34a9eef5992c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2172,7 +2172,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
user_ksyms = u64_to_user_ptr(info.jited_ksyms);
for (i = 0; i < ulen; i++) {
ksym_addr = (ulong) 
prog->aux->func[i]->bpf_func;
-   ksym_addr &= PAGE_MASK;
if (put_user((u64) ksym_addr, &user_ksyms[i]))
return -EFAULT;
}
-- 
2.17.1



[PATCH bpf 0/3] show more accurate bpf program address

2018-11-01 Thread Song Liu
This set improves the bpf program addresses shown in /proc/kallsyms and in
bpf_prog_info. First, the real program address is shown instead of the page
address. Second, when there is no subprogram, bpf_prog_info->jited_ksyms
returns the main prog address.

Song Liu (3):
  bpf: show real jited prog address in /proc/kallsyms
  bpf: show real jited address in bpf_prog_info->jited_ksyms
  bpf: show main program address in bpf_prog_info->jited_ksyms

 kernel/bpf/core.c|  4 +---
 kernel/bpf/syscall.c | 17 -
 2 files changed, 13 insertions(+), 8 deletions(-)

--
2.17.1


[PATCH bpf 3/3] bpf: show main program address in bpf_prog_info->jited_ksyms

2018-11-01 Thread Song Liu
Currently, when there is no subprog (prog->aux->func_cnt == 0),
bpf_prog_info does not return any jited_ksyms. This patch adds the
main program address (prog->bpf_func) to jited_ksyms.

Signed-off-by: Song Liu 
---
 kernel/bpf/syscall.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 34a9eef5992c..7293b17ca62a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2158,7 +2158,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
}
 
ulen = info.nr_jited_ksyms;
-   info.nr_jited_ksyms = prog->aux->func_cnt;
+   info.nr_jited_ksyms = prog->aux->func_cnt ? : 1;
if (info.nr_jited_ksyms && ulen) {
if (bpf_dump_raw_ok()) {
u64 __user *user_ksyms;
@@ -2170,9 +2170,17 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 */
ulen = min_t(u32, info.nr_jited_ksyms, ulen);
user_ksyms = u64_to_user_ptr(info.jited_ksyms);
-   for (i = 0; i < ulen; i++) {
-   ksym_addr = (ulong) 
prog->aux->func[i]->bpf_func;
-   if (put_user((u64) ksym_addr, &user_ksyms[i]))
+   if (prog->aux->func_cnt) {
+   for (i = 0; i < ulen; i++) {
+   ksym_addr = (ulong)
+   prog->aux->func[i]->bpf_func;
+   if (put_user((u64) ksym_addr,
+&user_ksyms[i]))
+   return -EFAULT;
+   }
+   } else {
+   ksym_addr = (ulong) prog->bpf_func;
+   if (put_user((u64) ksym_addr, &user_ksyms[0]))
return -EFAULT;
}
} else {
-- 
2.17.1



[PATCH iproute2-next] rdma: Refresh help section of resource information

2018-11-01 Thread Leon Romanovsky
From: Leon Romanovsky 

After commit 4060e4c0d257 ("rdma: Add PD resource tracking
information"), the resource information shows PDs and MRs,
but help pages didn't fully reflect it.

Signed-off-by: Leon Romanovsky 
---
 rdma/res.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/rdma/res.c b/rdma/res.c
index 0d8c1c38..cbb2efe6 100644
--- a/rdma/res.c
+++ b/rdma/res.c
@@ -16,13 +16,17 @@ static int res_help(struct rd *rd)
 {
pr_out("Usage: %s resource\n", rd->filename);
pr_out("  resource show [DEV]\n");
-   pr_out("  resource show [qp|cm_id]\n");
+   pr_out("  resource show [qp|cm_id|pd|mr|cq]\n");
pr_out("  resource show qp link [DEV/PORT]\n");
pr_out("  resource show qp link [DEV/PORT] [FILTER-NAME 
FILTER-VALUE]\n");
pr_out("  resource show cm_id link [DEV/PORT]\n");
pr_out("  resource show cm_id link [DEV/PORT] [FILTER-NAME 
FILTER-VALUE]\n");
pr_out("  resource show cq link [DEV/PORT]\n");
pr_out("  resource show cq link [DEV/PORT] [FILTER-NAME 
FILTER-VALUE]\n");
+   pr_out("  resource show pd dev [DEV]\n");
+   pr_out("  resource show pd dev [DEV] [FILTER-NAME 
FILTER-VALUE]\n");
+   pr_out("  resource show mr dev [DEV]\n");
+   pr_out("  resource show mr dev [DEV] [FILTER-NAME 
FILTER-VALUE]\n");
return 0;
 }

--
2.14.4



Re: [PATCH v2 net 3/3] net/mlx4_en: use __netdev_tx_sent_queue()

2018-11-01 Thread Tariq Toukan


On 31/10/2018 5:39 PM, Eric Dumazet wrote:
> doorbell only depends on xmit_more and netif_tx_queue_stopped()
> 
> Using __netdev_tx_sent_queue() avoids messing with BQL stop flag,
> and is more generic.
> 
> This patch increases performance on GSO workload by keeping
> doorbells to the minimum required.
> 
> Signed-off-by: Eric Dumazet 
> Cc: Tariq Toukan 
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_tx.c | 6 --
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 
> 1857ee0f0871d48285a6d3711f7c3e9a1e08a05f..6f5153afcab4dfc331c099da854c54f1b9500887
>  100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -1006,7 +1006,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct 
> net_device *dev)
>   ring->packets++;
>   }
>   ring->bytes += tx_info->nr_bytes;
> - netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
>   AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len);
>   
>   if (tx_info->inl)
> @@ -1044,7 +1043,10 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct 
> net_device *dev)
>   netif_tx_stop_queue(ring->tx_queue);
>   ring->queue_stopped++;
>   }
> - send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
> +
> + send_doorbell = __netdev_tx_sent_queue(ring->tx_queue,
> +tx_info->nr_bytes,
> +skb->xmit_more);
>   
>   real_size = (real_size / 16) & 0x3f;
>   
> 

Reviewed-by: Tariq Toukan 

Looks good to me.
Thanks.



Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread Jesper Dangaard Brouer
On Wed, 31 Oct 2018 23:20:01 +0100
Paweł Staszewski  wrote:

> W dniu 31.10.2018 o 23:09, Eric Dumazet pisze:
> >
> > On 10/31/2018 02:57 PM, Paweł Staszewski wrote:  
> >> Hi
> >>
> >> So maybe someone will be interested in how the Linux kernel handles
> >> normal traffic (not pktgen :) )

Pawel is this live production traffic?

I know Yoel (Cc) is very interested to know the real-life limitation of
Linux as a router, especially with VLANs like you use.


> >>
> >> Server HW configuration:
> >>
> >> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> >>
> >> NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
> >>
> >>
> >> Server software:
> >>
> >> FRR - as routing daemon
> >>
> >> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa 
> >> node)
> >>
> >> enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node)
> >>
> >>
> >> Maximum traffic that server can handle:
> >>
> >> Bandwidth
> >>
> >>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> >>    input: /proc/net/dev type: rate
> >>    \ iface   Rx Tx    Total
> >> ==
> >>     enp175s0f1:  28.51 Gb/s   37.24 Gb/s   65.74 Gb/s
> >>     enp175s0f0:  38.07 Gb/s   28.44 Gb/s   66.51 Gb/s
> >> --
> >>      total:  66.58 Gb/s   65.67 Gb/s  132.25 Gb/s
> >>

Actually rather impressive number for a Linux router.

> >>
> >> Packets per second:
> >>
> >>   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> >>    input: /proc/net/dev type: rate
> >>    - iface   Rx Tx    Total
> >> ==
> >>     enp175s0f1:  5248589.00 P/s   3486617.75 P/s 8735207.00 P/s
> >>     enp175s0f0:  3557944.25 P/s   5232516.00 P/s 8790460.00 P/s
> >> --
> >>      total:  8806533.00 P/s   8719134.00 P/s 17525668.00 P/s
> >>

Average packet size:
  (28.51*10^9/8)/5248589 =  678.99 bytes 
  (38.07*10^9/8)/3557944 = 1337.49 bytes


> >> After reaching that limits nics on the upstream side (more RX
> >> traffic) start to drop packets
> >>
> >>
> >> I just don't understand why the server can't handle more bandwidth
> >> (~40Gbit/s is the limit where all CPUs are at 100% util) - while pps on
> >> the RX side keeps increasing.
> >>
> >> I was thinking that maybe I had reached some PCIe x16 limit - but x16 8GT
> >> is 126Gbit - and also when testing with pktgen I can reach more bw
> >> and pps (like 4x more compared to normal internet traffic)
> >>
> >> And wondering if there is something that can be improved here.
> >>
> >>
> >>
> >> Some more informations / counters / stats and perf top below:
> >>
> >> Perf top flame graph:
> >>
> >> https://uploadfiles.io/7zo6u

Thanks a lot for the flame graph!

> >>
> >> System configuration(long):
> >>
> >>
> >> cat /sys/devices/system/node/node1/cpulist
> >> 14-27,42-55
> >> cat /sys/class/net/enp175s0f0/device/numa_node
> >> 1
> >> cat /sys/class/net/enp175s0f1/device/numa_node
> >> 1
> >>

Hint: grep can give you nicer output than cat:

$ grep -H . /sys/class/net/*/device/numa_node

> >>
> >>
> >>
> >>
> >> ip -s -d link ls dev enp175s0f0
> >> 6: enp175s0f0:  mtu 1500 qdisc mq state 
> >> UP mode DEFAULT group default qlen 8192
> >>      link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 
> >> addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 
> >> gso_max_segs 65535
> >>      RX: bytes  packets  errors  dropped overrun mcast
> >>      184142375840858 141347715974 2   2806325 0   85050528
> >>      TX: bytes  packets  errors  dropped carrier collsns
> >>      99270697277430 172227994003 0   0   0   0
> >>
> >>   ip -s -d link ls dev enp175s0f1
> >> 7: enp175s0f1:  mtu 1500 qdisc mq state 
> >> UP mode DEFAULT group default qlen 8192
> >>      link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 
> >> addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 
> >> gso_max_segs 65535
> >>      RX: bytes  packets  errors  dropped overrun mcast
> >>      99686284170801 173507590134 61  669685  0   100304421
> >>      TX: bytes  packets  errors  dropped carrier collsns
> >>      184435107970545 142383178304 0   0   0   0
> >>

You have increased the default (1000) qlen to 8192, why?

What default qdisc do you run?... looking through your very detailed main
email report (I do love the details you give!).  You run
pfifo_fast_dequeue, thus this 8192 qlen is actually having an effect.

I would like to know if and how much qdisc_dequeue bulking is happening
in this setup?  Can you run:

 perf-stat-hist -m 8192 -P2 qdisc:

Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread Paweł Staszewski




W dniu 01.11.2018 o 10:22, Jesper Dangaard Brouer pisze:

On Wed, 31 Oct 2018 23:20:01 +0100
Paweł Staszewski  wrote:


W dniu 31.10.2018 o 23:09, Eric Dumazet pisze:

On 10/31/2018 02:57 PM, Paweł Staszewski wrote:

Hi

So maybe someone will be interested in how the Linux kernel handles
normal traffic (not pktgen :) )

Pawel is this live production traffic?
Yes, I moved the server from the testlab to production to check (risking a
little - but this traffic can be switched to the backup router :) )




I know Yoel (Cc) is very interested to know the real-life limitation of
Linux as a router, especially with VLANs like you use.
So yes, this is real-life traffic, real users - normal mixed internet
traffic being forwarded (including DDoSes :) )








Server HW configuration:

CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)


Server software:

FRR - as routing daemon

enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node)

enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node)


Maximum traffic that server can handle:

Bandwidth

   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
    input: /proc/net/dev type: rate
    \ iface   Rx Tx    Total
==
     enp175s0f1:  28.51 Gb/s   37.24 Gb/s   65.74 Gb/s
     enp175s0f0:  38.07 Gb/s   28.44 Gb/s   66.51 Gb/s
--
      total:  66.58 Gb/s   65.67 Gb/s  132.25 Gb/s


Actually rather impressive number for a Linux router.


Packets per second:

   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
    input: /proc/net/dev type: rate
    - iface   Rx Tx    Total
==
     enp175s0f1:  5248589.00 P/s   3486617.75 P/s 8735207.00 P/s
     enp175s0f0:  3557944.25 P/s   5232516.00 P/s 8790460.00 P/s
--
      total:  8806533.00 P/s   8719134.00 P/s 17525668.00 P/s


Average packet size:
   (28.51*10^9/8)/5248589 =  678.99 bytes
   (38.07*10^9/8)/3557944 = 1337.49 bytes



After reaching that limits nics on the upstream side (more RX
traffic) start to drop packets


I just don't understand why the server can't handle more bandwidth
(~40Gbit/s is the limit where all CPUs are at 100% util) - while pps on
the RX side keeps increasing.

I was thinking that maybe I had reached some PCIe x16 limit - but x16 8GT
is 126Gbit - and also when testing with pktgen I can reach more bw
and pps (like 4x more compared to normal internet traffic)

And wondering if there is something that can be improved here.



Some more informations / counters / stats and perf top below:

Perf top flame graph:

https://uploadfiles.io/7zo6u

Thanks a lot for the flame graph!


System configuration(long):


cat /sys/devices/system/node/node1/cpulist
14-27,42-55
cat /sys/class/net/enp175s0f0/device/numa_node
1
cat /sys/class/net/enp175s0f1/device/numa_node
1


Hint: grep can give you nicer output than cat:

$ grep -H . /sys/class/net/*/device/numa_node

Sure:
grep -H . /sys/class/net/*/device/numa_node
/sys/class/net/enp175s0f0/device/numa_node:1
/sys/class/net/enp175s0f1/device/numa_node:1











ip -s -d link ls dev enp175s0f0
6: enp175s0f0:  mtu 1500 qdisc mq state UP 
mode DEFAULT group default qlen 8192
      link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 
addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 
gso_max_segs 65535
      RX: bytes  packets  errors  dropped overrun mcast
      184142375840858 141347715974 2   2806325 0   85050528
      TX: bytes  packets  errors  dropped carrier collsns
      99270697277430 172227994003 0   0   0   0

   ip -s -d link ls dev enp175s0f1
7: enp175s0f1:  mtu 1500 qdisc mq state UP 
mode DEFAULT group default qlen 8192
      link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 
addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 
gso_max_segs 65535
      RX: bytes  packets  errors  dropped overrun mcast
      99686284170801 173507590134 61  669685  0   100304421
      TX: bytes  packets  errors  dropped carrier collsns
      184435107970545 142383178304 0   0   0   0


You have increased the default (1000) qlen to 8192, why?

I was checking whether a higher txq would change anything,
but there was no change for settings 1000, 4096, 8192.
And yes, I do not use any traffic shaping there like hfsc/htb etc.
- just the default qdisc mq 0:
root pfifo_fast
tc qdisc show dev enp175s0f1
qdisc mq 0: root
qdisc pfifo_fast 0: parent :38 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :37 bands 3 priomap 

Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread Jesper Dangaard Brouer
On Wed, 31 Oct 2018 21:37:16 -0600 David Ahern  wrote:

> This is mainly a forwarding use case? Seems so based on the perf report.
> I suspect forwarding with XDP would show pretty good improvement. 

Yes, significant performance improvements.

Notice Davids talk: "Leveraging Kernel Tables with XDP"
 http://vger.kernel.org/lpc-networking2018.html#session-1

It looks like you are doing "pure" IP-routing, without any
iptables conntrack stuff (from your perf report data).  That will
actually be a really good use-case for accelerating this with XDP.

I want you to understand the philosophy behind how David and I want
people to leverage XDP.  Think of XDP as a software offload layer for
the kernel network stack. Setup and use Linux kernel network stack, but
accelerate parts of it with XDP, e.g. the route FIB lookup.

Sample code avail here:
 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/xdp_fwd_kern.c

(I do warn, we just found a bug/crash in setup+teardown for the
mlx5 driver you are using, which we/mlx _will_ fix soon)


> You need the vlan changes I have queued up though.

I know Yoel will be very interested in those changes too! I've
convinced Yoel to write an XDP program for his Border Network Gateway
(BNG) production system[1], and he is a heavy VLAN user.  And the plan
is to open source this when he has something working.

[1] https://www.version2.dk/blog/software-router-del-5-linux-bng-1086060
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v1 net] net: dsa: microchip: initialize mutex before use

2018-11-01 Thread Andrew Lunn
On Wed, Oct 31, 2018 at 07:49:08PM -0700, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> Initialize mutex before use.  Avoid kernel complaint when
> CONFIG_DEBUG_LOCK_ALLOC is enabled.
> 
> Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
> Signed-off-by: Tristram Ha 

Reviewed-by: Andrew Lunn 

Andrew




Re: [PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset

2018-11-01 Thread Alexey Kodanev
On 10/31/2018 08:35 PM, David Ahern wrote:
> On 10/31/18 10:55 AM, David Ahern wrote:
>> I think the simplest fix for 4.20 is to break the loop if ret is non-0 -
>> restore the previous behavior. 
> 
> that is the only recourse. It has to bail if ret is non-0. Do you want
> to send a patch with that fix?
> 

I see, and inet6_dump_fib() cleans up fib6_walker if ret is zero. I will send
the fix.


RE: [PATCH iproute2-next] rdma: Refresh help section of resource information

2018-11-01 Thread Steve Wise



> -Original Message-
> From: Leon Romanovsky 
> Sent: Thursday, November 1, 2018 3:35 AM
> To: David Ahern 
> Cc: Leon Romanovsky ; netdev
> ; RDMA mailing list ;
> Stephen Hemminger ; Steve Wise
> 
> Subject: [PATCH iproute2-next] rdma: Refresh help section of resource
> information
> 
> From: Leon Romanovsky 
> 
> After commit 4060e4c0d257 ("rdma: Add PD resource tracking
> information"), the resource information shows PDs and MRs,
> but help pages didn't fully reflect it.
> 
> Signed-off-by: Leon Romanovsky 

Oops.  Thanks.  Looks fine.

Reviewed-by: Steve Wise 




Re: [PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset

2018-11-01 Thread Alexey Kodanev
On 11/01/2018 04:11 PM, Alexey Kodanev wrote:
> On 10/31/2018 08:35 PM, David Ahern wrote:
>> On 10/31/18 10:55 AM, David Ahern wrote:
>>> I think the simplest fix for 4.20 is to break the loop if ret is non-0 -
>>> restore the previous behavior. 
>>
>> that is the only recourse. It has to bail if ret is non-0. Do you want
>> to send a patch with that fix?
>>
> 
> I see, and inet6_dump_fib() cleans up fib6_walker if ret is zero. I will
> send the fix.

Can it happen that inet6_dump_fib() returns skb->len (0) in the below cases?

*   if (arg.filter.flags & RTM_F_CLONED)
return skb->len;

...

w = (void *)cb->args[2];
if (!w) {
...
w = kzalloc(...)
...

*   if (arg.filter.table_id) {
...
if (!tb) {
if (arg.filter.dump_all_families)
return skb->len;


Would it be safer to add "res = skb->len; goto out;" instead of "return 
skb->len;"
so that it can call fib6_dump_end() for "res <= 0"? Or use cb->data instead of
cb->args?


Re: [PATCH v1 net] net: dsa: microchip: initialize mutex before use

2018-11-01 Thread Pavel Machek
On Thu 2018-11-01 13:17:19, Andrew Lunn wrote:
> On Wed, Oct 31, 2018 at 07:49:08PM -0700, tristram...@microchip.com wrote:
> > From: Tristram Ha 
> > 
> > Initialize mutex before use.  Avoid kernel complaint when
> > CONFIG_DEBUG_LOCK_ALLOC is enabled.
> > 
> > Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477")
> > Signed-off-by: Tristram Ha 

Reviewed-by: Pavel Machek 
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread Paweł Staszewski




W dniu 01.11.2018 o 11:55, Jesper Dangaard Brouer pisze:

On Wed, 31 Oct 2018 21:37:16 -0600 David Ahern  wrote:


This is mainly a forwarding use case? Seems so based on the perf report.
I suspect forwarding with XDP would show pretty good improvement.

Yes, significant performance improvements.

Notice Davids talk: "Leveraging Kernel Tables with XDP"
  http://vger.kernel.org/lpc-networking2018.html#session-1

It will be really interesting


It looks like you are doing "pure" IP-routing, without any
iptables conntrack stuff (from your perf report data).  That will
actually be a really good use-case for accelerating this with XDP.

Yes pure IP routing
iptables used only for some local input filtering.




I want you to understand the philosophy behind how David and I want
people to leverage XDP.  Think of XDP as a software offload layer for
the kernel network stack. Setup and use Linux kernel network stack, but
accelerate parts of it with XDP, e.g. the route FIB lookup.

Sample code avail here:
  
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/xdp_fwd_kern.c
I can try some tests on the same hw but in a testlab configuration - I will
give it a try :)





(I do warn, we just found a bug/crash in setup+teardown for the
mlx5 driver you are using, which we/mlx _will_ fix soon)

Ok





You need the vlan changes I have queued up though.

I know Yoel will be very interested in those changes too! I've
convinced Yoel to write an XDP program for his Border Network Gateway
(BNG) production system[1], and he is a heavy VLAN user.  And the plan
is to open source this when he has something working.

[1] https://www.version2.dk/blog/software-router-del-5-linux-bng-1086060


Ok - for now I need to split the traffic onto two separate 100G ports placed
in two different x16 PCIe slots, to check whether the problem is mainly
caused by running out of PCIe x16 bandwidth.





Re: [net 1/8] igb: shorten maximum PHC timecounter update interval

2018-11-01 Thread Miroslav Lichvar
On Wed, Oct 31, 2018 at 12:42:47PM -0700, Jeff Kirsher wrote:
> From: Miroslav Lichvar 
> 
> The timecounter needs to be updated at least once per ~550 seconds in
> order to avoid a 40-bit SYSTIM timestamp to be misinterpreted as an old
> timestamp.
> 
> Since commit 500462a9d ("timers: Switch to a non-cascading wheel"),
> scheduling of delayed work seems to be less accurate and a requested
> delay of 540 seconds may actually be longer than 550 seconds. Shorten
> the delay to 480 seconds to be sure the timecounter is updated in time.

It looks like this is the v1 of the patch. There was a v2 I sent on
Oct 26, which made the interval even shorter. I can send a separate
patch for that change.

-- 
Miroslav Lichvar


Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread Aaron Lu
On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote:
... ...
> Section copied out:
> 
>   mlx5e_poll_tx_cq
>   |  
>--16.34%--napi_consume_skb
>  |  
>  |--12.65%--__free_pages_ok
>  |  |  
>  |   --11.86%--free_one_page
>  | |  
>  | |--10.10%--queued_spin_lock_slowpath
>  | |  
>  |  --0.65%--_raw_spin_lock

This callchain looks like it is freeing higher order pages than order 0:
__free_pages_ok is only called for pages whose order is bigger than 0.

>  |  
>  |--1.55%--page_frag_free
>  |  
>   --1.44%--skb_release_data
> 
> 
> Let me explain what (I think) happens.  The mlx5 driver RX-page recycle
> mechanism is not effective in this workload, and pages have to go
> through the page allocator.  The lock contention happens during mlx5
> DMA TX completion cycle.  And the page allocator cannot keep up at
> these speeds.
> 
> One solution is to extend the page allocator with a bulk free API.  (This
> has been on my TODO list for a long time, but I don't have a
> micro-benchmark that tricks the driver page-recycle into failing.)  It
> should fit nicely, as I can see that kmem_cache_free_bulk() does get
> activated (bulk freeing SKBs), which means that DMA TX completion does
> have a bulk of packets. 
> 
> We can (and should) also improve the page recycle scheme in the driver.
> After LPC, I have a project with Tariq and Ilias (Cc'ed) to improve the
> page_pool, and we will (attempt) to generalize this, for both high-end
> mlx5 and more low-end ARM64-boards (macchiatobin and espressobin).
> 
> The MM people are working in parallel to improve the performance of
> order-0 page returns.  Thus, the explicit page bulk free API might
> actually become less important.  I actually think (Cc.) Aaron has a
> patchset he would like you to test, which removes the (zone->)lock
> you hit in free_one_page().

Thanks Jesper.

Yes, the said patchset is in this branch:
https://github.com/aaronlu/linux no_merge_cluster_alloc_4.19-rc5

But as I said above, I think the lock contention here is for
order > 0 pages so my current patchset will not work here, unfortunately.

BTW, Mel Gorman has suggested an alternative way to improve page
allocator's scalability and I'm working on it right now, it will
improve page allocator's scalability for all order pages. I might be
able to post it some time next week, will CC all of you when it's ready.


Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread Paweł Staszewski




W dniu 01.11.2018 o 12:09, Paweł Staszewski pisze:

rx_cqe_compress_pkts: 0

If this is a PCIe bottleneck it might be useful to enable CQE
compression (to reduce PCIe completion descriptor transactions) -
you should see the above rx_cqe_compress_pkts increasing when enabled.

$ ethtool  --set-priv-flags enp175s0f1 rx_cqe_compress on
$ ethtool --show-priv-flags enp175s0f1
Private flags for p6p1:
rx_cqe_moder   : on
cqe_moder  : off
rx_cqe_compress    : on
...

try this on both interfaces.

Done
ethtool --show-priv-flags enp175s0f1
Private flags for enp175s0f1:
rx_cqe_moder   : on
tx_cqe_moder   : off
rx_cqe_compress    : on
rx_striding_rq : off
rx_no_csum_complete: off

ethtool --show-priv-flags enp175s0f0
Private flags for enp175s0f0:
rx_cqe_moder   : on
tx_cqe_moder   : off
rx_cqe_compress    : on
rx_striding_rq : off
rx_no_csum_complete: off 
Enabling CQE compression changes nothing - after reaching 64Gbit/s RX /
64Gbit/s TX on the interfaces, the CPUs are saturated at 100%.


ethtool -S enp175s0f1 | grep rx_cqe_compress
 rx_cqe_compress_blks: 5657836379
 rx_cqe_compress_pkts: 13153761080

ethtool -S enp175s0f0 | grep rx_cqe_compress
 rx_cqe_compress_blks: 5994612500
 rx_cqe_compress_pkts: 13579014869


 bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
  input: /proc/net/dev type: rate
  - iface   Rx Tx    Total
==
   enp175s0f1:  27.03 Gb/s   37.09 Gb/s   64.12 Gb/s
   enp175s0f0:  36.84 Gb/s   26.82 Gb/s   63.66 Gb/s

--
    total:  63.85 Gb/s   63.87 Gb/s 127.72 Gb/s

bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
  input: /proc/net/dev type: rate
  / iface   Rx Tx    Total
==
   enp175s0f1:   3.22 GB/s    4.26 GB/s    7.48 GB/s
   enp175s0f0:   4.24 GB/s    3.21 GB/s    7.45 GB/s

--
    total:   7.46 GB/s    7.47 GB/s   14.93 GB/s


mpstat
Average: CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average: all    0.05    0.00    0.19    0.02    0.00   42.74    0.00    0.00    0.00   56.99
Average:   0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:   1    0.00    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   99.70
Average:   2    0.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.80
Average:   3    0.00    0.00    0.20    1.20    0.00    0.00    0.00    0.00    0.00   98.60
Average:   4    0.10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.90
Average:   5    0.00    0.00    0.10    0.00    0.00    0.00    0.00    0.00    0.00   99.90
Average:   6    0.10    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.70
Average:   7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:   8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:   9    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:  10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:  11    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:  12    1.40    0.00    4.50    0.00    0.00    0.00    0.00    0.00    0.00   94.10
Average:  13    0.00    0.00    1.60    0.00    0.00    0.00    0.00    0.00    0.00   98.40
Average:  14    0.00    0.00    0.00    0.00    0.00   84.10    0.00    0.00    0.00   15.90
Average:  15    0.00    0.00    0.10    0.00    0.00   93.70    0.00    0.00    0.00    6.20
Average:  16    0.00    0.00    0.10    0.00    0.00   94.31    0.00    0.00    0.00    5.59
Average:  17    0.00    0.00    0.00    0.00    0.00   95.30    0.00    0.00    0.00    4.70
Average:  18    0.00    0.00    0.00    0.00    0.00   62.80    0.00    0.00    0.00   37.20
Average:  19    0.00    0.00    0.10    0.00    0.00   98.90    0.00    0.00    0.00    1.00
Average:  20    0.00    0.00    0.00    0.00    0.00   99.30    0.00    0.00    0.00    0.70
Average:  21    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
Average:  22    0.00    0.00    0.00    0.00    0.00   99.90    0.00    0.00    0.00    0.10
Average:  23    0.00    0.00    0.10    0.00    0.00   99.90    0.00    0.00    0.00    0.00
Average:  24    0.00    0.00    0.10    0.00    0.00   97.10    0.00    0.00    0.00    2.80
Average:  25    0.00    0.00    0.00    0.0

Re: [RFC bpf-next] libbpf: increase rlimit before trying to create BPF maps

2018-11-01 Thread Quentin Monnet
2018-10-30 15:23 UTC+ ~ Quentin Monnet 
> The limit for memory locked in the kernel by a process is usually set to
> 64 KB by default. This can be an issue when creating large BPF maps.
> A workaround is to raise this limit for the current process before
> trying to create a new BPF map. Changing the hard limit requires the
> CAP_SYS_RESOURCE and can usually only be done by root user (but then
> only root can create BPF maps).

Sorry, the parenthesis is not correct: non-root users can in fact create
BPF maps as well. If a non-root user calls the function to create a map,
setrlimit() will fail silently (but set errno), and the program will
simply go on with its rlimit unchanged.

> 
> As far as I know there is no API to get the current amount of memory
> locked for a user, therefore we cannot raise the limit only when
> required. One solution, used by bcc, is to try to create the map, and on
> getting an EPERM error, raising the limit to infinity before giving
> another try. Another approach, used in iproute, is to raise the limit in
> all cases, before trying to create the map.
> 
> Here we do the same as in iproute2: the rlimit is raised to infinity
> before trying to load the map.
> 
> I send this patch as a RFC to see if people would prefer the bcc
> approach instead, or the rlimit change to be in bpftool rather than in
> libbpf.
> 
> Signed-off-by: Quentin Monnet 
> ---
>  tools/lib/bpf/bpf.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 03f9bcc4ef50..456a5a7b112c 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -26,6 +26,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include "bpf.h"
>  #include "libbpf.h"
>  #include 
> @@ -68,8 +70,11 @@ static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr 
> *attr,
>  int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr)
>  {
>   __u32 name_len = create_attr->name ? strlen(create_attr->name) : 0;
> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>   union bpf_attr attr;
>  
> + setrlimit(RLIMIT_MEMLOCK, &rinf);
> +
>   memset(&attr, '\0', sizeof(attr));
>  
>   attr.map_type = create_attr->map_type;
> 



Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread David Ahern
On 11/1/18 7:52 AM, Paweł Staszewski wrote:
> 
> 
> W dniu 01.11.2018 o 11:55, Jesper Dangaard Brouer pisze:
>> On Wed, 31 Oct 2018 21:37:16 -0600 David Ahern  wrote:
>>
>>> This is mainly a forwarding use case? Seems so based on the perf report.
>>> I suspect forwarding with XDP would show pretty good improvement.
>> Yes, significant performance improvements.
>>
>> Notice Davids talk: "Leveraging Kernel Tables with XDP"
>>   http://vger.kernel.org/lpc-networking2018.html#session-1
> It will be really interesting

It's pushing the exact use case you have: FRR manages the FIB, XDP
programs get access to updates as they happen for fast path forwarding.

> 
>> It looks like that you are doing "pure" IP-routing, without any
>> iptables conntrack stuff (from your perf report data).  That will
>> actually be a really good use-case for accelerating this with XDP.
> Yes pure IP routing
> iptables used only for some local input filtering.
> 
> 
>>
>> I want you to understand the philosophy behind how David and I want
>> people to leverage XDP.  Think of XDP as a software offload layer for
>> the kernel network stack. Setup and use Linux kernel network stack, but
>> accelerate parts of it with XDP, e.g. the route FIB lookup.
>>
>> Sample code avail here:
>>  
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/xdp_fwd_kern.c
>>
> I can try some tests on same hw but testlab configuration - will give it
> a try :)
> 

That version does not work with VLANs. I have patches for it but it
needs a bit more work before sending out. Perhaps I can get back to it
next week.


Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread Paweł Staszewski




W dniu 01.11.2018 o 18:23, David Ahern pisze:

On 11/1/18 7:52 AM, Paweł Staszewski wrote:


W dniu 01.11.2018 o 11:55, Jesper Dangaard Brouer pisze:

On Wed, 31 Oct 2018 21:37:16 -0600 David Ahern  wrote:


This is mainly a forwarding use case? Seems so based on the perf report.
I suspect forwarding with XDP would show pretty good improvement.

Yes, significant performance improvements.

Notice Davids talk: "Leveraging Kernel Tables with XDP"
   http://vger.kernel.org/lpc-networking2018.html#session-1

It will be really interesting

It's pushing the exact use case you have: FRR manages the FIB, XDP
programs get access to updates as they happen for fast path forwarding.

Can't wait then :)


It looks like that you are doing "pure" IP-routing, without any
iptables conntrack stuff (from your perf report data).  That will
actually be a really good use-case for accelerating this with XDP.

Yes pure IP routing
iptables used only for some local input filtering.



I want you to understand the philosophy behind how David and I want
people to leverage XDP.  Think of XDP as a software offload layer for
the kernel network stack. Setup and use Linux kernel network stack, but
accelerate parts of it with XDP, e.g. the route FIB lookup.

Sample code avail here:
  
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/xdp_fwd_kern.c



I can try some tests on same hw but testlab configuration - will give it
a try :)


That version does not work with VLANs. I have patches for it but it
needs a bit more work before sending out. Perhaps I can get back to it
next week.

That will be nice - next week I will be able to replace the network 
controller and install two separate 100Gbit NICs into two PCIe x16 
slots, so I can test without hitting PCIe bandwidth limits.





Re: [PATCH net] qmi_wwan: Support dynamic config on Quectel EP06

2018-11-01 Thread Kristian Evensen
Hi,

On Sat, Sep 8, 2018 at 1:50 PM Kristian Evensen
 wrote:
> Quectel EP06 (and EM06/EG06) supports dynamic configuration of USB
> interfaces, without the device changing VID/PID or configuration number.
> When the configuration is updated and interfaces are added/removed, the
> interface numbers change. This means that the current code for matching
> EP06 does not work.

Would it be possible to have this patch added to stable? I checked
both the 4.14-tree and the stable queue, but could not find the
patch/upstream commit.

Thanks,
Kristian


Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets

2018-11-01 Thread Eric Dumazet



On 11/01/2018 10:58 AM, Leif Hedstrom wrote:
> 
> 
>> On Oct 31, 2018, at 6:53 PM, Eric Dumazet  wrote:
>>
>>
>>
>> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
>>> Implementations of Quic might want to create a separate socket for each
>>> Quic-connection by creating a connected UDP-socket.
>>>
>>
>> Nice proposal, but I doubt a QUIC server can afford having one UDP socket 
>> per connection ?
> 
> First thing: This is an idea we’ve been floating, and it’s not completed yet, 
> so we don’t have any performance numbers etc. to share. The ideas for the 
> implementation came up after a discussion with Ian and Jana re: their 
> implementation of a QUIC server.
> 
> That much said, the general rationale for this is that having a socket for 
> each QUIC connection could simplify integrating QUIC into existing software 
> that already does epoll() over TCP sockets. This is how e.g. Apache Traffic 
> Server works, which is our target implementation for QUIC.
> 
> 
> 
>>
>> It would add a huge overhead in term of memory usage in the kernel,
>> and lots of epoll events to manage (say a QUIC server with one million 
>> flows, receiving
>> very few packets per second per flow)
> 
> Our use case is not millions of sockets, rather, 10’s of thousands. There 
> would be one socket for each QUIC Connection, not per stream (obviously). At 
> ~80Gbps on a box, we definitely see much less than 100k TCP connections.
> 
> Question: is there additional memory overhead here for the UDP sockets vs a 
> normal TCP socket for e.g. HTTP or HTTP/2 ?

TCP sockets have a lot of state. We can understand spending 2 or 3 KB per 
socket.

UDP sockets really have no state. The receive queue anchor is only 24 bytes.
Still, the memory costs for one UDP socket are:

1344 bytes for UDP socket,
320 bytes for the "struct file"
192 bytes for the struct dentry
704 bytes for inode
512 bytes for the two dst (connected socket)
200 bytes for eventpoll structures
104 bytes for the fq flow

That is about 3.3KB per socket (but you can probably round this to 4KB due to 
kmalloc roundings)

One million sockets -> 4GB of memory.

This really does not scale.
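As a quick arithmetic check on the figures above, the listed per-socket costs can be summed directly (the numbers are taken verbatim from the message; the array and function names are just for illustration):

```c
/* Per-UDP-socket kernel memory costs quoted above, in bytes. */
static const int udp_socket_costs[] = {
	1344,	/* UDP socket */
	320,	/* struct file */
	192,	/* struct dentry */
	704,	/* inode */
	512,	/* two dst entries (connected socket) */
	200,	/* eventpoll structures */
	104,	/* fq flow */
};

static long udp_socket_total_bytes(void)
{
	long total = 0;
	unsigned int i;

	for (i = 0; i < sizeof(udp_socket_costs) / sizeof(udp_socket_costs[0]); i++)
		total += udp_socket_costs[i];
	return total;	/* 3376 bytes, ~3.3 KB */
}
```

The sum is 3376 bytes, i.e. roughly 3.3 KB; rounded up to a 4 KB footprint by kmalloc size classes, one million sockets comes out at about 4 GB.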


Help with the BPF verifier

2018-11-01 Thread Arnaldo Carvalho de Melo
tl;dr: I seem to be fighting clang optimizations to get the
   verifier to accept my proggie.

Hi,

So I'm moving to use raw_syscalls:sys_exit to collect pointer
contents, using maps to tell the bpf program what to copy, how many
bytes, filters, etc.

I'm at the start of it; at this point I need to use an index to
get to the right syscall arg that is a filename, starting just with
"open" and "openat", which have the filename in different args. To get
this first part working I'm doing it directly in the BPF restricted C
program; later this will come from maps, etc. If I set the index as a
constant, just for testing, it works - look at the "open" and "openat"
calls below; later we'll see why "openat" fails to augment its
"filename" arg while "open" works:

[root@seventh perf]# trace -e tools/perf/examples/bpf/augmented_raw_syscalls.c 
sleep 1
 ? ( ): sleep/10152  ... [continued]: execve()) = 0
 0.045 ( 0.004 ms): sleep/10152 brk() = 0x55ccff356000
 0.074 ( 0.007 ms): sleep/10152 access(filename: , mode: R) = -1 ENOENT No 
such file or directory
 0.089 ( 0.006 ms): sleep/10152 openat(dfd: CWD, filename: , flags: 
CLOEXEC) = 3
 0.097 ( 0.003 ms): sleep/10152 fstat(fd: 3, statbuf: 0x7ffecdd283f0) = 0
 0.103 ( 0.006 ms): sleep/10152 mmap(len: 103334, prot: READ, flags: 
PRIVATE, fd: 3) = 0x7f8ffee9c000
 0.111 ( 0.002 ms): sleep/10152 close(fd: 3) = 0
 0.135 ( 0.007 ms): sleep/10152 openat(dfd: CWD, filename: , flags: 
CLOEXEC) = 3
 0.144 ( 0.003 ms): sleep/10152 read(fd: 3, buf: 0x7ffecdd285b8, count: 
832) = 832
 0.150 ( 0.002 ms): sleep/10152 fstat(fd: 3, statbuf: 0x7ffecdd28450) = 0
 0.155 ( 0.005 ms): sleep/10152 mmap(len: 8192, prot: READ|WRITE, flags: 
PRIVATE|ANONYMOUS) = 0x7f8ffee9a000
 0.166 ( 0.007 ms): sleep/10152 mmap(len: 3889792, prot: EXEC|READ, flags: 
PRIVATE|DENYWRITE, fd: 3) = 0x7f8ffe8dc000
 0.175 ( 0.010 ms): sleep/10152 mprotect(start: 0x7f8ffea89000, len: 
2093056) = 0
 0.188 ( 0.010 ms): sleep/10152 mmap(addr: 0x7f8ffec88000, len: 24576, 
prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 1753088) = 
0x7f8ffec88000
 0.204 ( 0.005 ms): sleep/10152 mmap(addr: 0x7f8ffec8e000, len: 14976, 
prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7f8ffec8e000
 0.218 ( 0.002 ms): sleep/10152 close(fd: 3) = 0
 0.239 ( 0.002 ms): sleep/10152 arch_prctl(option: 4098, arg2: 
140256433779968) = 0
 0.312 ( 0.009 ms): sleep/10152 mprotect(start: 0x7f8ffec88000, len: 16384, 
prot: READ) = 0
 0.343 ( 0.005 ms): sleep/10152 mprotect(start: 0x55ccff1c6000, len: 4096, 
prot: READ) = 0
 0.354 ( 0.006 ms): sleep/10152 mprotect(start: 0x7f8ffeeb6000, len: 4096, 
prot: READ) = 0
 0.362 ( 0.019 ms): sleep/10152 munmap(addr: 0x7f8ffee9c000, len: 103334) = 0
 0.476 ( 0.002 ms): sleep/10152 brk() = 0x55ccff356000
 0.480 ( 0.004 ms): sleep/10152 brk(brk: 0x55ccff377000) = 0x55ccff377000
 0.487 ( 0.002 ms): sleep/10152 brk() = 0x55ccff377000
 0.497 ( 0.008 ms): sleep/10152 open(filename: 
/usr/lib/locale/locale-archive, flags: CLOEXEC) = 3
 0.507 ( 0.002 ms): sleep/10152 fstat(fd: 3, statbuf: 0x7f8ffec8daa0) = 0
 0.511 ( 0.006 ms): sleep/10152 mmap(len: 113045344, prot: READ, flags: 
PRIVATE, fd: 3) = 0x7f8ff7d0d000
 0.524 ( 0.002 ms): sleep/10152 close(fd: 3) = 0
 0.574 (1000.140 ms): sleep/10152 nanosleep(rqtp: 0x7ffecdd29130) = 0
  1000.753 ( 0.007 ms): sleep/10152 close(fd: 1) = 0
  1000.767 ( 0.004 ms): sleep/10152 close(fd: 2) = 0
  1000.781 ( ): sleep/10152 exit_group()
[root@seventh perf]# 

 1  // SPDX-License-Identifier: GPL-2.0
 2  /*
 3   * Augment the raw_syscalls tracepoints with the contents of the 
pointer arguments.
 4   *
 5   * Test it with:
 6   *
 7   * perf trace -e tools/perf/examples/bpf/augmented_raw_syscalls.c cat 
/etc/passwd > /dev/null
 8   *
 9   * This exactly matches what is marshalled into the 
raw_syscall:sys_enter
10   * payload expected by the 'perf trace' beautifiers.
11   *
12   * For now it just uses the existing tracepoint augmentation code in 
'perf
13   * trace', in the next csets we'll hook up these with the 
sys_enter/sys_exit
14   * code that will combine entry/exit in a strace like way.
15   */
   
16  #include 
17  #include 
   
18  /* bpf-output associated map */
19  struct bpf_map SEC("maps") __augmented_syscalls__ = {
20  .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
21  .key_size = sizeof(int),
22  .value_size = sizeof(u32),
23  .max_entries = __NR_CPUS__,
24  };
   
25  struct syscall_enter_args {
26  unsigned long long common_tp_fields;
27  long   syscall_nr;
28  unsigned long  args[6];
29  };
   
30  struct syscall_exit_args {
31  unsigned long long common_tp_fields;
32  

[Patch net] net: drop skb on failure in ip_check_defrag()

2018-11-01 Thread Cong Wang
Most callers of pskb_trim_rcsum() simply drop the skb when
it fails, however, ip_check_defrag() still continues to pass
the skb up to stack. This is suspicious.

In ip_check_defrag(), after we learn the skb is an IP fragment,
passing the skb to callers makes no sense, because callers expect
fragments are defrag'ed on success. So, dropping the skb when we
can't defrag it is reasonable.

Note, prior to commit 88078d98d1bb, this is not a big problem as
checksum will be fixed up anyway. After it, the checksum is not
correct on failure.

Found this during code review.

Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 net/ipv4/ip_fragment.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9b0158fa431f..d6ee343fdb86 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -722,10 +722,14 @@ struct sk_buff *ip_check_defrag(struct net *net, struct 
sk_buff *skb, u32 user)
if (ip_is_fragment(&iph)) {
skb = skb_share_check(skb, GFP_ATOMIC);
if (skb) {
-   if (!pskb_may_pull(skb, netoff + iph.ihl * 4))
-   return skb;
-   if (pskb_trim_rcsum(skb, netoff + len))
-   return skb;
+   if (!pskb_may_pull(skb, netoff + iph.ihl * 4)) {
+   kfree_skb(skb);
+   return NULL;
+   }
+   if (pskb_trim_rcsum(skb, netoff + len)) {
+   kfree_skb(skb);
+   return NULL;
+   }
memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
if (ip_defrag(net, skb, user))
return NULL;
-- 
2.14.5



Re: [Patch net] net: drop skb on failure in ip_check_defrag()

2018-11-01 Thread Eric Dumazet



On 11/01/2018 12:02 PM, Cong Wang wrote:
> Most callers of pskb_trim_rcsum() simply drop the skb when
> it fails, however, ip_check_defrag() still continues to pass
> the skb up to stack. This is suspicious.
> 
> In ip_check_defrag(), after we learn the skb is an IP fragment,
> passing the skb to callers makes no sense, because callers expect
> fragments are defrag'ed on success. So, dropping the skb when we
> can't defrag it is reasonable.
> 
> Note, prior to commit 88078d98d1bb, this is not a big problem as
> checksum will be fixed up anyway. After it, the checksum is not
> correct on failure.
> 
> Found this during code review.
> 
> Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
> friends")
> Cc: Eric Dumazet 
> Signed-off-by: Cong Wang 

Thanks Cong !

Reviewed-by: Eric Dumazet 



Re: Help with the BPF verifier

2018-11-01 Thread David Miller
From: Arnaldo Carvalho de Melo 
Date: Thu, 1 Nov 2018 15:52:17 -0300

> 50unsigned int filename_arg = 6;
 ...
> --- /wb/augmented_raw_syscalls.c.old  2018-11-01 15:43:55.000394234 -0300
> +++ /wb/augmented_raw_syscalls.c  2018-11-01 15:44:15.102367838 -0300
> @@ -67,7 +67,7 @@
>   augmented_args.filename.reserved = 0;
>   augmented_args.filename.size = 
> probe_read_str(&augmented_args.filename.value,
> 
> sizeof(augmented_args.filename.value),
> -   (const void 
> *)args->args[0]);
> +   (const void 
> *)args->args[filename_arg]);

args[] is sized to '6', therefore the last valid index is '5', yet you're using 
'6' here which
is one entry past the end of the declared array.


Re: Help with the BPF verifier

2018-11-01 Thread Arnaldo Carvalho de Melo
Em Thu, Nov 01, 2018 at 12:10:39PM -0700, David Miller escreveu:
> From: Arnaldo Carvalho de Melo 
> Date: Thu, 1 Nov 2018 15:52:17 -0300
> 
> > 50  unsigned int filename_arg = 6;
>  ...
> > --- /wb/augmented_raw_syscalls.c.old2018-11-01 15:43:55.000394234 
> > -0300
> > +++ /wb/augmented_raw_syscalls.c2018-11-01 15:44:15.102367838 -0300
> > @@ -67,7 +67,7 @@
> > augmented_args.filename.reserved = 0;
> > augmented_args.filename.size = 
> > probe_read_str(&augmented_args.filename.value,
> >   
> > sizeof(augmented_args.filename.value),
> > - (const void 
> > *)args->args[0]);
> > + (const void 
> > *)args->args[filename_arg]);
> 
> args[] is sized to '6', therefore the last valid index is '5', yet you're 
> using '6' here which
> is one entry past the end of the declared array.

Nope... this is inside an if:

if (filename_arg <= 5) {
augmented_args.filename.reserved = 0;
augmented_args.filename.size = 
probe_read_str(&augmented_args.filename.value,
  
sizeof(augmented_args.filename.value),
  (const void 
*)args->args[filename_arg]);
if (augmented_args.filename.size < 
sizeof(augmented_args.filename.value)) {
len -= sizeof(augmented_args.filename.value) - 
augmented_args.filename.size;
len &= sizeof(augmented_args.filename.value) - 1;
}
} else {

I use 6 to mean "hey, this syscall doesn't have any string argument, don't
bother with it".

- Arnaldo


Re: [PATCH bpf 1/4] bpf: fix partial copy of map_ptr when dst is scalar

2018-11-01 Thread Edward Cree
On 31/10/18 23:05, Daniel Borkmann wrote:
> ALU operations on pointers such as scalar_reg += map_value_ptr are
> handled in adjust_ptr_min_max_vals(). Problem is however that map_ptr
> and range in the register state share a union, so transferring state
> through dst_reg->range = ptr_reg->range is just buggy as any new
> map_ptr in the dst_reg is then truncated (or null) for subsequent
> checks. Fix this by adding a raw member and use it for copying state
> over to dst_reg.
>
> Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
> Signed-off-by: Daniel Borkmann 
> Cc: Edward Cree 
> Acked-by: Alexei Starovoitov 
> ---
Acked-by: Edward Cree 
(though I apparently missed the 63-minute window to hit the git record...)


Re: [PATCH bpf 1/4] bpf: fix partial copy of map_ptr when dst is scalar

2018-11-01 Thread Arnaldo Carvalho de Melo
Em Thu, Nov 01, 2018 at 07:17:29PM +, Edward Cree escreveu:
> On 31/10/18 23:05, Daniel Borkmann wrote:
> > ALU operations on pointers such as scalar_reg += map_value_ptr are
> > handled in adjust_ptr_min_max_vals(). Problem is however that map_ptr
> > and range in the register state share a union, so transferring state
> > through dst_reg->range = ptr_reg->range is just buggy as any new
> > map_ptr in the dst_reg is then truncated (or null) for subsequent
> > checks. Fix this by adding a raw member and use it for copying state
> > over to dst_reg.
> >
> > Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
> > Signed-off-by: Daniel Borkmann 
> > Cc: Edward Cree 
> > Acked-by: Alexei Starovoitov 
> > ---
> Acked-by: Edward Cree 
> (though I apparently missed the 63-minute window to hit the git record...)

Those guys are fast! :-)

- Arnaldo


RE: [PATCH net-next] igc: Remove set but not used variables 'ctrl_ext, link_mode'

2018-11-01 Thread Brown, Aaron F
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of YueHaibing
> Sent: Friday, October 19, 2018 5:41 AM
> To: Kirsher, Jeffrey T ; Neftin, Sasha
> 
> Cc: YueHaibing ; intel-wired-...@lists.osuosl.org;
> netdev@vger.kernel.org; kernel-janit...@vger.kernel.org
> Subject: [PATCH net-next] igc: Remove set but not used variables 'ctrl_ext,
> link_mode'
> 
> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/ethernet/intel/igc/igc_base.c: In function
> 'igc_init_phy_params_base':
> drivers/net/ethernet/intel/igc/igc_base.c:240:6: warning:
>  variable 'ctrl_ext' set but not used [-Wunused-but-set-variable]
>   u32 ctrl_ext;
> 
> drivers/net/ethernet/intel/igc/igc_base.c: In function
> 'igc_get_invariants_base':
> drivers/net/ethernet/intel/igc/igc_base.c:290:6: warning:
>  variable 'link_mode' set but not used [-Wunused-but-set-variable]
>   u32 link_mode = 0;
> 
> It never used since introduction in
> commit c0071c7aa5fe ("igc: Add HW initialization code")
> 
> Signed-off-by: YueHaibing 
> ---
> I'm not sure that reading IGC_CTRL_EXT is necessary.
> ---
>  drivers/net/ethernet/intel/igc/igc_base.c | 8 
>  1 file changed, 8 deletions(-)
> 

Tested-by: Aaron Brown 



RE: [PATCH net-next] igc: Remove set but not used variable 'pci_using_dac'

2018-11-01 Thread Brown, Aaron F
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of YueHaibing
> Sent: Friday, October 19, 2018 5:48 AM
> To: Kirsher, Jeffrey T ; David S. Miller
> ; Neftin, Sasha 
> Cc: YueHaibing ; intel-wired-...@lists.osuosl.org;
> netdev@vger.kernel.org; kernel-janit...@vger.kernel.org
> Subject: [PATCH net-next] igc: Remove set but not used variable
> 'pci_using_dac'
> 
> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/ethernet/intel/igc/igc_main.c: In function 'igc_probe':
> drivers/net/ethernet/intel/igc/igc_main.c:3535:11: warning:
>  variable 'pci_using_dac' set but not used [-Wunused-but-set-variable]
> 
> It never used since introduction in commit
> d89f88419f99 ("igc: Add skeletal frame for Intel(R) 2.5G Ethernet Controller
> support")
> 
> Signed-off-by: YueHaibing 
> ---
>  drivers/net/ethernet/intel/igc/igc_main.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 

Tested-by: Aaron Brown 


RE: [Intel-wired-lan] [PATCH][next] igc: fix error return handling from call to netif_set_real_num_tx_queues

2018-11-01 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Colin King
> Sent: Friday, October 19, 2018 11:16 AM
> To: Neftin, Sasha ; Kirsher, Jeffrey T
> ; David S . Miller ;
> intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; kernel-janit...@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH][next] igc: fix error return handling from
> call to netif_set_real_num_tx_queues
> 
> From: Colin Ian King 
> 
> The call to netif_set_real_num_tx_queues is not assigning the error
> return to variable err even though the next line checks err for an
> error.  Fix this by adding the missing err assignment.
> 
> Detected by CoverityScan, CID#1474551 ("Logically dead code")
> 
> Fixes: 3df25e4c1e66 ("igc: Add interrupt support")
> Signed-off-by: Colin Ian King 
> ---
>  drivers/net/ethernet/intel/igc/igc_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Tested-by: Aaron Brown 


Re: Help with the BPF verifier

2018-11-01 Thread David Miller
From: Arnaldo Carvalho de Melo 
Date: Thu, 1 Nov 2018 16:13:10 -0300

> Nope... this is inside an if:
> 
> if (filename_arg <= 5) {
> augmented_args.filename.reserved = 0;
> augmented_args.filename.size = 
> probe_read_str(&augmented_args.filename.value,
>   
> sizeof(augmented_args.filename.value),
>   (const void 
> *)args->args[filename_arg]);
> if (augmented_args.filename.size < 
> sizeof(augmented_args.filename.value)) {
> len -= sizeof(augmented_args.filename.value) - 
> augmented_args.filename.size;
> len &= sizeof(augmented_args.filename.value) - 1;
> }
> } else {
> 
> I use 6 to mean "hey, this syscall doesn't have any string argument, don't
> bother with it".

Really weird.  And it's unsigned so I can't imagine it wants you to
check that it's >= 0...

Maybe Daniel or someone else can figure it out.


Re: [PATCH net] qmi_wwan: Support dynamic config on Quectel EP06

2018-11-01 Thread Kristian Evensen
On Thu, Nov 1, 2018 at 6:40 PM Kristian Evensen
 wrote:
>
> Hi,
>
> On Sat, Sep 8, 2018 at 1:50 PM Kristian Evensen
>  wrote:
> > Quectel EP06 (and EM06/EG06) supports dynamic configuration of USB
> > interfaces, without the device changing VID/PID or configuration number.
> > When the configuration is updated and interfaces are added/removed, the
> > interface numbers change. This means that the current code for matching
> > EP06 does not work.
>
> Would it be possible to have this patch added to stable? I checked
> both the 4.14-tree and the stable queue, but could not find the
> patch/upstream commit.

Please ignore this request. I discovered the patch does not apply to 4.14,
so I will do a manual backport and send it to stable.

BR,
Kristian


Re: [PATCH net] qmi_wwan: Support dynamic config on Quectel EP06

2018-11-01 Thread Kristian Evensen
On Thu, Nov 1, 2018 at 8:30 PM Kristian Evensen
 wrote:
>
> On Thu, Nov 1, 2018 at 6:40 PM Kristian Evensen
>  wrote:
> >
> > Hi,
> >
> > On Sat, Sep 8, 2018 at 1:50 PM Kristian Evensen
> >  wrote:
> > > Quectel EP06 (and EM06/EG06) supports dynamic configuration of USB
> > > interfaces, without the device changing VID/PID or configuration number.
> > > When the configuration is updated and interfaces are added/removed, the
> > > interface numbers change. This means that the current code for matching
> > > EP06 does not work.
> >
> > Would it be possible to have this patch added to stable? I checked
> > both the 4.14-tree and the stable queue, but could not find the
> > patch/upstream commit.
>
> Please ignore this request. I discovered the patch does not apply to 4.14,
> so I will do a manual backport and send it to stable.

Blah, it is clearly not my day today. I discovered that my 4.14 build
directory was dirty and that the patch applies fine on top of
4.14.78. Sorry about the extra noise, and please do not ignore my
request for stable :)

BR,
Kristian


Re: [PATCH iproute2] ip rule: Require at least one argument for add

2018-11-01 Thread Stephen Hemminger
On Tue, 30 Oct 2018 13:59:05 -0700
David Ahern  wrote:

> From: David Ahern 
> 
> 'ip rule add' with no additional arguments just adds another rule
> for the main table - which exists by default. Require at least
> 1 argument similar to delete.
> 
> Signed-off-by: David Ahern 

Applied these two


Re: Help with the BPF verifier

2018-11-01 Thread Edward Cree
On 01/11/18 18:52, Arnaldo Carvalho de Melo wrote:
>  R0=inv(id=0) R1=inv6 R2=inv6 R3=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv64 
> R10=fp0,call_-1
> 15: (b7) r2 = 0
> 16: (63) *(u32 *)(r10 -260) = r2
> 17: (67) r1 <<= 32
> 18: (77) r1 >>= 32
> 19: (67) r1 <<= 3
> 20: (bf) r2 = r6
> 21: (0f) r2 += r1
> 22: (79) r3 = *(u64 *)(r2 +16)
> R2 invalid mem access 'inv'
I wonder if you could run this with verifier log level 2?  (I'm not sure how
 you would go about plumbing that through the perf tooling.)  It seems very
 odd that it ends up with R2=inv, and I'm wondering whether R1 becomes unknown
 during the shifts or whether the addition in insn 21 somehow produces the
 unknown-ness.  (I know we used to have a thing[1] where doing ptr += K and
 then also having an offset in the LDX produced an error about
 ptr+const+const, but that seems to have been fixed at some point.)

Note however that even if we get past this, R1 at this point holds 6, so it
 looks like the verifier is walking the impossible path where we're inside the
 'if' even though filename_arg = 6.  This is a (slightly annoying) verifier
 limitation, that it walks paths with impossible combinations of constraints
 (we've previously had cases where assertions in the verifier would blow up
 because of this, e.g. registers with max_val less than min_val).  So if the
 check_ctx_access() is going to worry about whether you're off the end of the
 array (I'm not sure what your program type is and thus which is_valid_access
 callback is involved), then it'll complain about this.
If filename_arg came from some external source you'd have a different
 problem, because then it would have a totally unknown value, that on entering
 the 'if' becomes "unknown but < 6", which is still too variable to have as
 the offset of a ctx access.  Those have to be at a known constant offset, so
 that we can determine the type of the returned value.

As a way to fix this, how about [UNTESTED!]:
    const void *filename_arg = NULL;
    /* ... */
    switch (augmented_args.args.syscall_nr) {
        case SYS_OPEN: filename_arg = args->args[0]; break;
        case SYS_OPENAT: filename_arg = args->args[1]; break;
    }
    /* ... */
    if (filename_arg) {
    /* stuff */
    blah = probe_read_str(/* ... */, filename_arg);
    } else {
    /* the other stuff */
    }
That way, you're only ever dealing in constant pointers (although judging by
 an old thread I found[1] about ptr+const+const, the compiler might decide to
 make some optimisations that end up looking like your existing code).

As for what you want to do with the index coming from userspace, the verifier
 will not like that at all, as mentioned above, so I think you'll need to do
 something like:
    switch (filename_arg_from_userspace) {
    case 0: filename_arg = args->args[0]; break;
    case 1: filename_arg = args->args[1]; break;
    /* etc */
    default: filename_arg = NULL;
    }
 thus ensuring that you only ever have ctx pointers with constant offsets.

-Ed

[1]: https://lists.iovisor.org/g/iovisor-dev/topic/21386327#1302


Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread Saeed Mahameed
On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:
> On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
> wrote:
> ... ...
> > Section copied out:
> > 
> >   mlx5e_poll_tx_cq
> >   |  
> >--16.34%--napi_consume_skb
> >  |  
> >  |--12.65%--__free_pages_ok
> >  |  |  
> >  |   --11.86%--free_one_page
> >  | |  
> >  | |--10.10%
> > --queued_spin_lock_slowpath
> >  | |  
> >  |  --0.65%--_raw_spin_lock
> 
> This callchain looks like it is freeing higher order pages than order
> 0:
> __free_pages_ok is only called for pages whose order are bigger than
> 0.

mlx5 RX uses only order-0 pages, so I don't know where these high-order
TX SKBs are coming from.

> 
> >  |  
> >  |--1.55%--page_frag_free
> >  |  
> >   --1.44%--skb_release_data
> > 
> > 
> > Let me explain what (I think) happens.  The mlx5 driver RX-page
> > recycle
> > mechanism is not effective in this workload, and pages have to go
> > through the page allocator.  The lock contention happens during
> > mlx5
> > DMA TX completion cycle.  And the page allocator cannot keep up at
> > these speeds.
> > 
> > One solution is extend page allocator with a bulk free API.  (This
> > have
> > been on my TODO list for a long time, but I don't have a
> > micro-benchmark that trick the driver page-recycle to fail).  It
> > should
> > fit nicely, as I can see that kmem_cache_free_bulk() does get
> > activated (bulk freeing SKBs), which means that DMA TX completion
> > do
> > have a bulk of packets. 
> > 
> > We can (and should) also improve the page recycle scheme in the
> > driver.
> > After LPC, I have a project with Tariq and Ilias (Cc'ed) to improve
> > the
> > page_pool, and we will (attempt) to generalize this, for both high-
> > end
> > mlx5 and more low-end ARM64-boards (macchiatobin and espressobin).
> > 
> > The MM-people is in parallel working to improve the performance of
> > order-0 page returns.  Thus, the explicit page bulk free API might
> > actually become less important.  I actually think (Cc.) Aaron have
> > a
> > patchset he would like you to test, which removes the (zone->)lock
> > you hit in free_one_page().
> 
> Thanks Jesper.
> 
> Yes, the said patchset is in this branch:
> https://github.com/aaronlu/linux no_merge_cluster_alloc_4.19-rc5
> 
> But as I said above, I think the lock contention here is for
> order > 0 pages so my current patchset will not work here,
> unfortunately.
> 
> BTW, Mel Gorman has suggested an alternative way to improve page
> allocator's scalability and I'm working on it right now, it will
> improve page allocator's scalability for all order pages. I might be
> able to post it some time next week, will CC all of you when it's
> ready.


Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread Saeed Mahameed
On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:
> 
> W dniu 01.11.2018 o 10:50, Saeed Mahameed pisze:
> > On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:
> > > Hi
> > > 
> > > So maybee someone will be interested how linux kernel handles
> > > normal
> > > traffic (not pktgen :) )
> > > 
> > > 
> > > Server HW configuration:
> > > 
> > > CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> > > 
> > > NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
> > > 
> > > 
> > > Server software:
> > > 
> > > FRR - as routing daemon
> > > 
> > > enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to
> > > local
> > > numa
> > > node)
> > > 
> > > enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local
> > > numa
> > > node)
> > > 
> > > 
> > > Maximum traffic that server can handle:
> > > 
> > > Bandwidth
> > > 
> > >bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> > > input: /proc/net/dev type: rate
> > > \ iface   Rx TxTotal
> > > =
> > > 
> > > =
> > >  enp175s0f1:  28.51 Gb/s   37.24
> > > Gb/s
> > > 65.74 Gb/s
> > >  enp175s0f0:  38.07 Gb/s   28.44
> > > Gb/s
> > > 66.51 Gb/s
> > > ---
> > > 
> > > ---
> > >   total:  66.58 Gb/s   65.67
> > > Gb/s
> > > 132.25 Gb/s
> > > 
> > > 
> > > Packets per second:
> > > 
> > >    bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
> > >    input: /proc/net/dev type: rate
> > >    - iface             Rx                  Tx                  Total
> > >    ================================================================
> > >    enp175s0f1:     5248589.00 P/s      3486617.75 P/s      8735207.00 P/s
> > >    enp175s0f0:     3557944.25 P/s      5232516.00 P/s      8790460.00 P/s
> > >    ----------------------------------------------------------------
> > >    total:          8806533.00 P/s      8719134.00 P/s     17525668.00 P/s
> > > 
> > > 
> > > After reaching those limits, the nics on the upstream side (more RX
> > > traffic) start to drop packets
> > > 
> > > 
> > > I just don't understand why the server can't handle more bandwidth
> > > (~40Gbit/s is the limit where all cpus are 100% utilized) - while
> > > pps on the RX side keep increasing.
> > > 
> > 
> > Where do you see 40 Gb/s ? you showed that both ports on the same
> > NIC (
> > same pcie link) are doing  66.58 Gb/s (RX) + 65.67 Gb/s (TX) =
> > 132.25
> > Gb/s which aligns with your pcie link limit, what am i missing ?
> 
> hmm yes that was my concern also - cause I can't find anywhere
> information on whether that bandwidth figure is unidirectional or
> bidirectional - so if 126 Gbit/s for x16 8GT is unidirectional, then
> bidirectional would be 126/2 ≈ 63 Gbit/s - which would roughly match
> the total bw on both ports

i think it is bidir 
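(Editorial aside: the raw figure being debated can be checked from the link parameters. Note that the PCIe spec rates each direction independently - a link is a pair of simplex channels - so the arithmetic below gives per-direction throughput, before TLP/DLLP protocol overhead.)

```python
# x16 PCIe gen3 ("8GT"): 8 GT/s per lane with 128b/130b line encoding.
lanes = 16
raw_gts_per_lane = 8.0              # giga-transfers/s per lane, one direction
encoding = 128 / 130                # 128b/130b coding efficiency
per_direction_gbps = raw_gts_per_lane * encoding * lanes
print(round(per_direction_gbps, 1)) # 126.0 -> the "126Gbit" figure,
                                    # available in each direction concurrently
```

This is where the 126 Gbit number in the thread comes from; whether the observed 132 Gb/s aggregate actually saturates the link depends on descriptor/completion overhead, not on halving that figure.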

> This could maybe also explain why cpu load rises rapidly from
> 120Gbit/s in total to 132Gbit (the bwm-ng counters come from /proc/net -
> so there can be some error in reading them when offloading (gro/gso/tso)
> is enabled on the nics)
> 
> > 
> > > Was thinking that maybe I reached some pcie x16 limit - but x16 8GT
> > > is 126Gbit - and also when testing with pktgen I can reach more bw
> > > and pps (like 4x more compared to normal internet traffic)
> > > 
> > 
> > Are you forwarding when using pktgen as well or you just testing
> > the RX
> > side pps ?
> 
> Yes pktgen was tested on single port RX
> Can check also forwarding to eliminate pciex limits
> 

So this explains why you have more RX pps, since tx is idle and pcie
will be free to do only rx.

[...]


> > 
> > > ethtool -S enp175s0f1
> > > NIC statistics:
> > >rx_packets: 173730800927
> > >rx_bytes: 99827422751332
> > >tx_packets: 142532009512
> > >tx_bytes: 184633045911222
> > >tx_tso_packets: 25989113891
> > >tx_tso_bytes: 132933363384458
> > >tx_tso_inner_packets: 0
> > >tx_tso_inner_bytes: 0
> > >tx_added_vlan_packets: 74630239613
> > >tx_nop: 2029817748
> > >rx_lro_packets: 0
> > >rx_lro_bytes: 0
> > >rx_ecn_mark: 0
> > >rx_removed_vlan_packets: 173730800927
> > >rx_csum_unnecessary: 0
> > >rx_csum_none: 434357
> > >rx_csum_complete: 173730366570
> > >rx_csum_unnecessary_inner: 0
> > >rx_xdp_drop: 0
> > >rx_xdp_redirect: 0
> > >rx_xdp_tx_xmit: 0
> > >rx_xdp_tx_full: 0
> > >rx_xdp_tx_err: 0
> > >rx_xdp_tx_cqe: 0
> > >tx_csum_none: 38260960853
> > >tx_csum_partial: 36369278774
> > >tx_csum_partial_inner: 0
> > >tx_queue_stopped: 1
> > >tx_queue_dropped: 0
> > >tx_xmit_more: 748638099
> > > 

Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets

2018-11-01 Thread Leif Hedstrom



> On Oct 31, 2018, at 6:53 PM, Eric Dumazet  wrote:
> 
> 
> 
> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
>> Implementations of Quic might want to create a separate socket for each
>> Quic-connection by creating a connected UDP-socket.
>> 
> 
> Nice proposal, but I doubt a QUIC server can afford having one UDP socket per 
> connection ?

First thing: This is an idea we’ve been floating, and it’s not completed yet, 
so we don’t have any performance numbers etc. to share. The ideas for the 
implementation came up after a discussion with Ian and Jana re: their 
implementation of a QUIC server.

That much said, the general rationale for this is that having a socket for each 
QUIC connection could simplify integrating QUIC into existing software that 
already does epoll() over TCP sockets. This is how e.g. Apache Traffic Server 
works, which is our target implementation for QUIC.



> 
> It would add a huge overhead in term of memory usage in the kernel,
> and lots of epoll events to manage (say a QUIC server with one million flows, 
> receiving
> very few packets per second per flow)

Our use case is not millions of sockets, rather, 10’s of thousands. There would 
be one socket for each QUIC Connection, not per stream (obviously). At ~80Gbps 
on a box, we definitely see much less than 100k TCP connections.

Question: is there additional memory overhead here for the UDP sockets vs a 
normal TCP socket for e.g. HTTP or HTTP/2 ?


> 
> Maybe you could elaborate on the need of having one UDP socket per connection.


We had a couple reasons:

1) Easier to integrate with existing epoll() based event processing

2) Possibly less CPU usage / faster handling, since scheduling is simplified 
with the epoll integration (untested)


Ian and Jana also had a couple of reasons why this delayed bind could be useful 
for their implementations, but I’ll leave it to them to go into details.

Cheers,

— leif
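For readers unfamiliar with the pattern under discussion, the delayed-bind idea can already be sketched from userspace on Linux. This is a toy sketch, not the proposed kernel change: it relies on SO_REUSEADDR to double-bind the local port and on the Linux UDP demux preferring a connect()ed socket for a matching 4-tuple; the addresses are illustrative.

```python
import socket

# "Listening" socket that accepts the first packet of any new flow.
lsock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
lsock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
lsock.bind(("127.0.0.1", 0))       # ephemeral port for the demo
lsock.settimeout(5)
addr = lsock.getsockname()

# A client opens a new "connection" by sending its first datagram.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"hello", addr)
data, peer = lsock.recvfrom(2048)  # server learns the peer's address here

# Delayed bind: a per-connection socket on the same local addr:port,
# connect()ed to the peer. The kernel now demuxes this flow to it, so
# the fd can sit in an ordinary epoll() loop next to TCP sockets.
csock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
csock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
csock.bind(addr)
csock.connect(peer)
csock.settimeout(5)

client.sendto(b"next", addr)
print(csock.recv(2048))            # b'next' - lands on the connected socket
```

The RFC in this thread is about making this handoff cheaper and race-free in the kernel rather than emulating it with a second bind from userspace.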



Re: [Patch net] net: drop skb on failure in ip_check_defrag()

2018-11-01 Thread David Miller
From: Cong Wang 
Date: Thu,  1 Nov 2018 12:02:37 -0700

> Most callers of pskb_trim_rcsum() simply drop the skb when
> it fails, however, ip_check_defrag() still continues to pass
> the skb up to stack. This is suspicious.
> 
> In ip_check_defrag(), after we learn the skb is an IP fragment,
> passing the skb to callers makes no sense, because callers expect
> fragments are defrag'ed on success. So, dropping the skb when we
> can't defrag it is reasonable.
> 
> Note, prior to commit 88078d98d1bb, this is not a big problem as
> checksum will be fixed up anyway. After it, the checksum is not
> correct on failure.
> 
> Found this during code review.
> 
> Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
> friends")
> Cc: Eric Dumazet 
> Signed-off-by: Cong Wang 

Applied and queued up for -stable, thanks!


Re: [RFC PATCH v3 01/10] udp: implement complete book-keeping for encap_needed

2018-11-01 Thread Willem de Bruijn
On Tue, Oct 30, 2018 at 1:28 PM Paolo Abeni  wrote:
>
> The *encap_needed static keys are enabled by UDP tunnels
> and several UDP encapsulations type, but they are never
> turned off. This can cause unneeded overall performance
> degradation for systems where such features are used
> transiently.
>
> This patch introduces complete book-keeping for such keys,
> decreasing the usage at socket destruction time, if needed,
> and avoiding that the same socket could increase the key
> usage multiple times.
>
> rfc v2 -> rfc v3:
>  - use udp_tunnel_encap_enable() in setsockopt()
>
> Signed-off-by: Paolo Abeni 

> @@ -2447,7 +2452,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int 
> optname,
> /* FALLTHROUGH */
> case UDP_ENCAP_L2TPINUDP:
> up->encap_type = val;
> -   udp_encap_enable();
> +   udp_tunnel_encap_enable(sk->sk_socket);

this now also needs lock_sock?
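The book-keeping the commit message describes amounts to per-socket reference counting of a global static key. A toy model of the invariant (plain Python, not kernel code; all names here are made up for illustration):

```python
# Toy model: the "key" stays enabled while at least one socket uses
# encapsulation, and a socket may account for itself at most once,
# even if setsockopt() is called repeatedly.
class EncapKey:
    def __init__(self):
        self.users = 0

    @property
    def enabled(self):
        return self.users > 0

class UdpSock:
    def __init__(self, key):
        self.key = key
        self.accounted = False

    def enable_encap(self):
        if not self.accounted:      # bump the key only once per socket
            self.accounted = True
            self.key.users += 1

    def destroy(self):
        if self.accounted:          # socket destruction drops its reference
            self.accounted = False
            self.key.users -= 1

key = EncapKey()
s1, s2 = UdpSock(key), UdpSock(key)
s1.enable_encap(); s1.enable_encap()   # double enable counts once
s2.enable_encap()
assert key.enabled and key.users == 2
s1.destroy(); s2.destroy()
assert not key.enabled                 # key turns off with the last user
```

Willem's question above is about whether the real setsockopt() path needs the socket lock to keep this accounting race-free.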


Re: [RFC PATCH v3 08/10] selftests: conditionally enable XDP support in udpgso_bench_rx

2018-11-01 Thread Willem de Bruijn
On Tue, Oct 30, 2018 at 1:28 PM Paolo Abeni  wrote:
>
> XDP support will be used by a later patch to test the GRO path
> in a net namespace, leveraging the veth XDP implementation.
> To avoid breaking existing setup, XDP support is conditionally
> enabled and built only if llc is locally available.
>
> rfc v2 -> rfc v3:
>  - move 'x' option handling here
>
> Signed-off-by: Paolo Abeni 
> ---
>  tools/testing/selftests/net/Makefile  | 69 +++
>  tools/testing/selftests/net/udpgso_bench_rx.c | 41 ++-
>  tools/testing/selftests/net/xdp_dummy.c   | 13 
>  3 files changed, 121 insertions(+), 2 deletions(-)
>  create mode 100644 tools/testing/selftests/net/xdp_dummy.c
>
> diff --git a/tools/testing/selftests/net/Makefile 
> b/tools/testing/selftests/net/Makefile
> index 256d82d5fa87..176459b7c4d6 100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile
> @@ -16,8 +16,77 @@ TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu 
> reuseport_bpf_numa
>  TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls
>
>  KSFT_KHDR_INSTALL := 1
> +
> +# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
> cmdline:
> +#  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
> CLANG=~/git/llvm/build/bin/clang
> +LLC ?= llc
> +CLANG ?= clang
> +LLVM_OBJCOPY ?= llvm-objcopy
> +BTF_PAHOLE ?= pahole
> +HAS_LLC := $(shell which $(LLC) 2>/dev/null)
> +
> +# conditionally enable tests requiring llc
> +ifneq (, $(HAS_LLC))
> +TEST_GEN_FILES += xdp_dummy.o
> +endif
> +
>  include ../lib.mk
>
> +ifneq (, $(HAS_LLC))
> +
> +# Detect that we're cross compiling and use the cross compiler
> +ifdef CROSS_COMPILE
> +CLANG_ARCH_ARGS = -target $(ARCH)
> +endif
> +
> +PROBE := $(shell $(LLC) -march=bpf -mcpu=probe -filetype=null /dev/null 2>&1)
> +
> +# Let newer LLVM versions transparently probe the kernel for availability
> +# of full BPF instruction set.
> +ifeq ($(PROBE),)
> +  CPU ?= probe
> +else
> +  CPU ?= generic
> +endif
> +
> +SRC_PATH := $(abspath ../../../..)
> +LIB_PATH := $(SRC_PATH)/tools/lib
> +XDP_CFLAGS := -D SUPPORT_XDP=1 -I$(LIB_PATH)
> +LIBBPF = $(LIB_PATH)/bpf/libbpf.a
> +BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
> +BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
> +BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 
> 'usage.*llvm')
> +CLANG_SYS_INCLUDES := $(shell $(CLANG) -v -E - </dev/null 2>&1 \
> +| sed -n '/<...> search starts here:/,/End of search list./{ s| 
> \(/.*\)|-idirafter \1|p }')
> +CLANG_FLAGS = -I. -I$(SRC_PATH)/include -I../bpf/ \
> + $(CLANG_SYS_INCLUDES) -Wno-compare-distinct-pointer-types
> +
> +ifneq ($(and $(BTF_LLC_PROBE),$(BTF_PAHOLE_PROBE),$(BTF_OBJCOPY_PROBE)),)
> +   CLANG_CFLAGS += -g
> +   LLC_FLAGS += -mattr=dwarfris
> +   DWARF2BTF = y
> +endif
> +
> +$(LIBBPF): FORCE
> +# Fix up variables inherited from Kbuild that tools/ build system won't like
> +   $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(SRC_PATH) O= 
> $(nodir $@)
> +
> +$(OUTPUT)/udpgso_bench_rx: $(OUTPUT)/udpgso_bench_rx.c $(LIBBPF)
> +   $(CC) -o $@ $(XDP_CFLAGS) $(CFLAGS) $(LOADLIBES) $(LDLIBS) $^ -lelf
> +
> +FORCE:
> +
> +# bpf program[s] generation
> +$(OUTPUT)/%.o: %.c
> +   $(CLANG) $(CLANG_FLAGS) \
> +-O2 -target bpf -emit-llvm -c $< -o - |  \
> +   $(LLC) -march=bpf -mcpu=$(CPU) $(LLC_FLAGS) -filetype=obj -o $@
> +ifeq ($(DWARF2BTF),y)
> +   $(BTF_PAHOLE) -J $@
> +endif
> +
> +endif
> +

To get around having to add all this Makefile boilerplate, perhaps
don't integrate the xdp loader with udpgso_bench_rx, but add a
standalone trivial loader to /tools/testing/selftests/bpf. Akin to
samples/xdp1_user.c, but even simpler because without map. Then also
move xdp_dummy.c there.

That does add a cross-directory dependency, but that is not
significantly different from the SUPPORT_XDP conditional dependency
today.

On which note, the test initially silently failed for me because the
binary compiled, but without that option. It is probably better to
fail the test hard and with a clear error if XDP is not supported.
With that aside,

Tested-by: Willem de Bruijn 


Re: [RFC PATCH v3 06/10] udp: cope with UDP GRO packet misdirection

2018-11-01 Thread Willem de Bruijn
On Wed, Oct 31, 2018 at 5:57 AM Paolo Abeni  wrote:
>
> On Tue, 2018-10-30 at 18:24 +0100, Paolo Abeni wrote:
> > --- a/include/net/udp.h
> > +++ b/include/net/udp.h
> > @@ -406,17 +406,24 @@ static inline int copy_linear_skb(struct sk_buff 
> > *skb, int len, int off,
> >  } while(0)
> >
> >  #if IS_ENABLED(CONFIG_IPV6)
> > -#define __UDPX_INC_STATS(sk, field)  \
> > -do { \
> > - if ((sk)->sk_family == AF_INET) \
> > - __UDP_INC_STATS(sock_net(sk), field, 0);\
> > - else\
> > - __UDP6_INC_STATS(sock_net(sk), field, 0);   \
> > -} while (0)
> > +#define __UDPX_MIB(sk, ipv4) \
> > +({   \
> > + ipv4 ? (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \
> > +  sock_net(sk)->mib.udp_statistics) :\
> > + (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_stats_in6 : \
> > +  sock_net(sk)->mib.udp_stats_in6);  \
> > +})
> >  #else
> > -#define __UDPX_INC_STATS(sk, field) __UDP_INC_STATS(sock_net(sk), field, 0)
> > +#define __UDPX_MIB(sk, ipv4) \
> > +({   \
> > + IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \
> > +  sock_net(sk)->mib.udp_statistics;  \
> > +})
> >  #endif
> >
> > +#define __UDPX_INC_STATS(sk, field) \
> > + __SNMP_INC_STATS(__UDPX_MIB(sk, (sk)->sk_family == AF_INET), field)
> > +
>
> This is broken (complains only if CONFIG_AF_RXRPC is set), will fix in
> next iteration (thanks kbuildbot).
>
> But I'd prefer to keep the above helper: it can be used in a follow-up
> patch to cleanup a bit udp6_recvmsg().
>
> >  #ifdef CONFIG_PROC_FS
> >  struct udp_seq_afinfo {
> >   sa_family_t family;
> > @@ -450,4 +457,32 @@ DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
> >  void udpv6_encap_enable(void);
> >  #endif
> >
> > +static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
> > +   struct sk_buff *skb)
> > +{
> > + bool ipv4 = skb->protocol == htons(ETH_P_IP);
>
> And this cause a compile warning when # CONFIG_IPV6 is not set, I will
> fix in the next iteration (again thanks kbuildbot)

Can also just pass it as argument. This skb->protocol should work correctly
with tunneled packets, but it wasn't as immediately obvious to me.

Also

+   if (unlikely(!segs))
+   goto drop;

this should not happen. But if it could and the caller treats it the
same as error (both now return NULL), then skb needs to be freed.


Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-11-01 Thread Jakub Kicinski
On Wed, 31 Oct 2018 20:48:05 -0600, David Ahern wrote:
> >   spacing with a special character in the format string, that is:
> > 
> > "%S.%Qr.%Qs  %Al:%Pl %Ar:%Pr  %p\n"
> > 
> >   would mean "align everything to the right, distribute remaining
> >   whitespace between %S, %Qr and %Qs". But it looks rather complicated
> >   at a glance.
> >   
> 
> My concern here is that once this goes in for 1 command, the others in
> iproute2 need to follow suit - meaning same syntax style for all
> commands. Given that I'd prefer we get a reasonable consensus on syntax
> that will work across commands -- ss, ip, tc. If it is as simple as
> column names with a fixed order, that is fine but just give proper
> consideration given the impact.

FWIW I just started piping iproute2 commands to jq.  Example:

tc -s -j qdisc show dev em1 | \
jq -r '.[] |  
[.kind,.parent,.handle,.offloaded,.bytes,.packets,.drops,.overlimits,.requeues,.backlog,.qlen,.marked]
 | @tsv'

JSONification would probably be quite an undertaking for ss :(


Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO

2018-11-01 Thread Edward Cree
I've spent a bit more time thinking about / sleeping on this, and I
 still think there's a major disagreement here.  Basically it seems
 like I'm saying "the design of BTF is wrong" and you're saying "but
 it's the design" (with the possible implication — I'm not entirely
 sure — of "but that's what DWARF does").
So let's back away from the details about FUNC/PROTO, and talk in
 more general terms about what a BTF record means.
There are two classes of things we might want to put in debug-info:
 * There exists a type T
 * I have an instance X (variable, subprogram, etc.) of type T
Both of these may need to reference other types, and have the same
 space of possible things T could be, but there the similarity ends:
 they are semantically different things.
Indeed, the only reason for any record of the first class is to
 define types referenced by records of the second class.  Some
 concrete examples of records of the second class are:
1) I have a map named "foo" with key-type T1 and value-type T2
2) I have a subprogram named "bar" with prototype T3
3) I am using stack slot fp-8 to store a value of type T4
4) I am casting ctx+8 to a pointer type T5 before dereferencing it
Currently we have (1) and this patch series adds (2), both done
 through records that look like they are just defining a type (i.e.
 the first class of record) but have 'magic' semantics (in the case
 of (1), special names of the form btf_map_foo.  How anyone
 thought that was a clean and tasteful design is beyond me.)
What IMHO the design *should* be, is that we have a 'types'
 subsection that *only* contains records of the first class, and
 then other subsections to hold records of the second class that
 reference records of the first class by ID.  So for (1) you'd have
 either additional fields in struct bpf_map_def (we've extended that
 several times before, after all), or you'd have a maps table in
 .BTF that links map names ("foo", not "btf_map_foo"!) with type
 IDs for its key and value:
    struct btf_map_record {
        __u32 name_off; /* name of map */
        __u32 key_type_id; /* index in "types" table */
        __u32 value_type_id; /* ditto */
    }
(Note the absence of any meaningless struct type as created by
 BPF_ANNOTATE_KV_PAIR.  That kind of source-level hack should be
 converted by the compiler's BTF output module into something less
 magic, rather than baked into the format definition.)
Then for (2) you'd have a functions table in .BTF that links subprog
 names, start offsets, and signatures/prototypes:
    struct btf_func_record {
    __u32 name_off; /* name of function */
    __u16 subprog_secn; /* section index in which func appears */
    __u16 subprog_start; /* offset in section of func entry point */
    __u32 type_id; /* index in "types" table of func signature */
    }

I believe this is a much cleaner design, which will be easier to extend
 in the future to add things like (3) and (4) for source-line-level
 debug information.  I also believe that if someone had written
 documentation describing the original design, semantics of the various
 BTF records, etc., it would have been immediately obvious that the
 design was needlessly confusing and ad-hoc.
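The proposed split can be modeled in a few lines. This is a toy model only: the record layouts above are Edward's sketch, and the field names below are simplified further for illustration.

```python
# First-class records: "there exists a type T". Only these live in the
# types table; everything else references them by index (type id).
types = []

def add_type(desc):
    types.append(desc)
    return len(types) - 1              # the type id

t_int = add_type({"kind": "int", "bits": 32})
t_foo = add_type({"kind": "struct", "name": "foo",
                  "members": [("i", t_int)]})
t_sig = add_type({"kind": "func_proto", "ret": t_int, "args": [t_foo]})

# Second-class records: "I have an instance X of type T". Each carries
# the instance's own name and points at a type id - so the instance
# name ("foo_map", "bar") and the type name ("foo") never compete for
# the same field, and no magic naming convention is needed.
maps  = [{"name": "foo_map", "key_type_id": t_int, "value_type_id": t_foo}]
funcs = [{"name": "bar", "type_id": t_sig}]

assert types[maps[0]["value_type_id"]]["name"] == "foo"
assert types[funcs[0]["type_id"]]["kind"] == "func_proto"
```

The point of the exercise: the types table is pure ("there exists a type"), while maps/funcs tables hold the instances, which is exactly the separation argued for above.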

On 20/10/18 00:27, Martin Lau wrote:
> Like struct, the member's names of struct is part of the btf_type.
> A struct with the same member's types but different member's names
> is a different btf_type.
Yes, but that's not what I'm talking about.  I'm talking about structs
 with the same member names, but with different names of the structs.
As in the following C snippet:

struct foo {
    int i;
};

int main(void)
{
    struct foo x;
    struct foo y;

    x.i = 0;
    y.i = x.i;

    return y.i;
}

We have one type 'struct foo' (name "foo"), but two _instances_ of
 that type (names "x", "y").  We *cannot* use a single BTF record to
 express both "x" and its type, because its type has a name of its
 own ("foo") and there is only room in struct btf_type for one name.
Thus we must have one record for the instance "x" and another record
 for the type "foo", with the former referencing the latter.

> Having two id spaces for debug-info is confusing.  They are
> all debug-info at the end.
But they have different semantics!  Just because you have a term,
 "debug-info", that's defined to cover both, doesn't mean that they
 are the same thing.  You might as well say that passport numbers and
 telephone numbers should be drawn from the same numbering space,
 because they're both "personal information", and never mind that one
 identifies a person and the other identifies a telephone.
It's having the _same_ id space for entities that are almost, but not
 quite, entirely unlike each other that's confusing.

-Ed


Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-11-01 Thread David Ahern
On 11/1/18 3:06 PM, Jakub Kicinski wrote:
> On Wed, 31 Oct 2018 20:48:05 -0600, David Ahern wrote:
>>>   spacing with a special character in the format string, that is:
>>>
>>> "%S.%Qr.%Qs  %Al:%Pl %Ar:%Pr  %p\n"
>>>
>>>   would mean "align everything to the right, distribute remaining
>>>   whitespace between %S, %Qr and %Qs". But it looks rather complicated
>>>   at a glance.
>>>   
>>
>> My concern here is that once this goes in for 1 command, the others in
>> iproute2 need to follow suit - meaning same syntax style for all
>> commands. Given that I'd prefer we get a reasonable consensus on syntax
>> that will work across commands -- ss, ip, tc. If it is as simple as
>> column names with a fixed order, that is fine but just give proper
>> consideration given the impact.
> 
> FWIW I just started piping iproute2 commands to jq.  Example:
> 
> tc -s -j qdisc show dev em1 | \
>   jq -r '.[] |  
> [.kind,.parent,.handle,.offloaded,.bytes,.packets,.drops,.overlimits,.requeues,.backlog,.qlen,.marked]
>  | @tsv'
> 
> JSONification would probably be quite an undertaking for ss :(
> 

Right, that is used in some of the scripts under
tools/testing/selftests. I would put that in the 'heavyweight solution'
category.

A number of key commands offer the capability to control the output via
command line argument (e.g., ps, perf script). Given the amount of data
iproute2 commands throw at a user by default, it would be a good
usability feature to allow a user to customize the output without having
to pipe it into other commands.


Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-01 Thread Paweł Staszewski




On 01.11.2018 at 21:37, Saeed Mahameed wrote:

On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:

On 01.11.2018 at 10:50, Saeed Mahameed wrote:

On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:

Hi

So maybe someone will be interested in how the linux kernel handles
normal traffic (not pktgen :) )


Server HW configuration:

CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz

NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)


Server software:

FRR - as routing daemon

enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to
the local numa node)

enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to
the local numa node)


Maximum traffic that server can handle:

Bandwidth

bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
\ iface             Rx              Tx              Total
=========================================================
enp175s0f1:     28.51 Gb/s      37.24 Gb/s      65.74 Gb/s
enp175s0f0:     38.07 Gb/s      28.44 Gb/s      66.51 Gb/s
---------------------------------------------------------
total:          66.58 Gb/s      65.67 Gb/s     132.25 Gb/s


Packets per second:

bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
- iface             Rx                  Tx                  Total
=================================================================
enp175s0f1:     5248589.00 P/s      3486617.75 P/s      8735207.00 P/s
enp175s0f0:     3557944.25 P/s      5232516.00 P/s      8790460.00 P/s
-----------------------------------------------------------------
total:          8806533.00 P/s      8719134.00 P/s     17525668.00 P/s


After reaching those limits, the nics on the upstream side (more RX
traffic) start to drop packets


I just don't understand why the server can't handle more bandwidth
(~40Gbit/s is the limit where all cpus are 100% utilized) - while
pps on the RX side keep increasing.


Where do you see 40 Gb/s ? you showed that both ports on the same
NIC (
same pcie link) are doing  66.58 Gb/s (RX) + 65.67 Gb/s (TX) =
132.25
Gb/s which aligns with your pcie link limit, what am i missing ?

hmm yes that was my concern also - cause I can't find anywhere
information on whether that bandwidth figure is unidirectional or
bidirectional - so if 126 Gbit/s for x16 8GT is unidirectional, then
bidirectional would be 126/2 ≈ 63 Gbit/s - which would roughly match
the total bw on both ports

i think it is bidir


This could maybe also explain why cpu load rises rapidly from
120Gbit/s in total to 132Gbit (the bwm-ng counters come from /proc/net -
so there can be some error in reading them when offloading (gro/gso/tso)
is enabled on the nics)


Was thinking that maybe I reached some pcie x16 limit - but x16 8GT
is 126Gbit - and also when testing with pktgen I can reach more bw
and pps (like 4x more compared to normal internet traffic)


Are you forwarding when using pktgen as well or you just testing
the RX
side pps ?

Yes pktgen was tested on single port RX
Can check also forwarding to eliminate pciex limits


So this explains why you have more RX pps, since tx is idle and pcie
will be free to do only rx.

[...]



ethtool -S enp175s0f1
NIC statistics:
rx_packets: 173730800927
rx_bytes: 99827422751332
tx_packets: 142532009512
tx_bytes: 184633045911222
tx_tso_packets: 25989113891
tx_tso_bytes: 132933363384458
tx_tso_inner_packets: 0
tx_tso_inner_bytes: 0
tx_added_vlan_packets: 74630239613
tx_nop: 2029817748
rx_lro_packets: 0
rx_lro_bytes: 0
rx_ecn_mark: 0
rx_removed_vlan_packets: 173730800927
rx_csum_unnecessary: 0
rx_csum_none: 434357
rx_csum_complete: 173730366570
rx_csum_unnecessary_inner: 0
rx_xdp_drop: 0
rx_xdp_redirect: 0
rx_xdp_tx_xmit: 0
rx_xdp_tx_full: 0
rx_xdp_tx_err: 0
rx_xdp_tx_cqe: 0
tx_csum_none: 38260960853
tx_csum_partial: 36369278774
tx_csum_partial_inner: 0
tx_queue_stopped: 1
tx_queue_dropped: 0
tx_xmit_more: 748638099
tx_recover: 0
tx_cqes: 73881645031
tx_queue_wake: 1
tx_udp_seg_rem: 0
tx_cqe_err: 0
tx_xdp_xmit: 0
tx_xdp_full: 0
tx_xdp_err: 0
tx_xdp_cqes: 0
rx_wqe_err: 0
rx_mpwqe_filler_cqes: 0
rx_mpwqe_filler_strides: 0
rx_buff_alloc_err: 0
rx_cqe_compress_blks: 0
rx_cqe_compress_pkts: 0

If this is a pcie bottleneck it might be useful to  enable CQE
compression (to reduce PCIe completion descriptors transactions)
you should see the above rx_cqe_compress_pkts increasing when
enabled.

$ ethtool  --set-priv-flags enp175s0f1 rx_cqe_compress on
$ ethtool --show-priv-flags enp175s0f1
Private flags for p6p1:
rx_cqe_moder