Weird traces 4.20.0-rc3+ / RIP: 0010:fib6_walk_continue+0x37/0xe6
Traces attached below:
[310658.536190] rcu: INFO: rcu_sched self-detected stall on CPU
[310658.536195] rcu: 15-: (322 ticks this GP) idle=fca/1/0x4002 softirq=50617185/50617185 fqs=64
[310658.536195] rcu: (t=15049 jiffies g=84272013 q=4728)
[310658.536200] NMI backtrace for cpu 15
[310658.536203] CPU: 15 PID: 87 Comm: ksoftirqd/15 Tainted: G W 4.20.0-rc3+ #1
[310658.536204] Call Trace:
[310658.536208]
[310658.536214] dump_stack+0x46/0x5c
[310658.536218] nmi_cpu_backtrace+0x72/0x81
[310658.536222] ? irq_force_complete_move+0x65/0x65
[310658.536224] nmi_trigger_cpumask_backtrace+0x4c/0xbf
[310658.536228] rcu_dump_cpu_stacks+0x80/0xaa
[310658.536231] rcu_check_callbacks+0x213/0x500
[310658.536234] ? tick_init_highres+0xe/0xe
[310658.536237] update_process_times+0x23/0x47
[310658.536239] tick_sched_timer+0x102/0x13a
[310658.536242] __hrtimer_run_queues+0x105/0x205
[310658.536244] ? ktime_get_update_offsets_now+0x31/0x8f
[310658.536247] hrtimer_interrupt+0x85/0x177
[310658.536251] smp_apic_timer_interrupt+0x8c/0xff
[310658.536253] apic_timer_interrupt+0xf/0x20
[310658.536254]
[310658.536258] RIP: 0010:fib6_walk_continue+0x37/0xe6
[310658.536260] Code: 02 0f 0b 48 8b 43 18 48 85 c0 0f 84 c5 00 00 00 8b 53 28 83 fa 01 74 1e 72 0c 83 fa 02 74 3c 83 fa 03 74 6b eb e1 48 8b 50 08 <48> 85 d2 75 10 c7 43 28 01 00 00 00 48 8b 50 10 48 85 d2 74 0d 48
[310658.536261] RSP: 0018:c90003583a20 EFLAGS: 0297 ORIG_RAX: ff13
[310658.536262] RAX: 5103b800 RBX: c90003583a58 RCX: 5103b800
[310658.536263] RDX: RSI: 2868c1c0 RDI: 257dc500
[310658.536264] RBP: 820d8f00 R08: c90003583b18 R09:
[310658.536265] R10: 07387eb0 R11: 07387eb0 R12: 820d9980
[310658.536266] R13: 817567ed R14: R15: c90003583b18
[310658.536268] ? call_fib6_entry_notifiers+0x59/0x59
[310658.536272] fib6_walk+0x59/0x76
[310658.536274] fib6_clean_tree+0x52/0x6c
[310658.536276] ? fib6_del+0x1da/0x1da
[310658.536278] ? call_fib6_entry_notifiers+0x59/0x59
[310658.536280] __fib6_clean_all+0x55/0x71
[310658.536282] fib6_run_gc+0x85/0xe6
[310658.536285] ip6_dst_gc+0x74/0xbf
[310658.536288] dst_alloc+0x70/0x84
[310658.536290] ip6_dst_alloc+0x1c/0x59
[310658.536293] icmp6_dst_alloc+0x39/0xd9
[310658.536295] ndisc_send_skb+0x8e/0x274
[310658.536298] ? __kmalloc_reserve.isra.43+0x28/0x6a
[310658.536300] ndisc_send_ns+0x135/0x15e
[310658.536302] ? ndisc_solicit+0xdd/0x106
[310658.536304] ndisc_solicit+0xdd/0x106
[310658.536306] ? lock_timer_base+0x3d/0x61
[310658.536308] ? neigh_table_init+0x1f9/0x1f9
[310658.536310] ? neigh_probe+0x44/0x55
[310658.536312] neigh_probe+0x44/0x55
[310658.536314] neigh_timer_handler+0x192/0x1ca
[310658.536316] call_timer_fn+0x51/0x125
[310658.536319] run_timer_softirq+0x13c/0x172
[310658.536322] ? __switch_to+0x16c/0x3be
[310658.536324] __do_softirq+0xec/0x273
[310658.536329] ? sort_range+0x17/0x17
[310658.536331] run_ksoftirqd+0x13/0x1b
[310658.536334] smpboot_thread_fn+0x123/0x138
[310658.536336] kthread+0xe5/0xea
[310658.536338] ? kthread_destroy_worker+0x39/0x39
[310658.536340] ret_from_fork+0x1f/0x30
[310658.938348] ixgbe :06:00.1 enp6s0f1: initiating reset due to tx timeout
[310660.685477] ixgbe :84:00.0 enp132s0f0: initiating reset due to tx timeout
[310661.484424] ixgbe :04:00.1 enp4s0f1: initiating reset due to tx timeout
[310662.652232] ixgbe :82:00.1 enp130s0f1: initiating reset due to tx timeout
[310663.620879] ixgbe :84:00.1 enp132s0f1: initiating reset due to tx timeout
[310664.605672] ixgbe :06:00.1 enp6s0f1: initiating reset due to tx timeout
[310666.775352] ixgbe :84:00.0 enp132s0f0: initiating reset due to tx timeout
[310667.565003] ixgbe :04:00.1 enp4s0f1: initiating reset due to tx timeout
[310668.349902] ixgbe :82:00.1 enp130s0f1: initiating reset due to tx timeout
[310669.534595] ixgbe :84:00.1 enp132s0f1: initiating reset due to tx timeout
[310670.311528] ixgbe :06:00.1 enp6s0f1: initiating reset due to tx timeout
[310673.682156] ixgbe :04:00.1 enp4s0f1: initiating reset due to tx timeout
[310673.876020] ixgbe :82:00.1 enp130s0f1: initiating reset due to tx timeout
[310674.264810] ixgbe :06:00.0 enp6s0f0: initiating reset due to tx timeout
[310675.436519] ixgbe :84:00.1 enp132s0f1: initiating reset due to tx timeout
[310679.393312] ixgbe :82:00.1 enp130s0f1: initiating reset due to tx timeout
[310680.777320] ixgbe :84:00.1 enp132s0f1: initiating reset due to tx timeout
[310684.782413] ixgbe :82:00.1 enp130s0f1: initiating reset due to tx timeout
[310685.176739] ixgbe :04:00.1 enp4s0f1: initiating reset due to tx timeout
[310686.561306] ixgbe :84:00.1 enp132s0f1: initiating reset due to tx timeout
[310690.345435] ixgbe :82:00.1 enp130s0f1: initiating reset due to tx
Re: [Patch net] net: invert the check of detecting hardware RX checksum fault
On 16.11.2018 at 21:06, Cong Wang wrote:
On Thu, Nov 15, 2018 at 8:50 PM Herbert Xu wrote:
On Thu, Nov 15, 2018 at 06:23:38PM -0800, Cong Wang wrote:
Normally if the hardware's partial checksum is valid then we just trust it and send the packet along. However, if the partial checksum is invalid we don't trust it and we will compute the whole checksum manually, which is what ends up in sum.
Not sure if I understand partial checksum here, but it is the CHECKSUM_COMPLETE case which I am trying to fix, not CHECKSUM_PARTIAL.
What I meant by partial checksum is the checksum produced by the hardware on RX. In the kernel we call that CHECKSUM_COMPLETE. CHECKSUM_PARTIAL is the absence of the substantial part of the checksum, which is something we use in the kernel primarily for TX. Yes, the names are confusing :)
Yeah, understood. The hardware provides skb->csum in this case, but we keep adjusting it each time we change skb->data. So, in other words, a checksum *match* is what is intended to detect this HW RX checksum fault?
Correct. Or more likely it's a bug in either the driver or, if there is overlaying code such as VLAN, then in that code. Basically, if the RX checksum is buggy, it's much more likely to cause a valid packet to be rejected than to cause an invalid packet to be accepted, because we still verify that checksum against the pseudoheader. So we only attempt to catch buggy hardware/drivers by doing a second manual verification for the case where the packet is flagged as invalid.
Hmm, now I see how it works. Actually it uses the difference between these two checks, i.e. the difference between the hardware checksum and skb_checksum(). I will send a patch to add a comment there to avoid confusion.
Sure, my case is nearly the same as Pawel's, except I have no vlan: https://marc.info/?l=linux-netdev&m=154086647601721&w=2
Can you please provide your backtrace?
I already did: https://marc.info/?l=linux-netdev&m=154092211305599&w=2
Note, the offending commit has been backported to 4.14, which is why I saw this warning. I have no idea why it was backported in the first place; it is just an optimization and doesn't fix any bug, IMHO. Also, it is much harder for me to reproduce than for Pawel, who saw the warning every second. Sometimes I need 1 hour to trigger it; sometimes other people here need 10+ hours to trigger it.
By the way - I changed the network controller for the vlans where I was receiving rx csum failures to an 82599 with the ixgbe driver; with mellanox:
[91584.359273] vlan980: hw csum failure
[91584.359278] CPU: 54 PID: 0 Comm: swapper/54 Not tainted 4.20.0-rc1+ #2
[91584.359279] Call Trace:
[91584.359282]
[91584.359290] dump_stack+0x46/0x5b
[91584.359296] __skb_checksum_complete+0x9b/0xb0
[91584.359301] icmp_rcv+0x51/0x1f0
[91584.359305] ip_local_deliver_finish+0x49/0xd0
[91584.359307] ip_local_deliver+0xb7/0xe0
[91584.359309] ? ip_sublist_rcv_finish+0x50/0x50
[91584.359310] ip_rcv+0x96/0xc0
[91584.359313] __netif_receive_skb_one_core+0x4b/0x70
[91584.359315] netif_receive_skb_internal+0x2f/0xc0
[91584.359316] napi_gro_receive+0xb0/0xd0
[91584.359320] mlx5e_handle_rx_cqe+0x78/0xd0
[91584.359321] mlx5e_poll_rx_cq+0xc4/0x970
[91584.359323] mlx5e_napi_poll+0xab/0xcb0
[91584.359325] net_rx_action+0xd9/0x300
[91584.359328] __do_softirq+0xd3/0x2d9
[91584.359333] irq_exit+0x7a/0x80
[91584.359334] do_IRQ+0x72/0xc0
[91584.359336] common_interrupt+0xf/0xf
[91584.359337]
[91584.359340] RIP: 0010:mwait_idle+0x74/0x1b0
[91584.359342] Code: ae f0 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 48 c1 e8 03 83 e0 01 0f 85 26 01 00 00 48 89 c1 fb 0f 01 c9 <65> 8b 2d 95 8e 6b 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0
[91584.359343] RSP: 0018:c900034f3ec0 EFLAGS: 0246 ORIG_RAX: ffde
[91584.359344] RAX: RBX: 0036 RCX:
[91584.359345] RDX: RSI: RDI:
[91584.359346] RBP: 0036 R08: R09:
[91584.359346] R10: 0001008b49bb R11: 0c00 R12:
[91584.359347] R13: R14: R15:
[91584.359352] do_idle+0x19f/0x1c0
[91584.359354] ? do_idle+0x4/0x1c0
[91584.359355] cpu_startup_entry+0x14/0x20
[91584.359360] start_secondary+0x165/0x190
[91584.359364] secondary_startup_64+0xa4/0xb0
With intel, no errors.
Let me see if I can add a vlan on my side to make it more reproducible; it seems hard, as our switch doesn't use vlan either. We have warnings with conntrack involved too; I can provide those as well if you are interested. I tend to revert it for -stable; at least that is what I plan to do on my side unless there is a fix coming soon.
Thanks.
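The arithmetic behind the discussion above can be illustrated with a small userspace sketch (ones' complement sums only; this is not the kernel's __skb_checksum_complete() code): the NIC reports a ones' complement sum over the packet (CHECKSUM_COMPLETE), and when the software recomputation over the same bytes disagrees with what the hardware delivered, the "hw csum failure" warning is the right outcome.

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* 16-bit ones' complement sum, the same arithmetic the Internet checksum uses. */
static uint32_t csum_add(uint32_t sum, const uint8_t *buf, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}

int main(void)
{
	uint8_t payload[] = { 0x08, 0x00, 0x42, 0x00, 0x12, 0x34, 0x00, 0x01 };

	/* Sum as a NIC doing CHECKSUM_COMPLETE would hand it over in skb->csum. */
	uint32_t hw_csum = csum_add(0, payload, sizeof(payload));
	/* Software recomputation over the same bytes (what skb_checksum() does). */
	uint32_t sw_csum = csum_add(0, payload, sizeof(payload));
	/* Simulate a buggy NIC/driver by corrupting the reported sum. */
	uint32_t buggy_hw_csum = hw_csum ^ 0x0100;

	printf("hw=%04x sw=%04x -> %s\n", (unsigned)hw_csum, (unsigned)sw_csum,
	       hw_csum == sw_csum ? "ok" : "hw csum failure");
	printf("hw=%04x sw=%04x -> %s\n", (unsigned)buggy_hw_csum, (unsigned)sw_csum,
	       buggy_hw_csum == sw_csum ? "ok" : "hw csum failure");
	return 0;
}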
Re: consistency for statistics with XDP mode
On 21.11.2018 at 22:14, Toke Høiland-Jørgensen wrote:
David Ahern writes:
Paweł ran some more XDP tests yesterday and from them found a couple of issues. One is a panic in the mlx5 driver unloading the bpf program (mlx5e_xdp_xmit); he will send a separate email for that problem.
Same as this one, I guess? https://marc.info/?l=linux-netdev&m=153855905619717&w=2
Yes, same as this one. When there is no traffic (for example with the xdp_fwd program loaded), or there is not much traffic, like 1k frames per second of icmp, I can load/unload without crashing the kernel. But when I push tests with pktgen and use more than 50k pps of udp, then unbinding the xdp_fwd program makes the kernel panic :)
The problem I wanted to discuss here is statistics for XDP context. The short of it is that we need consistency in the counters across NIC drivers and virtual devices. Right now stats are specific to a driver with no clear accounting for the packets and bytes handled in XDP.
For example virtio has some stats as device private data extracted via ethtool:
$ ethtool -S eth2 | grep xdp
...
rx_queue_3_xdp_packets: 5291
rx_queue_3_xdp_tx: 0
rx_queue_3_xdp_redirects: 5163
rx_queue_3_xdp_drops: 0
...
tx_queue_3_xdp_tx: 5163
tx_queue_3_xdp_tx_drops: 0
And the standard counters appear to track bytes and packets for Rx, but not Tx if the packet is forwarded in XDP.
Similarly, mlx5 has some counters (thanks to Jesper and Toke for helping out here):
$ ethtool -S mlx5p1 | grep xdp
rx_xdp_drop: 86468350180
rx_xdp_redirect: 18860584
rx_xdp_tx_xmit: 0
rx_xdp_tx_full: 0
rx_xdp_tx_err: 0
rx_xdp_tx_cqe: 0
tx_xdp_xmit: 0
tx_xdp_full: 0
tx_xdp_err: 0
tx_xdp_cqes: 0
...
rx3_xdp_drop: 86468350180
rx3_xdp_redirect: 18860556
rx3_xdp_tx_xmit: 0
rx3_xdp_tx_full: 0
rx3_xdp_tx_err: 0
rx3_xdp_tx_cqes: 0
...
tx0_xdp_xmit: 0
tx0_xdp_full: 0
tx0_xdp_err: 0
tx0_xdp_cqes: 0
...
And no accounting in the standard stats for packets handled in XDP.
And then, if I understand Jesper's data correctly, the i40e driver does not have device specific data:
$ ethtool -S i40e1 | grep xdp
[NOTHING]
But rather bumps the standard counters:
sudo ./xdp_rxq_info --dev i40e1 --action XDP_DROP
Running XDP on dev:i40e1 (ifindex:3) action:XDP_DROP options:no_touch
XDP stats        CPU      pps          issue-pps
XDP-RX CPU       1        36,156,872   0
XDP-RX CPU       total    36,156,872
RXQ stats        RXQ:CPU  pps          issue-pps
rx_queue_index1:1         36,156,878   0
rx_queue_index1:sum       36,156,878
$ ethtool_stats.pl --dev i40e1
Show adapter(s) (i40e1) statistics (ONLY that changed!)
Ethtool(i40e1 ) stat: 2711292859 ( 2,711,292,859) <= port.rx_bytes /sec
Ethtool(i40e1 ) stat: 6274204 ( 6,274,204) <= port.rx_dropped /sec
Ethtool(i40e1 ) stat: 42363867 ( 42,363,867) <= port.rx_size_64 /sec
Ethtool(i40e1 ) stat: 42363950 ( 42,363,950) <= port.rx_unicast /sec
Ethtool(i40e1 ) stat: 2165051990 ( 2,165,051,990) <= rx-1.bytes /sec
Ethtool(i40e1 ) stat: 36084200 ( 36,084,200) <= rx-1.packets /sec
Ethtool(i40e1 ) stat: 5385 ( 5,385) <= rx_dropped /sec
Ethtool(i40e1 ) stat: 36089727 ( 36,089,727) <= rx_unicast /sec
We really need consistency in the counters and, at a minimum, users should be able to track packet and byte counters for both Rx and Tx, including XDP.
It seems to me the Rx and Tx packet, byte and dropped counters returned for the standard device stats (/proc/net/dev, ip -s li show, ...) should include all packets managed by the driver regardless of whether they are forwarded / dropped in XDP or go up the Linux stack. This also aligns with mlxsw and the stats it shows, which are packets handled by the hardware.
From there the private stats can include XDP specifics as desired -- like the drops and redirects -- but those should be add-ons, and even here some consistency makes life easier for users. The same standards should also be applied to virtual devices built on top of the ports -- e.g., vlans. I have an API now that allows bumping stats for vlan devices. Keeping the basic xdp packets in the standard counters allows Paweł, for example, to continue to monitor /proc/net/dev.
Can we get agreement on this? And from there, get updates to the mlx5 and virtio drivers?
I'd say it sounds reasonable to include XDP in the normal traffic counters, but having the detailed XDP-specific counters is quite useful as well... So can't we do both (for all drivers)?
-Toke
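To make the accounting question concrete, the usual driver-independent pattern for XDP-level counters is a per-CPU map that the program bumps on every frame; a generic sketch using libbpf conventions, not the code from David's branch and not what any driver exposes today:

/* xdp_stats_kern.c - sketch: count packets/bytes seen by an XDP program */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct datarec {
	__u64 packets;
	__u64 bytes;
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct datarec);
} xdp_stats_map SEC(".maps");

SEC("xdp")
int xdp_stats(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	__u32 key = 0;
	struct datarec *rec;

	rec = bpf_map_lookup_elem(&xdp_stats_map, &key);
	if (rec) {
		/* No locking needed: the map is per-CPU. */
		rec->packets++;
		rec->bytes += (__u64)(data_end - data);
	}
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

The open question in this thread is whether the equivalent packet and byte counts should also land in the standard netdev counters, rather than live only in program-private maps like this one.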
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 19.11.2018 at 22:59, David Ahern wrote:
On 11/9/18 5:06 PM, David Ahern wrote:
On 11/9/18 9:21 AM, David Ahern wrote:
Is it possible to add only counters from xdp for vlans? This will help me in testing.
I will take a look today at adding counters that you can dump using bpftool. It will be a temporary solution for this xdp program only.
Same tree, kernel-tables-wip-02 branch. Compile kernel and install. Compile samples as before.
new version: https://github.com/dsahern/linux.git bpf/kernel-tables-wip-03
This one prototypes incrementing counters for VLAN devices (rx/tx, packets and bytes). Counters for netdevices representing physical ports should be managed by the NIC driver.
Will test it today. Thanks
Paweł
I will look at what can be done for packet captures (e.g., xdpdump and https://github.com/facebookincubator/katran/tree/master/tools). Most likely a project for next week.
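For completeness, counters kept in a pinned per-CPU BPF map can also be read by a small C program instead of bpftool; a sketch assuming a map pinned at /sys/fs/bpf/xdp_stats_map with the packets/bytes record from the earlier example (the path and struct names are illustrative, not from the kernel-tables branch):

/* xdp_stats_user.c - sketch: sum a per-CPU stats map across CPUs */
#include <stdio.h>
#include <stdlib.h>
#include <linux/types.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

struct datarec {
	__u64 packets;
	__u64 bytes;
};

int main(void)
{
	int ncpus = libbpf_num_possible_cpus();
	int fd = bpf_obj_get("/sys/fs/bpf/xdp_stats_map"); /* assumed pin path */
	struct datarec *vals;
	__u64 pkts = 0, bytes = 0;
	__u32 key = 0;
	int i;

	if (fd < 0 || ncpus < 1) {
		perror("bpf_obj_get");
		return 1;
	}
	vals = calloc(ncpus, sizeof(*vals));
	if (!vals)
		return 1;
	/* A PERCPU_ARRAY lookup returns one value per possible CPU. */
	if (bpf_map_lookup_elem(fd, &key, vals)) {
		perror("bpf_map_lookup_elem");
		return 1;
	}
	for (i = 0; i < ncpus; i++) {
		pkts += vals[i].packets;
		bytes += vals[i].bytes;
	}
	printf("packets: %llu bytes: %llu\n",
	       (unsigned long long)pkts, (unsigned long long)bytes);
	free(vals);
	return 0;
}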
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 11.11.2018 at 09:56, Jesper Dangaard Brouer wrote:
On Sat, 10 Nov 2018 22:53:53 +0100 Paweł Staszewski wrote:
Now im messing with ring configuration for connectx5 nics. And after reading that paper: https://netdevconf.org/2.1/slides/apr6/network-performance/04-amir-RX_and_TX_bulking_v2.pdf
Do notice that some of the ideas in that slide deck were never implemented. But they are still on my todo list ;-).
Notice how it shows that TX bulking is very important, but based on your ethtool_stats.pl I can see that not much TX bulking is happening in your case. This is indicated via the xmit_more counters:
Ethtool(enp175s0) stat: 2630 ( 2,630) <= tx_xmit_more /sec
Ethtool(enp175s0) stat: 4956995 ( 4,956,995) <= tx_packets /sec
And the per queue levels are also available:
Ethtool(enp175s0) stat: 184845 ( 184,845) <= tx7_packets /sec
Ethtool(enp175s0) stat: 78 ( 78) <= tx7_xmit_more /sec
This means that you are doing too many doorbells to the NIC hardware at TX time, which I worry could be what causes the NIC and PCIe hardware not to operate at optimal speeds.
After tuning coal/ring a little with ethtool, reached today:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
iface          Rx           Tx           Total
==
enp175s0:      50.68 Gb/s   21.53 Gb/s   72.20 Gb/s
enp216s0:      21.62 Gb/s   50.81 Gb/s   72.42 Gb/s
--
total:         72.30 Gb/s   72.33 Gb/s  144.63 Gb/s
And still no packet loss (icmp side-to-side test every 100ms).
Below perf top:
PerfTop: 104692 irqs/sec kernel:99.5% exact: 0.0% [4000Hz cycles], (all, 56 CPUs)
---
 9.06% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear
 6.43% [kernel] [k] tasklet_action_common.isra.21
 5.68% [kernel] [k] fib_table_lookup
 4.89% [kernel] [k] irq_entries_start
 4.53% [kernel] [k] mlx5_eq_int
 4.10% [kernel] [k] build_skb
 3.39% [kernel] [k] mlx5e_poll_tx_cq
 3.38% [kernel] [k] mlx5e_sq_xmit
 2.73% [kernel] [k] mlx5e_poll_rx_cq
 2.18% [kernel] [k] __dev_queue_xmit
 2.13% [kernel] [k] vlan_do_receive
 2.12% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq
 2.00% [kernel] [k] ip_finish_output2
 1.87% [kernel] [k] mlx5e_post_rx_mpwqes
 1.86% [kernel] [k] memcpy_erms
 1.85% [kernel] [k] ipt_do_table
 1.70% [kernel] [k] dev_gro_receive
 1.39% [kernel] [k] __netif_receive_skb_core
 1.31% [kernel] [k] inet_gro_receive
 1.21% [kernel] [k] ip_route_input_rcu
 1.21% [kernel] [k] tcp_gro_receive
 1.13% [kernel] [k] _raw_spin_lock
 1.08% [kernel] [k] __build_skb
 1.06% [kernel] [k] kmem_cache_free_bulk
 1.05% [kernel] [k] __softirqentry_text_start
 1.03% [kernel] [k] vlan_dev_hard_start_xmit
 0.98% [kernel] [k] pfifo_fast_dequeue
 0.95% [kernel] [k] mlx5e_xmit
 0.95% [kernel] [k] page_frag_free
 0.88% [kernel] [k] ip_forward
 0.81% [kernel] [k] dev_hard_start_xmit
 0.78% [kernel] [k] rcu_irq_exit
 0.77% [kernel] [k] netif_skb_features
 0.72% [kernel] [k] napi_complete_done
 0.72% [kernel] [k] kmem_cache_alloc
 0.68% [kernel] [k] validate_xmit_skb.isra.142
 0.66% [kernel] [k] ip_rcv_core.isra.20.constprop.25
 0.58% [kernel] [k] swiotlb_map_page
 0.57% [kernel] [k] __qdisc_run
 0.56% [kernel] [k] tasklet_action
 0.54% [kernel] [k] __get_xps_queue_idx
 0.54% [kernel] [k] inet_lookup_ifaddr_rcu
 0.50% [kernel] [k] tcp4_gro_receive
 0.49% [kernel] [k] skb_release_data
 0.47% [kernel] [k] eth_type_trans
 0.40% [kernel] [k] sch_direct_xmit
 0.40% [kernel] [k] net_rx_action
 0.39% [kernel] [k] __local_bh_enable_ip
And perf record/report: https://ufile.io/zguq0
So now I know what was causing cpu load for some processes like:
 2913 root 20 0 0 0 0 I 10.3 0.0 6:58.29 kworker/u112:1-
    7 root 20 0 0 0 0 I  8.6 0.0 6:17.18 kworker/u112:0-
10289 root 20 0 0 0 0 I  6.6 0.0 6:33.90 kworker/u112:4-
 2939 root 20 0 0 0 0 R  3.6 0.0 7:37.68 kworker/u112:2-
After disabling adaptive tx
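The doorbell point can be illustrated with a toy model (plain userspace C, not mlx5e code): a driver posts descriptors and only writes the doorbell when the stack signals that nothing more is queued behind the current packet, so roughly 2.6k tx_xmit_more/sec against ~5M tx_packets/sec means close to one MMIO doorbell write per packet.

#include <stdbool.h>
#include <stdio.h>

static unsigned long doorbells;

/* Model of a driver xmit routine: post the descriptor, and only ring the
 * doorbell when the stack says nothing more is coming (xmit_more == false). */
static void xmit_one(bool xmit_more)
{
	/* ...post TX descriptor to the ring here... */
	if (!xmit_more)
		doorbells++;	/* one MMIO write telling HW to fetch new work */
}

static void send_burst(unsigned int pkts, unsigned int batch)
{
	unsigned int i;

	doorbells = 0;
	for (i = 0; i < pkts; i++)
		xmit_one((i + 1) % batch != 0 && i + 1 != pkts);
	printf("batch=%-3u packets=%u doorbells=%lu\n", batch, pkts, doorbells);
}

int main(void)
{
	send_burst(4956995, 1);		/* no bulking: ~1 doorbell per packet */
	send_burst(4956995, 32);	/* xmit_more batching: ~1 doorbell per 32 packets */
	return 0;
}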
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 11.11.2018 o 09:03, Jesper Dangaard Brouer pisze: On Sat, 10 Nov 2018 23:19:50 +0100 Paweł Staszewski wrote: W dniu 10.11.2018 o 23:06, Jesper Dangaard Brouer pisze: On Sat, 10 Nov 2018 20:56:02 +0100 Paweł Staszewski wrote: W dniu 10.11.2018 o 20:49, Paweł Staszewski pisze: W dniu 10.11.2018 o 20:34, Jesper Dangaard Brouer pisze: On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski wrote: W dniu 08.11.2018 o 20:12, Paweł Staszewski pisze: [...] Do notice, the per CPU squeeze is not too large. Yes - but im searching invisible thing now :) something invisible is slowing down packet processing :) So trying to find any counter that have something to do with packet processing. NOTICE, I have given you the counters you need (below) Yes noticed this :) [...] Remember those tests are now on two separate connectx5 connected to two separate pcie x16 gen 3.0 That is strange... I still suspect some HW NIC issue, can you provide ethtool stats info via tool: https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl $ ethtool_stats.pl --dev enp175s0 --dev enp216s0 The tool remove zero-stats counters and report per sec stats. It makes it easier to spot that is relevant for the given workload. yes mlnx have just too many counters that are always 0 for my case :) Will try this also But still alot of non 0 counters Show adapter(s) (enp175s0 enp216s0) statistics (ONLY that changed!) Ethtool(enp175s0) stat: 8891 ( 8,891) <= ch0_arm /sec [...] I have copied the stats over in another document so I can better looks at it... and I've found some interesting stats. E.g. we can see that the NIC hardware is dropping packets. RX-drops on enp175s0: (enp175s0) stat: 4850734036 ( 4,850,734,036) <= rx_bytes /sec (enp175s0) stat: 5069043007 ( 5,069,043,007) <= rx_bytes_phy /sec -218308971 ( -218,308,971) Dropped bytes /sec (enp175s0) stat: 139602 ( 139,602) <= rx_discards_phy /sec (enp175s0) stat: 3717148 ( 3,717,148) <= rx_packets /sec (enp175s0) stat: 3862420 ( 3,862,420) <= rx_packets_phy /sec -145272 ( -145,272) Dropped packets /sec RX-drops on enp216s0 is less: (enp216s0) stat: 2592286809 ( 2,592,286,809) <= rx_bytes /sec (enp216s0) stat: 2633575771 ( 2,633,575,771) <= rx_bytes_phy /sec -41288962 ( -41,288,962) Dropped bytes /sec (enp216s0) stat: 464 (464) <= rx_discards_phy /sec (enp216s0) stat: 4971677 ( 4,971,677) <= rx_packets /sec (enp216s0) stat: 4975563 ( 4,975,563) <= rx_packets_phy /sec -3886 (-3,886) Dropped packets /sec I would recommend, that you use ethtool stats and monitor rx_discards_phy. The PHY are the counters from the hardware, and it shows that packets are getting dropped at HW level. This can be because software is not fast enough to empty RX-queue, but in this case where CPUs are mostly idle I don't think that is the case. That is why i was searching some counter for software - where is something wrong. 
Cause in earlier reports from ethtool there was also phy drops reported - just when cpu's was saturated that was normal for me that phy can drop packets if no more cpu cycles available to pickup them from hw But in case where i have 50% idle cpu's - there should be no problem - that is why i start to modify ethtool params for tx/rx ring and coalescence Currently waiting for more traffic with new ethtool settings: ethtool -g enp175s0 Ring parameters for enp175s0: Pre-set maximums: RX: 8192 RX Mini: 0 RX Jumbo: 0 TX: 8192 Current hardware settings: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 128 ethtool -c enp175s0 Coalesce parameters for enp175s0: Adaptive RX: off TX: on stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 dmac: 32517 rx-usecs: 64 rx-frames: 128 rx-usecs-irq: 0 rx-frames-irq: 0 tx-usecs: 8 tx-frames: 128 tx-usecs-irq: 0 tx-frames-irq: 0 rx-usecs-low: 0 rx-frame-low: 0 tx-usecs-low: 0 tx-frame-low: 0 rx-usecs-high: 0 rx-frame-high: 0 tx-usecs-high: 0 tx-frame-high: 0 Both ports same settings. Current traffic: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate | iface Rx Tx Total == enp175s0: 37.85 Gb/s 7.77 Gb/s 45.62 Gb/s enp216s0: 7.80 Gb/s 37.90 Gb/s 45.70 Gb/s -- total: 45.61 Gb/s 45.63 Gb/s 91.24 Gb/s and mpstat for cpu's Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: all
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 10.11.2018 o 23:06, Jesper Dangaard Brouer pisze: On Sat, 10 Nov 2018 20:56:02 +0100 Paweł Staszewski wrote: W dniu 10.11.2018 o 20:49, Paweł Staszewski pisze: W dniu 10.11.2018 o 20:34, Jesper Dangaard Brouer pisze: On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski wrote: W dniu 08.11.2018 o 20:12, Paweł Staszewski pisze: CPU load is lower than for connectx4 - but it looks like bandwidth limit is the same :) But also after reaching 60Gbit/60Gbit bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total === enp175s0: 45.09 Gb/s 15.09 Gb/s 60.18 Gb/s enp216s0: 15.14 Gb/s 45.19 Gb/s 60.33 Gb/s --- total: 60.45 Gb/s 60.48 Gb/s 120.93 Gb/s Today reached 65/65Gbit/s But starting from 60Gbit/s RX / 60Gbit TX nics start to drop packets (with 50%CPU on all 28cores) - so still there is cpu power to use :). This is weird! How do you see / measure these drops? Simple icmp test like ping -i 0.1 And im testing by icmp management ip address on vlan that is attacked to one NIC (the side that is more stressed with RX) And another icmp test is forward thru this router - host behind it Both measurements shows same loss ratio from 0.1 to 0.5% after reaching ~45Gbit/s RX side - depends how much RX side is pushed drops vary between 0.1 to 0.5 - even 0.6%:) Okay good to know, you use an external measurement for this. I do think packets are getting dropped by the NIC. So checked other stats. softnet_stats shows average 1k squeezed per sec: Is below output the raw counters? not per sec? It would be valuable to see the per sec stats instead... I use this tool: https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl CPU total/sec dropped/sec squeezed/sec collision/sec rx_rps/sec flow_limit/sec CPU:00 0 0 0 0 0 0 [...] CPU:13 0 0 0 0 0 0 CPU:14 485538 0 43 0 0 0 CPU:15 474794 0 51 0 0 0 CPU:16 449322 0 41 0 0 0 CPU:17 476420 0 46 0 0 0 CPU:18 440436 0 38 0 0 0 CPU:19 501499 0 49 0 0 0 CPU:20 459468 0 49 0 0 0 CPU:21 438928 0 47 0 0 0 CPU:22 468983 0 40 0 0 0 CPU:23 446253 0 47 0 0 0 CPU:24 451909 0 46 0 0 0 CPU:25 479373 0 55 0 0 0 CPU:26 467848 0 49 0 0 0 CPU:27 453153 0 51 0 0 0 CPU:28 0 0 0 0 0 0 [...] CPU:40 0 0 0 0 0 0 CPU:41 0 0 0 0 0 0 CPU:42 466853 0 43 0 0 0 CPU:43 453059 0 54 0 0 0 CPU:44 363219 0 34 0 0 0 CPU:45 353632 0 38 0 0 0 CPU:46 371618 0 40 0 0 0 CPU:47 350518 0 46 0 0 0 CPU:48 397544 0 40 0 0 0 CPU:49 364873 0 38 0 0 0 CPU:50 383630 0 38 0 0 0 CPU:51 358771 0 39 0 0 0 CPU:52 372547 0 38 0 0 0 CPU:53 372882 0 36 0 0 0 CPU:54 366244 0 43 0
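The per-second view that softnet_stat.pl produces can also be had from a few lines of C; a rough sketch, assuming the 4.x /proc/net/softnet_stat layout where the first three hex columns are packets processed, dropped, and time_squeeze:

/* softnet_delta.c - print per-CPU processed/dropped/squeezed per second */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_CPUS 256

struct row { unsigned long processed, dropped, squeezed; };

static int read_rows(struct row *rows)
{
	FILE *f = fopen("/proc/net/softnet_stat", "r");
	char line[512];
	int n = 0;

	if (!f)
		return -1;
	while (n < MAX_CPUS && fgets(line, sizeof(line), f))
		if (sscanf(line, "%lx %lx %lx", &rows[n].processed,
			   &rows[n].dropped, &rows[n].squeezed) == 3)
			n++;
	fclose(f);
	return n;
}

int main(void)
{
	struct row prev[MAX_CPUS], cur[MAX_CPUS];
	int n = read_rows(prev), i;

	if (n <= 0)
		return 1;
	for (;;) {
		sleep(1);
		if (read_rows(cur) != n)
			return 1;
		for (i = 0; i < n; i++)
			if (cur[i].processed != prev[i].processed ||
			    cur[i].squeezed != prev[i].squeezed)
				printf("CPU:%02d total/s %lu dropped/s %lu squeezed/s %lu\n",
				       i, cur[i].processed - prev[i].processed,
				       cur[i].dropped - prev[i].dropped,
				       cur[i].squeezed - prev[i].squeezed);
		memcpy(prev, cur, sizeof(prev));
		puts("");
	}
}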
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 10.11.2018 o 22:53, Paweł Staszewski pisze: W dniu 10.11.2018 o 22:01, Jesper Dangaard Brouer pisze: On Sat, 10 Nov 2018 21:02:10 +0100 Paweł Staszewski wrote: W dniu 10.11.2018 o 20:34, Jesper Dangaard Brouer pisze: I want you to experiment with: ethtool --set-priv-flags DEVICE rx_striding_rq off just checked that previously connectx4 was have thos disabled: ethtool --show-priv-flags enp175s0f0 Private flags for enp175s0f0: rx_cqe_moder : on tx_cqe_moder : off rx_cqe_compress : off rx_striding_rq : off rx_no_csum_complete: off The CX4 hardware does not have this feature (p.s. the CX4-Lx does). So now we are on connectx5 and we have enabled - for sure connectx5 changed cpu load - where i have now max 50/60% cpu where with connectx4 there was sometimes near 100% with same configuration. I (strongly) believe the CPU load was related to the page-alloactor lock congestion, that Aaron fixed. Yes i think both - most problems with cpu was due to page-allocator problems. But also after change connctx4 to connectx5 there is cpu load difference - about 10% in total - but yes most of this like 40% is cause of Aaron patch :) - rly good job :) Now im messing with ring configuration for connectx5 nics. And after reading that paper: https://netdevconf.org/2.1/slides/apr6/network-performance/ 04-amir-RX_and_TX_bulking_v2.pdf changed from RX:8192 / TX: 4096 to RX:8192 / TX: 256 after this i gain about 5Gbit/s RX and TX traffic and less cpu load before change there was 59/59 Gbit/s After change there is 64/64 Gbit/s bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate | iface Rx Tx Total == enp175s0: 44.45 Gb/s 19.69 Gb/s 64.14 Gb/s enp216s0: 19.69 Gb/s 44.49 Gb/s 64.19 Gb/s -- total: 64.14 Gb/s 64.18 Gb/s 128.33 Gb/s Also after this change kernel freed some memory... like 500MB Still squeezed but less with more traffic... CPU total/sec dropped/sec squeezed/sec collision/sec rx_rps/sec flow_limit/sec CPU:00 0 0 0 0 0 0 CPU:01 0 0 0 0 0 0 CPU:02 0 0 0 0 0 0 CPU:03 0 0 0 0 0 0 CPU:04 0 0 0 0 0 0 CPU:05 0 0 0 0 0 0 CPU:06 0 0 0 0 0 0 CPU:07 0 0 0 0 0 0 CPU:08 0 0 0 0 0 0 CPU:09 0 0 0 0 0 0 CPU:10 0 0 0 0 0 0 CPU:11 0 0 0 0 0 0 CPU:12 0 0 0 0 0 0 CPU:13 0 0 0 0 0 0 CPU:14 389270 0 41 0 0 0 CPU:15 375543 0 32 0 0 0 CPU:16 385847 0 22 0 0 0 CPU:17 412293 0 34 0 0 0 CPU:18 401287 0 30 0 0 0 CPU:19 368345 0 30 0 0 0 CPU:20 395452 0 28 0 0 0 CPU:21 374032 0 38 0 0 0 CPU:22 342036 0 32 0 0 0 CPU:23 374773 0 34 0 0 0 CPU:24 356139 0 31 0 0 0 CPU:25 392725 0 32 0 0 0 CPU:26 385937 0 37 0 0 0 CPU:27 385282 0 37 0 0 0 CPU:28 0 0 0 0 0 0 CPU:29
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 10.11.2018 at 22:01, Jesper Dangaard Brouer wrote:
On Sat, 10 Nov 2018 21:02:10 +0100 Paweł Staszewski wrote:
On 10.11.2018 at 20:34, Jesper Dangaard Brouer wrote:
I want you to experiment with: ethtool --set-priv-flags DEVICE rx_striding_rq off
just checked that the previous connectx4 had those disabled:
ethtool --show-priv-flags enp175s0f0
Private flags for enp175s0f0:
rx_cqe_moder       : on
tx_cqe_moder       : off
rx_cqe_compress    : off
rx_striding_rq     : off
rx_no_csum_complete: off
The CX4 hardware does not have this feature (p.s. the CX4-Lx does).
So now we are on connectx5 and we have it enabled - for sure connectx5 changed cpu load - where I now have max 50/60% cpu, where with connectx4 there was sometimes near 100% with the same configuration.
I (strongly) believe the CPU load was related to the page-allocator lock congestion, that Aaron fixed.
Yes, I think both - most problems with cpu were due to page-allocator problems. But also after the change from connectx4 to connectx5 there is a cpu load difference - about 10% in total - but yes, most of this, like 40%, is because of Aaron's patch :) - really good job :)
Now im messing with ring configuration for connectx5 nics. And after reading that paper: https://netdevconf.org/2.1/slides/apr6/network-performance/04-amir-RX_and_TX_bulking_v2.pdf
changed from RX:8192 / TX:4096 to RX:8192 / TX:256
after this i gain about 5Gbit/s RX and TX traffic and less cpu load
before the change there was 59/59 Gbit/s
After the change there is 64/64 Gbit/s
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
iface          Rx           Tx           Total
==
enp175s0:      44.45 Gb/s   19.69 Gb/s   64.14 Gb/s
enp216s0:      19.69 Gb/s   44.49 Gb/s   64.19 Gb/s
--
total:         64.14 Gb/s   64.18 Gb/s  128.33 Gb/s
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 10.11.2018 at 20:34, Jesper Dangaard Brouer wrote:
I want you to experiment with: ethtool --set-priv-flags DEVICE rx_striding_rq off
just checked that the previous connectx4 had those disabled:
ethtool --show-priv-flags enp175s0f0
Private flags for enp175s0f0:
rx_cqe_moder       : on
tx_cqe_moder       : off
rx_cqe_compress    : off
rx_striding_rq     : off
rx_no_csum_complete: off
So now we are on connectx5 and we have it enabled - for sure connectx5 changed cpu load - where I now have max 50/60% cpu, where with connectx4 there was sometimes near 100% with the same configuration.
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 10.11.2018 o 20:49, Paweł Staszewski pisze: W dniu 10.11.2018 o 20:34, Jesper Dangaard Brouer pisze: On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski wrote: W dniu 08.11.2018 o 20:12, Paweł Staszewski pisze: CPU load is lower than for connectx4 - but it looks like bandwidth limit is the same :) But also after reaching 60Gbit/60Gbit bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0: 45.09 Gb/s 15.09 Gb/s 60.18 Gb/s enp216s0: 15.14 Gb/s 45.19 Gb/s 60.33 Gb/s -- total: 60.45 Gb/s 60.48 Gb/s 120.93 Gb/s Today reached 65/65Gbit/s But starting from 60Gbit/s RX / 60Gbit TX nics start to drop packets (with 50%CPU on all 28cores) - so still there is cpu power to use :). This is weird! How do you see / measure these drops? Simple icmp test like ping -i 0.1 And im testing by icmp management ip address on vlan that is attacked to one NIC (the side that is more stressed with RX) And another icmp test is forward thru this router - host behind it Both measurements shows same loss ratio from 0.1 to 0.5% after reaching ~45Gbit/s RX side - depends how much RX side is pushed drops vary between 0.1 to 0.5 - even 0.6%:) So checked other stats. softnet_stats shows average 1k squeezed per sec: Is below output the raw counters? not per sec? It would be valuable to see the per sec stats instead... I use this tool: https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl CPU total/sec dropped/sec squeezed/sec collision/sec rx_rps/sec flow_limit/sec CPU:00 0 0 0 0 0 0 CPU:01 0 0 0 0 0 0 CPU:02 0 0 0 0 0 0 CPU:03 0 0 0 0 0 0 CPU:04 0 0 0 0 0 0 CPU:05 0 0 0 0 0 0 CPU:06 0 0 0 0 0 0 CPU:07 0 0 0 0 0 0 CPU:08 0 0 0 0 0 0 CPU:09 0 0 0 0 0 0 CPU:10 0 0 0 0 0 0 CPU:11 0 0 0 0 0 0 CPU:12 0 0 0 0 0 0 CPU:13 0 0 0 0 0 0 CPU:14 485538 0 43 0 0 0 CPU:15 474794 0 51 0 0 0 CPU:16 449322 0 41 0 0 0 CPU:17 476420 0 46 0 0 0 CPU:18 440436 0 38 0 0 0 CPU:19 501499 0 49 0 0 0 CPU:20 459468 0 49 0 0 0 CPU:21 438928 0 47 0 0 0 CPU:22 468983 0 40 0 0 0 CPU:23 446253 0 47 0 0 0 CPU:24 451909 0 46 0 0 0 CPU:25 479373 0 55 0 0 0 CPU:26 467848 0 49 0 0 0 CPU:27 453153 0 51 0 0 0 CPU:28 0 0 0 0 0 0 CPU:29 0 0 0 0 0 0 CPU:30 0 0 0 0 0 0 CPU:31 0 0 0 0 0 0 CPU:32 0 0 0 0 0 0 CPU:33 0 0 0 0 0 0 CPU:34 0
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 10.11.2018 o 20:34, Jesper Dangaard Brouer pisze: On Fri, 9 Nov 2018 23:20:38 +0100 Paweł Staszewski wrote: W dniu 08.11.2018 o 20:12, Paweł Staszewski pisze: CPU load is lower than for connectx4 - but it looks like bandwidth limit is the same :) But also after reaching 60Gbit/60Gbit bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0: 45.09 Gb/s 15.09 Gb/s 60.18 Gb/s enp216s0: 15.14 Gb/s 45.19 Gb/s 60.33 Gb/s -- total: 60.45 Gb/s 60.48 Gb/s 120.93 Gb/s Today reached 65/65Gbit/s But starting from 60Gbit/s RX / 60Gbit TX nics start to drop packets (with 50%CPU on all 28cores) - so still there is cpu power to use :). This is weird! How do you see / measure these drops? Simple icmp test like ping -i 0.1 And im testing by icmp management ip address on vlan that is attacked to one NIC (the side that is more stressed with RX) And another icmp test is forward thru this router - host behind it Both measurements shows same loss ratio from 0.1 to 0.5% after reaching ~45Gbit/s RX side - depends how much RX side is pushed drops vary between 0.1 to 0.5 - even 0.6%:) So checked other stats. softnet_stats shows average 1k squeezed per sec: Is below output the raw counters? not per sec? It would be valuable to see the per sec stats instead... I use this tool: https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl cpu total dropped squeezed collision rps flow_limit 0 18554 0 1 0 0 0 1 16728 0 1 0 0 0 2 18033 0 1 0 0 0 3 17757 0 1 0 0 0 4 18861 0 0 0 0 0 5 0 0 1 0 0 0 6 2 0 1 0 0 0 7 0 0 1 0 0 0 8 0 0 0 0 0 0 9 0 0 1 0 0 0 10 0 0 0 0 0 0 11 0 0 1 0 0 0 12 50 0 1 0 0 0 13 257 0 0 0 0 0 14 3629115363 0 3353259 0 0 0 15 255167835 0 3138271 0 0 0 16 4240101961 0 3036130 0 0 0 17 599810018 0 3072169 0 0 0 18 432796524 0 3034191 0 0 0 19 41803906 0 3037405 0 0 0 20 900382666 0 3112294 0 0 0 21 620926085 0 3086009 0 0 0 22 41861198 0 3023142 0 0 0 23 4090425574 0 2990412 0 0 0 24 4264870218 0 3010272 0 0 0 25 141401811 0 3027153 0 0 0 26 104155188 0 3051251 0 0 0 27 4261258691 0 3039765 0 0 0 28 4 0 1 0 0 0 29 4 0 0 0 0 0 30 0 0 1 0 0 0 31 0 0 0 0 0 0 32 3 0 1 0 0 0 33 1 0 1 0 0 0 34 0 0 1 0 0 0 35 0 0 0 0 0 0 36 0 0 1 0 0 0 37 0 0 1 0 0 0 38 0 0 1 0 0 0 39 0 0 1 0 0 0 40 0 0 0 0 0 0 41 0 0 1 0 0 0 42 299758202 0 3139693 0 0 0 43 4254727979 0 3103577 0 0 0 44 195943 0 2554885 0 0 0 45 1675702723 0 2513481 0 0 0 46 1908435503 0 2519698 0 0 0 47 1877799710 0 2537768 0 0 0 48 2384274076 0 2584673 0 0 0 49 2598104878 0 2593616 0 0 0 50 1897566829
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 10.11.2018 at 01:06, David Ahern wrote:
On 11/9/18 9:21 AM, David Ahern wrote:
Is it possible to add only counters from xdp for vlans? This will help me in testing.
I will take a look today at adding counters that you can dump using bpftool. It will be a temporary solution for this xdp program only.
Same tree, kernel-tables-wip-02 branch. Compile kernel and install. Compile samples as before. If you give the userspace program a -t arg, it loops showing stats. Ctrl-C to break. The xdp programs are not detached on exit. Example:
./xdp_fwd -t 5 eth1 eth2 eth3 eth4
15:59:32: rx tx dropped skipped l3_dev fib_dev
index 3: 901158 9011580 18 0 0
index 4: 901159 9011580 20 0 901139
index 10:0 00019 19
index 11:0 000901139 901139
index 15:0 00019 19
index 16:0 000901139 0
Rx and Tx counters are for the physical port. VLANs show up as l3_dev (ingress) and fib_dev (egress). dropped is anytime the xdp program returns XDP_DROP (e.g., an invalid packet); skipped is anytime the program returns XDP_PASS (e.g., not ipv4 or ipv6, local traffic, or needs full stack assist).
recompiled the new version, but:
./xdp_fwd enp175s0f0 enp175s0f1
libbpf: failed to create map (name: 'stats_map'): Operation not permitted
libbpf: failed to load object './xdp_fwd_kern.o'
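One likely cause of that EPERM, though this is only a guess without seeing the loader: BPF map memory is charged against RLIMIT_MEMLOCK, and the common 64 KiB default makes libbpf map creation fail with "Operation not permitted". The bpf samples normally raise the limit before loading, roughly like this (or, equivalently, run with ulimit -l unlimited):

#include <stdio.h>
#include <sys/resource.h>

/* Raise RLIMIT_MEMLOCK before creating BPF maps; map memory is charged
 * against this limit and the default is easy to exceed. */
static int bump_memlock_rlimit(void)
{
	struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
		perror("setrlimit(RLIMIT_MEMLOCK)");
		return -1;
	}
	return 0;
}

int main(void)
{
	if (bump_memlock_rlimit())
		return 1;
	/* ...then open and load the BPF object with libbpf as usual... */
	puts("RLIMIT_MEMLOCK raised");
	return 0;
}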
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 08.11.2018 o 20:12, Paweł Staszewski pisze: CPU load is lower than for connectx4 - but it looks like bandwidth limit is the same :) But also after reaching 60Gbit/60Gbit bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0: 45.09 Gb/s 15.09 Gb/s 60.18 Gb/s enp216s0: 15.14 Gb/s 45.19 Gb/s 60.33 Gb/s -- total: 60.45 Gb/s 60.48 Gb/s 120.93 Gb/s Today reached 65/65Gbit/s But starting from 60Gbit/s RX / 60Gbit TX nics start to drop packets (with 50%CPU on all 28cores) - so still there is cpu power to use :). So checked other stats. softnet_stats shows average 1k squeezed per sec: cpu total dropped squeezed collision rps flow_limit 0 18554 0 1 0 0 0 1 16728 0 1 0 0 0 2 18033 0 1 0 0 0 3 17757 0 1 0 0 0 4 18861 0 0 0 0 0 5 0 0 1 0 0 0 6 2 0 1 0 0 0 7 0 0 1 0 0 0 8 0 0 0 0 0 0 9 0 0 1 0 0 0 10 0 0 0 0 0 0 11 0 0 1 0 0 0 12 50 0 1 0 0 0 13 257 0 0 0 0 0 14 3629115363 0 3353259 0 0 0 15 255167835 0 3138271 0 0 0 16 4240101961 0 3036130 0 0 0 17 599810018 0 3072169 0 0 0 18 432796524 0 3034191 0 0 0 19 41803906 0 3037405 0 0 0 20 900382666 0 3112294 0 0 0 21 620926085 0 3086009 0 0 0 22 41861198 0 3023142 0 0 0 23 4090425574 0 2990412 0 0 0 24 4264870218 0 3010272 0 0 0 25 141401811 0 3027153 0 0 0 26 104155188 0 3051251 0 0 0 27 4261258691 0 3039765 0 0 0 28 4 0 1 0 0 0 29 4 0 0 0 0 0 30 0 0 1 0 0 0 31 0 0 0 0 0 0 32 3 0 1 0 0 0 33 1 0 1 0 0 0 34 0 0 1 0 0 0 35 0 0 0 0 0 0 36 0 0 1 0 0 0 37 0 0 1 0 0 0 38 0 0 1 0 0 0 39 0 0 1 0 0 0 40 0 0 0 0 0 0 41 0 0 1 0 0 0 42 299758202 0 3139693 0 0 0 43 4254727979 0 3103577 0 0 0 44 195943 0 2554885 0 0 0 45 1675702723 0 2513481 0 0 0 46 1908435503 0 2519698 0 0 0 47 1877799710 0 2537768 0 0 0 48 2384274076 0 2584673 0 0 0 49 2598104878 0 2593616 0 0 0 50 1897566829 0 2530857 0 0 0 51 1712741629 0 2489089 0 0 0 52 1704033648 0 2495892 0 0 0 53 1636781820 0 2499783 0 0 0 54 1861997734 0 2541060 0 0 0 55 2113521616 0 2555673 0 0 0 So i rised netdev backlog and budged to rly high values 524288 for netdev_budget and same for backlog This rised sortirqs from about 600k/sec to 800k/sec for NET_TX/NET_RX But after this changes i have less packets drops. Below perf top from max traffic reached: PerfTop: 72230 irqs/sec kernel:99.4% exact: 0.0% [4000Hz cycles], (al
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 09.11.2018 o 17:21, David Ahern pisze: On 11/9/18 3:20 AM, Paweł Staszewski wrote: I just catch some weird behavior :) All was working fine for about 20k packets Then after xdp start to forward every 10 packets Interesting. Any counter showing drops? nothing that will fit NIC statistics: rx_packets: 187041 rx_bytes: 10600954 tx_packets: 40316 tx_bytes: 16526844 tx_tso_packets: 797 tx_tso_bytes: 3876084 tx_tso_inner_packets: 0 tx_tso_inner_bytes: 0 tx_added_vlan_packets: 38391 tx_nop: 2 rx_lro_packets: 0 rx_lro_bytes: 0 rx_ecn_mark: 0 rx_removed_vlan_packets: 187041 rx_csum_unnecessary: 0 rx_csum_none: 150011 rx_csum_complete: 37030 rx_csum_unnecessary_inner: 0 rx_xdp_drop: 0 rx_xdp_redirect: 64893 rx_xdp_tx_xmit: 0 rx_xdp_tx_full: 0 rx_xdp_tx_err: 0 rx_xdp_tx_cqe: 0 tx_csum_none: 2468 tx_csum_partial: 35955 tx_csum_partial_inner: 0 tx_queue_stopped: 0 tx_queue_dropped: 0 tx_xmit_more: 0 tx_recover: 0 tx_cqes: 38423 tx_queue_wake: 0 tx_udp_seg_rem: 0 tx_cqe_err: 0 tx_xdp_xmit: 0 tx_xdp_full: 0 tx_xdp_err: 0 tx_xdp_cqes: 0 rx_wqe_err: 0 rx_mpwqe_filler_cqes: 0 rx_mpwqe_filler_strides: 0 rx_buff_alloc_err: 0 rx_cqe_compress_blks: 0 rx_cqe_compress_pkts: 0 rx_page_reuse: 0 rx_cache_reuse: 186302 rx_cache_full: 0 rx_cache_empty: 666768 rx_cache_busy: 174 rx_cache_waive: 0 rx_congst_umr: 0 rx_arfs_err: 0 ch_events: 249320 ch_poll: 249321 ch_arm: 249001 ch_aff_change: 0 ch_eq_rearm: 0 rx_out_of_buffer: 0 rx_if_down_packets: 57 rx_vport_unicast_packets: 142659 rx_vport_unicast_bytes: 42706914 tx_vport_unicast_packets: 40167 tx_vport_unicast_bytes: 16668096 rx_vport_multicast_packets: 39188170 rx_vport_multicast_bytes: 3466527450 tx_vport_multicast_packets: 58 tx_vport_multicast_bytes: 4556 rx_vport_broadcast_packets: 16343520 rx_vport_broadcast_bytes: 1031334602 tx_vport_broadcast_packets: 91 tx_vport_broadcast_bytes: 5460 rx_vport_rdma_unicast_packets: 0 rx_vport_rdma_unicast_bytes: 0 tx_vport_rdma_unicast_packets: 0 tx_vport_rdma_unicast_bytes: 0 rx_vport_rdma_multicast_packets: 0 rx_vport_rdma_multicast_bytes: 0 tx_vport_rdma_multicast_packets: 0 tx_vport_rdma_multicast_bytes: 0 tx_packets_phy: 40316 rx_packets_phy: 55674361 rx_crc_errors_phy: 0 tx_bytes_phy: 16839376 rx_bytes_phy: 4763267396 tx_multicast_phy: 58 tx_broadcast_phy: 91 rx_multicast_phy: 39188180 rx_broadcast_phy: 16343521 rx_in_range_len_errors_phy: 0 rx_out_of_range_len_phy: 0 rx_oversize_pkts_phy: 0 rx_symbol_err_phy: 0 tx_mac_control_phy: 0 rx_mac_control_phy: 0 rx_unsupported_op_phy: 0 rx_pause_ctrl_phy: 0 tx_pause_ctrl_phy: 0 rx_discards_phy: 1 tx_discards_phy: 0 tx_errors_phy: 0 rx_undersize_pkts_phy: 0 rx_fragments_phy: 0 rx_jabbers_phy: 0 rx_64_bytes_phy: 3792455 rx_65_to_127_bytes_phy: 51821620 rx_128_to_255_bytes_phy: 37669 rx_256_to_511_bytes_phy: 1481 rx_512_to_1023_bytes_phy: 434 rx_1024_to_1518_bytes_phy: 694 rx_1519_to_2047_bytes_phy: 20008 rx_2048_to_4095_bytes_phy: 0 rx_4096_to_8191_bytes_phy: 0 rx_8192_to_10239_bytes_phy: 0 link_down_events_phy: 0 rx_pcs_symbol_err_phy: 0 rx_corrected_bits_phy: 6 rx_err_lane_0_phy: 0 rx_err_lane_1_phy: 0 rx_err_lane_2_phy: 0 rx_err_lane_3_phy: 6 rx_buffer_passed_thres_phy: 0 rx_pci_signal_integrity: 0 tx_pci_signal_integrity: 82 outbound_pci_stalled_rd: 0 outbound_pci_stalled_wr: 0 outbound_pci_stalled_rd_events: 0 outbound_pci_stalled_wr_events: 0 rx_prio0_bytes: 4144920388 rx_prio0_packets: 48310037 tx_prio0_bytes: 16839376 tx_prio0_packets: 40316 rx_prio1_bytes: 481032 rx_prio1_packets: 7074 tx_prio1_bytes: 0 tx_prio1_packets: 0 rx_prio2_bytes: 9074194 
rx_prio2_packets: 106207 tx_prio2_bytes: 0 tx_prio2_packets: 0 rx_prio3_bytes: 0 rx_prio3_packets: 0 tx_prio3_bytes: 0 tx_prio3_packets: 0 rx_prio4_bytes: 0 rx_prio4_packets: 0 tx_prio4_bytes: 0 tx_prio4_packets: 0 rx_prio5_bytes: 0 rx_prio5_packets: 0 tx_prio5_bytes: 0 tx_prio5_packets: 0 rx_prio6_bytes: 371961810 rx_prio6_packets: 4006281 tx_prio6_bytes: 0 tx_prio6_packets: 0 rx_prio7_bytes: 236830040 rx_prio7_packets: 3244761 tx_prio7_bytes: 0 tx_prio7_packets: 0 tx_pause_storm_warning_events : 0 tx_pause_storm_error_events: 0 module_unplug: 0 module_bus_stuck: 0 module_high_temp: 0 module_bad_shorted: 0 NIC
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 08.11.2018 o 17:06, David Ahern pisze: On 11/8/18 6:33 AM, Paweł Staszewski wrote: W dniu 07.11.2018 o 22:06, David Ahern pisze: On 11/3/18 6:24 PM, Paweł Staszewski wrote: Does your setup have any other device types besides physical ports with VLANs (e.g., any macvlans or bonds)? no. just phy(mlnx)->vlans only config VLAN and non-VLAN (and a mix) seem to work ok. Patches are here: https://github.com/dsahern/linux.git bpf/kernel-tables-wip I got lazy with the vlan exports; right now it requires 8021q to be builtin (CONFIG_VLAN_8021Q=y) You can use the xdp_fwd sample: make O=kbuild -C samples/bpf -j 8 Copy samples/bpf/xdp_fwd_kern.o and samples/bpf/xdp_fwd to the server and run: ./xdp_fwd e.g., in my testing I run: xdp_fwd eth1 eth2 eth3 eth4 All of the relevant forwarding ports need to be on the same command line. This version populates a second map to verify the egress port has XDP enabled. Installed today on some lab server with mellanox connectx4 And trying some simple static routing first - but after enabling xdp program - receiver is not receiving frames Route table is simple as possible for tests :) icmp ping test send from 192.168.22.237 to 172.16.0.2 - incomming packets on vlan 4081 ip r default via 192.168.22.236 dev vlan4081 172.16.0.0/30 dev vlan1740 proto kernel scope link src 172.16.0.1 192.168.22.0/24 dev vlan4081 proto kernel scope link src 192.168.22.205 neigh table: ip neigh ls 192.168.22.237 dev vlan4081 lladdr 00:25:90:fb:a6:8d REACHABLE 172.16.0.2 dev vlan1740 lladdr ac:1f:6b:2c:2e:5a REACHABLE and interfaces: 4: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 xdp/id:5 qdisc mq state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff inet 192.168.22.205/24 scope global vlan4081 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c890/64 scope link valid_lft forever preferred_lft forever 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet 172.16.0.1/30 scope global vlan1740 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever xdp program detached: Receiving side tcpdump: 14:28:09.141233 IP 192.168.22.237 > 172.16.0.2: ICMP echo request, id 30227, seq 487, length 64 I can see icmp requests enabling xdp ./xdp_fwd enp175s0f1 enp175s0f0 4: enp175s0f0: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 5: enp175s0f1: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff What hardware is this? Start with: echo 1 > /sys/kernel/debug/tracing/events/xdp/enable cat /sys/kernel/debug/tracing/trace_pipe >From there, you can check the FIB lookups: sysctl -w kernel.perf_event_max_stack=16 perf record -e fib:* -a -g -- sleep 5 perf script I just catch some weird behavior :) All was working fine for about 20k packets Then after xdp start to forward every 10 packets ping 172.16.0.2 -i 0.1 PING 172.16.0.2 (172.16.0.2) 56(84) bytes of data. 64 bytes from 172.16.0.2: icmp_seq=1 ttl=64 time=5.12 ms 64 bytes from 172.16.0.2: icmp_seq=9 ttl=64 time=5.20 ms 64 bytes from 172.16.0.2: icmp_seq=19 ttl=64 time=4.85 ms 64 bytes from 172.16.0.2: icmp_seq=29 ttl=64 time=4.91 ms 64 bytes from 172.16.0.2: icmp_seq=38 ttl=64 time=4.85 ms 64 bytes from 172.16.0.2: icmp_seq=48 ttl=64 time=5.00 ms ^C --- 172.16.0.2 ping statistics --- 55 packets transmitted, 6 received, 89% packet loss, time 5655ms rtt min/avg/max/mdev = 4.850/4.992/5.203/0.145 ms And again after some time back to normal ping 172.1
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 09.11.2018 at 05:52, Saeed Mahameed wrote:
On Thu, 2018-11-08 at 17:42 -0700, David Ahern wrote:
On 11/8/18 5:40 PM, Paweł Staszewski wrote:
On 08.11.2018 at 17:32, David Ahern wrote:
On 11/8/18 9:27 AM, Paweł Staszewski wrote:
What hardware is this?
mellanox connectx 4
ethtool -i enp175s0f0
driver: mlx5_core
version: 5.0-0
firmware-version: 12.21.1000 (SM_200101033)
expansion-rom-version:
bus-info: :af:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
ethtool -i enp175s0f1
driver: mlx5_core
version: 5.0-0
firmware-version: 12.21.1000 (SM_200101033)
expansion-rom-version:
bus-info: :af:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
Start with:
echo 1 > /sys/kernel/debug/tracing/events/xdp/enable
cat /sys/kernel/debug/tracing/trace_pipe
cat /sys/kernel/debug/tracing/trace_pipe
-0 [045] ..s. 68469.467752: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6
FIB lookup is good, the redirect is happening, but the mlx5 driver does not like it. I think the -6 is coming from the mlx5 driver and the packet is getting dropped. Perhaps this check in mlx5e_xdp_xmit:
if (unlikely(sq_num >= priv->channels.num)) return -ENXIO;
I removed that part and recompiled - but after running xdp_fwd I now have a kernel panic :)
hh, no please don't do such a thing :)
yes - a dirty "try" :) Code back in place :)
It must be because the tx netdev has fewer tx queues than the rx netdev, or the rx netdev rings are bound to high cpu indexes. Anyway, best practice is to open #cores RX/TX queues on both sides:
ethtool -L enp175s0f0 combined $(nproc)
ethtool -L enp175s0f1 combined $(nproc)
Ok, now it is working. Time for some tests :) Thanks
Jesper or one of the Mellanox folks needs to respond about the config needed to run XDP with this NIC. I don't have a 40G or 100G card to play with.
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 08.11.2018 at 17:32, David Ahern wrote:
On 11/8/18 9:27 AM, Paweł Staszewski wrote:
What hardware is this?
mellanox connectx 4
ethtool -i enp175s0f0
driver: mlx5_core
version: 5.0-0
firmware-version: 12.21.1000 (SM_200101033)
expansion-rom-version:
bus-info: :af:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
ethtool -i enp175s0f1
driver: mlx5_core
version: 5.0-0
firmware-version: 12.21.1000 (SM_200101033)
expansion-rom-version:
bus-info: :af:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
Start with:
echo 1 > /sys/kernel/debug/tracing/events/xdp/enable
cat /sys/kernel/debug/tracing/trace_pipe
cat /sys/kernel/debug/tracing/trace_pipe
-0 [045] ..s. 68469.467752: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6
FIB lookup is good, the redirect is happening, but the mlx5 driver does not like it. I think the -6 is coming from the mlx5 driver and the packet is getting dropped. Perhaps this check in mlx5e_xdp_xmit:
if (unlikely(sq_num >= priv->channels.num)) return -ENXIO;
I removed that part and recompiled - but after running xdp_fwd I now have a kernel panic :)
swapper 0 [045] 68493.746274: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
swapper 0 [045] 68494.770287: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
swapper 0 [045] 68495.794304: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
swapper 0 [045] 68496.818308: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
swapper 0 [045] 68497.842313: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
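For readers who have not looked at the xdp_fwd sample: the fib:fib_table_lookup events above come from the bpf_fib_lookup() helper, and the core forwarding pattern looks roughly like this trimmed sketch (illustrative only; it is not the code from the kernel-tables branch and it skips the VLAN handling that branch adds):

/* xdp_fwd_sketch.c - trimmed-down illustration of FIB-based forwarding in XDP */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2
#endif

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 512);
	__type(key, __u32);
	__type(value, __u32);
} tx_port SEC(".maps");		/* userspace fills this with ifindex -> ifindex */

SEC("xdp")
int xdp_fwd_sketch(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	struct iphdr *iph = data + sizeof(*eth);
	struct bpf_fib_lookup fib = {};

	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;	/* IPv4 only, no VLAN handling here */

	fib.family      = AF_INET;
	fib.tos         = iph->tos;
	fib.l4_protocol = iph->protocol;
	fib.tot_len     = bpf_ntohs(iph->tot_len);
	fib.ipv4_src    = iph->saddr;
	fib.ipv4_dst    = iph->daddr;
	fib.ifindex     = ctx->ingress_ifindex;

	/* Same FIB the stack uses; this is what shows up as fib:fib_table_lookup. */
	if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) != BPF_FIB_LKUP_RET_SUCCESS)
		return XDP_PASS;	/* punt anything we cannot forward */

	__builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
	__builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);

	return bpf_redirect_map(&tx_port, fib.ifindex, 0);
}

char _license[] SEC("license") = "GPL";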
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 03.11.2018 o 01:18, Paweł Staszewski pisze: W dniu 01.11.2018 o 21:37, Saeed Mahameed pisze: On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote: W dniu 01.11.2018 o 10:50, Saeed Mahameed pisze: On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total = = enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s --- --- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total = = enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s --- --- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Where do you see 40 Gb/s ? you showed that both ports on the same NIC ( same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25 Gb/s which aligns with your pcie link limit, what am i missing ? hmm yes that was my concern also - cause cant find anywhere informations about that bandwidth is uni or bidirectional - so if 126Gbit for x16 8GT is unidir - then bidir will be 126/2 ~68Gbit - which will fit total bw on both ports i think it is bidir So yes - we are hitting there other problem i think pcie is most probabbly bidirectional max bw 126Gbit so RX 126Gbit and at same time TX should be 126Gbit So one 2-port 100G card connectx4 replaced with two separate connectx5 placed in two different pcie x16 gen 3.0 lspci -vvv -s af:00.0 af:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5] Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- Latency: 0, Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 90 NUMA node: 1 Region 0: Memory at 39bffe00 (64-bit, prefetchable) [size=32M] Expansion ROM at ee60 [disabled] [size=1M] Capabilities: [60] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset- MaxPayload 256 bytes, MaxReadReq 4096 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- 
BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB Ln
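As a back-of-the-envelope check on the PCIe question above (general PCIe arithmetic, not a measurement from this thread): PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding, so an x16 link carries roughly 8 GT/s * 16 lanes * 128/130 = ~126 Gbit/s in each direction, and the two directions are independent (full duplex), minus a few percent more for TLP/DLLP framing overhead.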
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 08.11.2018 o 17:32, David Ahern pisze: On 11/8/18 9:27 AM, Paweł Staszewski wrote: What hardware is this? mellanox connectx 4 ethtool -i enp175s0f0 driver: mlx5_core version: 5.0-0 firmware-version: 12.21.1000 (SM_200101033) expansion-rom-version: bus-info: :af:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: no supports-register-dump: no supports-priv-flags: yes ethtool -i enp175s0f1 driver: mlx5_core version: 5.0-0 firmware-version: 12.21.1000 (SM_200101033) expansion-rom-version: bus-info: :af:00.1 supports-statistics: yes supports-test: yes supports-eeprom-access: no supports-register-dump: no supports-priv-flags: yes Start with: echo 1 > /sys/kernel/debug/tracing/events/xdp/enable cat /sys/kernel/debug/tracing/trace_pipe cat /sys/kernel/debug/tracing/trace_pipe -0 [045] ..s. 68469.467752: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6 FIB lookup is good, the redirect is happening, but the mlx5 driver does not like it. I think the -6 is coming from the mlx5 driver and the packet is getting dropped. Perhaps this check in mlx5e_xdp_xmit: if (unlikely(sq_num >= priv->channels.num)) return -ENXIO; Wondering about this: swapper 0 [045] 68494.770287: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms]) oif 0 ? Is that correct here ? swapper 0 [045] 68493.746274: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms]) swapper 0 [045] 68494.770287: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms]) swapper 0 [045] 68495.794304: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms]) swapper 0 [045] 68496.818308: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms]) swapper 0 [045] 68497.842313: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0 7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
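A side note on the trace above: err=-6 is -ENXIO, which is exactly what the quoted mlx5e_xdp_xmit check returns. As far as I can tell the driver picks its XDP send queue from the current CPU number, so a redirect processed on a CPU index at or above the number of configured channels (28 RSS queues bound to the local NUMA node here, while the trace runs on CPU 045) has no queue to transmit from. A minimal stand-alone illustration of that failure mode (my own sketch, not the driver code):

/* Illustration only: why xdp_devmap_xmit shows err=-6 (-ENXIO) while the
 * preceding xdp_redirect_map shows err=0.  Mirrors the quoted check
 * "if (unlikely(sq_num >= priv->channels.num)) return -ENXIO;" with the
 * SQ index assumed to come from the CPU the redirect runs on. */
#include <errno.h>
#include <stdio.h>

static int pick_xdp_sq(int cpu, int num_channels)
{
        if (cpu >= num_channels)
                return -ENXIO;          /* -6, as seen in the trace      */
        return cpu;                     /* index of the per-CPU XDP SQ   */
}

int main(void)
{
        /* 28 channels (RSS queues bound to the local NUMA node) but the
         * redirect is processed on CPU 45 -> no SQ for that CPU.        */
        printf("cpu 45, 28 channels -> %d\n", pick_xdp_sq(45, 28));  /* -6 */
        printf("cpu 10, 28 channels -> %d\n", pick_xdp_sq(10, 28));  /* 10 */
        return 0;
}

If that is what is happening, raising the channel count (ethtool -L) so it covers every CPU that can end up processing the redirect would be one quick way to confirm.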
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 08.11.2018 o 17:25, Paweł Staszewski pisze: W dniu 08.11.2018 o 17:06, David Ahern pisze: On 11/8/18 6:33 AM, Paweł Staszewski wrote: W dniu 07.11.2018 o 22:06, David Ahern pisze: On 11/3/18 6:24 PM, Paweł Staszewski wrote: Does your setup have any other device types besides physical ports with VLANs (e.g., any macvlans or bonds)? no. just phy(mlnx)->vlans only config VLAN and non-VLAN (and a mix) seem to work ok. Patches are here: https://github.com/dsahern/linux.git bpf/kernel-tables-wip I got lazy with the vlan exports; right now it requires 8021q to be builtin (CONFIG_VLAN_8021Q=y) You can use the xdp_fwd sample: make O=kbuild -C samples/bpf -j 8 Copy samples/bpf/xdp_fwd_kern.o and samples/bpf/xdp_fwd to the server and run: ./xdp_fwd e.g., in my testing I run: xdp_fwd eth1 eth2 eth3 eth4 All of the relevant forwarding ports need to be on the same command line. This version populates a second map to verify the egress port has XDP enabled. Installed today on some lab server with mellanox connectx4 And trying some simple static routing first - but after enabling xdp program - receiver is not receiving frames Route table is simple as possible for tests :) icmp ping test send from 192.168.22.237 to 172.16.0.2 - incomming packets on vlan 4081 ip r default via 192.168.22.236 dev vlan4081 172.16.0.0/30 dev vlan1740 proto kernel scope link src 172.16.0.1 192.168.22.0/24 dev vlan4081 proto kernel scope link src 192.168.22.205 neigh table: ip neigh ls 192.168.22.237 dev vlan4081 lladdr 00:25:90:fb:a6:8d REACHABLE 172.16.0.2 dev vlan1740 lladdr ac:1f:6b:2c:2e:5a REACHABLE and interfaces: 4: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 xdp/id:5 qdisc mq state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff inet 192.168.22.205/24 scope global vlan4081 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c890/64 scope link valid_lft forever preferred_lft forever 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet 172.16.0.1/30 scope global vlan1740 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever xdp program detached: Receiving side tcpdump: 14:28:09.141233 IP 192.168.22.237 > 172.16.0.2: ICMP echo request, id 30227, seq 487, length 64 I can see icmp requests enabling xdp ./xdp_fwd enp175s0f1 enp175s0f0 4: enp175s0f0: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 5: enp175s0f1: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 6: vlan4081@enp175s0f0: mtu 
1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff What hardware is this? mellanox connectx 4 ethtool -i enp175s0f0 driver: mlx5_core version: 5.0-0 firmware-version: 12.21.1000 (SM_200101033) expansion-rom-version: bus-info: :af:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: no supports-register-dump: no supports-priv-flags: yes ethtool -i enp175s0f1 driver: mlx5_core version: 5.0-0 firmware-version: 12.21.1000 (SM_200101033) expansion-rom-version: bus-info: :af:00.1 supports-statistics: yes supports-test: yes supports-eeprom-access: no supports-register-dump: no supports-priv-flags: yes Start with: echo 1 > /sys/kernel/debug/tracing/events/xdp/enable cat /sys/kernel/debug/tracing/trace_pipe cat /sys/kernel/debug/tracing/trace_pipe -0 [045] ..s. 68469.467752: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6 -0 [045] ..s. 68470.483836: xdp_red
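For anyone reading along without the sample handy, the per-packet logic of xdp_fwd is roughly the following (a condensed sketch written against current libbpf headers; the real samples/bpf/xdp_fwd_kern.c also handles IPv6, fills in more of the fib lookup fields and consults the second map that verifies the egress port has XDP enabled):

/* Condensed sketch of the xdp_fwd forwarding path: FIB lookup in the
 * kernel routing/neighbour tables, rewrite MACs, redirect via a devmap.
 * Details differ from the real samples/bpf/xdp_fwd_kern.c. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define AF_INET 2                       /* not available in BPF includes  */

struct {
        __uint(type, BPF_MAP_TYPE_DEVMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, __u32);           /* loader fills ifindex -> ifindex */
} xdp_tx_ports SEC(".maps");

SEC("xdp")
int xdp_fwd_sketch(struct xdp_md *ctx)
{
        void *data_end = (void *)(long)ctx->data_end;
        void *data     = (void *)(long)ctx->data;
        struct ethhdr *eth = data;
        struct iphdr *iph  = data + sizeof(*eth);
        struct bpf_fib_lookup fib = {};

        if ((void *)(iph + 1) > data_end)
                return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
                return XDP_PASS;        /* real sample also handles IPv6  */

        fib.family   = AF_INET;
        fib.ipv4_src = iph->saddr;
        fib.ipv4_dst = iph->daddr;
        fib.ifindex  = ctx->ingress_ifindex;

        /* Ask the kernel FIB + neighbour tables for egress port and MACs. */
        if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) !=
            BPF_FIB_LKUP_RET_SUCCESS)
                return XDP_PASS;        /* fall back to the normal stack  */

        __builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
        __builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);

        /* Hand the frame to the egress device's ndo_xdp_xmit via devmap;
         * this is where the xdp_devmap_xmit tracepoint fires.            */
        return bpf_redirect_map(&xdp_tx_ports, fib.ifindex, 0);
}

char _license[] SEC("license") = "GPL";

The key point for the problem above is the last step: bpf_redirect_map() only queues the frame, the actual transmit happens later through the egress device's ndo_xdp_xmit, which is where the err=-6 shows up.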
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 08.11.2018 o 17:06, David Ahern pisze: On 11/8/18 6:33 AM, Paweł Staszewski wrote: W dniu 07.11.2018 o 22:06, David Ahern pisze: On 11/3/18 6:24 PM, Paweł Staszewski wrote: Does your setup have any other device types besides physical ports with VLANs (e.g., any macvlans or bonds)? no. just phy(mlnx)->vlans only config VLAN and non-VLAN (and a mix) seem to work ok. Patches are here: https://github.com/dsahern/linux.git bpf/kernel-tables-wip I got lazy with the vlan exports; right now it requires 8021q to be builtin (CONFIG_VLAN_8021Q=y) You can use the xdp_fwd sample: make O=kbuild -C samples/bpf -j 8 Copy samples/bpf/xdp_fwd_kern.o and samples/bpf/xdp_fwd to the server and run: ./xdp_fwd e.g., in my testing I run: xdp_fwd eth1 eth2 eth3 eth4 All of the relevant forwarding ports need to be on the same command line. This version populates a second map to verify the egress port has XDP enabled. Installed today on some lab server with mellanox connectx4 And trying some simple static routing first - but after enabling xdp program - receiver is not receiving frames Route table is simple as possible for tests :) icmp ping test send from 192.168.22.237 to 172.16.0.2 - incomming packets on vlan 4081 ip r default via 192.168.22.236 dev vlan4081 172.16.0.0/30 dev vlan1740 proto kernel scope link src 172.16.0.1 192.168.22.0/24 dev vlan4081 proto kernel scope link src 192.168.22.205 neigh table: ip neigh ls 192.168.22.237 dev vlan4081 lladdr 00:25:90:fb:a6:8d REACHABLE 172.16.0.2 dev vlan1740 lladdr ac:1f:6b:2c:2e:5a REACHABLE and interfaces: 4: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 xdp/id:5 qdisc mq state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff inet 192.168.22.205/24 scope global vlan4081 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c890/64 scope link valid_lft forever preferred_lft forever 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet 172.16.0.1/30 scope global vlan1740 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever xdp program detached: Receiving side tcpdump: 14:28:09.141233 IP 192.168.22.237 > 172.16.0.2: ICMP echo request, id 30227, seq 487, length 64 I can see icmp requests enabling xdp ./xdp_fwd enp175s0f1 enp175s0f0 4: enp175s0f0: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 5: enp175s0f1: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff What hardware is this? Start with: echo 1 > /sys/kernel/debug/tracing/events/xdp/enable cat /sys/kernel/debug/tracing/trace_pipe cat /sys/kernel/debug/tracing/trace_pipe -0 [045] ..s. 68469.467752: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6 -0 [045] ..s. 68470.483836: xdp_redirect_map: prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 map_index=5 -0 [045] ..s. 68470.483837: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6 -0 [045] ..s. 68471.503853: xdp_redirect_map: prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 map_index=5 -0 [045] ..s. 68471.503853: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6 -0 [045] ..s. 68472.527871: xdp_redirect_map: prog_id=30 action=REDIRECT
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 08.11.2018 o 01:59, Paweł Staszewski pisze: W dniu 05.11.2018 o 21:17, Jesper Dangaard Brouer pisze: On Sun, 4 Nov 2018 01:24:03 +0100 Paweł Staszewski wrote: And today again after allpy patch for page allocator - reached again 64/64 Gbit/s with only 50-60% cpu load Great. today no slowpath hit for netwoking :) But again dropped pckt at 64GbitRX and 64TX And as it should not be pcie express limit -i think something more is Well, this does sounds like a PCIe bandwidth limit to me. See the PCIe BW here: https://en.wikipedia.org/wiki/PCI_Express You likely have PCIe v3, where 1-lane have 984.6 MBytes/s or 7.87 Gbit/s Thus, x16-lanes have 15.75 GBytes or 126 Gbit/s. It does say "in each direction", but you are also forwarding this RX->TX on both (dual) ports NIC that is sharing the same PCIe slot. Network controller changed from 2-port 100G connectx4 to 2 separate cards 100G connectx5 PerfTop: 92239 irqs/sec kernel:99.4% exact: 0.0% [4000Hz cycles], (all, 56 CPUs) --- 6.65% [kernel] [k] irq_entries_start 5.57% [kernel] [k] tasklet_action_common.isra.21 4.60% [kernel] [k] mlx5_eq_int 4.04% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear 3.66% [kernel] [k] _raw_spin_lock_irqsave 3.58% [kernel] [k] mlx5e_sq_xmit 2.66% [kernel] [k] fib_table_lookup 2.52% [kernel] [k] _raw_spin_lock 2.51% [kernel] [k] build_skb 2.50% [kernel] [k] _raw_spin_lock_irq 2.04% [kernel] [k] try_to_wake_up 1.83% [kernel] [k] queued_spin_lock_slowpath 1.81% [kernel] [k] mlx5e_poll_tx_cq 1.65% [kernel] [k] do_idle 1.50% [kernel] [k] mlx5e_poll_rx_cq 1.34% [kernel] [k] __sched_text_start 1.32% [kernel] [k] cmd_exec 1.30% [kernel] [k] cmd_work_handler 1.16% [kernel] [k] vlan_do_receive 1.15% [kernel] [k] memcpy_erms 1.15% [kernel] [k] __dev_queue_xmit 1.07% [kernel] [k] mlx5_cmd_comp_handler 1.06% [kernel] [k] sched_ttwu_pending 1.00% [kernel] [k] ipt_do_table 0.98% [kernel] [k] ip_finish_output2 0.92% [kernel] [k] pfifo_fast_dequeue 0.88% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq 0.78% [kernel] [k] dev_gro_receive 0.78% [kernel] [k] mlx5e_napi_poll 0.76% [kernel] [k] mlx5e_post_rx_mpwqes 0.70% [kernel] [k] process_one_work 0.67% [kernel] [k] __netif_receive_skb_core 0.65% [kernel] [k] __build_skb 0.63% [kernel] [k] llist_add_batch 0.62% [kernel] [k] tcp_gro_receive 0.60% [kernel] [k] inet_gro_receive 0.59% [kernel] [k] ip_route_input_rcu 0.59% [kernel] [k] rcu_irq_exit 0.56% [kernel] [k] napi_complete_done 0.52% [kernel] [k] kmem_cache_alloc 0.48% [kernel] [k] __softirqentry_text_start 0.48% [kernel] [k] mlx5e_xmit 0.47% [kernel] [k] __queue_work 0.46% [kernel] [k] memset_erms 0.46% [kernel] [k] dev_hard_start_xmit 0.45% [kernel] [k] insert_work 0.45% [kernel] [k] enqueue_task_fair 0.44% [kernel] [k] __wake_up_common 0.43% [kernel] [k] finish_task_switch 0.43% [kernel] [k] kmem_cache_free_bulk 0.42% [kernel] [k] ip_forward 0.42% [kernel] [k] worker_thread 0.41% [kernel] [k] schedule 0.41% [kernel] [k] _raw_spin_unlock_irqrestore 0.40% [kernel] [k] netif_skb_features 0.40% [kernel] [k] queue_work_on 0.40% [kernel] [k] pfifo_fast_enqueue 0.39% [kernel] [k] vlan_dev_hard_start_xmit 0.39% [kernel] [k] page_frag_free 0.36% [kernel] [k] swiotlb_map_page 0.36% [kernel] [k] update_cfs_rq_h_load 0.35% [kernel] [k] validate_xmit_skb.isra.142 0.35% [kernel] [k] dev_ifconf 0.35% [kernel] [k] check_preempt_curr 0.34% [kernel] [k] _raw_spin_trylock 0.34% [kernel] [k] rcu_idle_exit 0.33% [kernel] [k] ip_rcv_core.isra.20.constprop.25 0.33% [kernel] [k] __qdisc_run 0.33% [kernel] [k] skb_release_data 0.32% [kernel] [k] native_sched_clock 
0.30% [kernel] [k] add_interrupt_randomness 0.29% [kernel] [k] interrupt_entry 0.28% [kernel] [k] skb_gro_receive 0.26% [kernel] [k] read_tsc 0.26% [kernel] [k] __get_xps_queue_idx 0.26% [kernel] [k] inet_gifconf 0.26% [kernel] [k] skb_segment 0.25% [ker
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 07.11.2018 o 22:06, David Ahern pisze: On 11/3/18 6:24 PM, Paweł Staszewski wrote: Does your setup have any other device types besides physical ports with VLANs (e.g., any macvlans or bonds)? no. just phy(mlnx)->vlans only config VLAN and non-VLAN (and a mix) seem to work ok. Patches are here: https://github.com/dsahern/linux.git bpf/kernel-tables-wip I got lazy with the vlan exports; right now it requires 8021q to be builtin (CONFIG_VLAN_8021Q=y) You can use the xdp_fwd sample: make O=kbuild -C samples/bpf -j 8 Copy samples/bpf/xdp_fwd_kern.o and samples/bpf/xdp_fwd to the server and run: ./xdp_fwd e.g., in my testing I run: xdp_fwd eth1 eth2 eth3 eth4 All of the relevant forwarding ports need to be on the same command line. This version populates a second map to verify the egress port has XDP enabled. Installed today on some lab server with mellanox connectx4 And trying some simple static routing first - but after enabling xdp program - receiver is not receiving frames Route table is simple as possible for tests :) icmp ping test send from 192.168.22.237 to 172.16.0.2 - incomming packets on vlan 4081 ip r default via 192.168.22.236 dev vlan4081 172.16.0.0/30 dev vlan1740 proto kernel scope link src 172.16.0.1 192.168.22.0/24 dev vlan4081 proto kernel scope link src 192.168.22.205 neigh table: ip neigh ls 192.168.22.237 dev vlan4081 lladdr 00:25:90:fb:a6:8d REACHABLE 172.16.0.2 dev vlan1740 lladdr ac:1f:6b:2c:2e:5a REACHABLE and interfaces: 4: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff 5: enp175s0f1: mtu 1500 xdp/id:5 qdisc mq state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff inet 192.168.22.205/24 scope global vlan4081 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c890/64 scope link valid_lft forever preferred_lft forever 7: vlan1740@enp175s0f1: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff inet 172.16.0.1/30 scope global vlan1740 valid_lft forever preferred_lft forever inet6 fe80::ae1f:6bff:fe07:c891/64 scope link valid_lft forever preferred_lft forever xdp program detached: Receiving side tcpdump: 14:28:09.141233 IP 192.168.22.237 > 172.16.0.2: ICMP echo request, id 30227, seq 487, length 64 I can see icmp requests enabling xdp ./xdp_fwd enp175s0f1 enp175s0f0 4: enp175s0f0: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 5: enp175s0f1: mtu 1500 xdp qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff prog/xdp id 5 tag 3c231ff1e5e77f3f 6: vlan4081@enp175s0f0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff 7: vlan1740@enp175s0f1: mtu 
1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff Receiving side: no ICMP echo requests incoming on the interface. And some ethtool stats for the XDP interface that receives the ICMP requests from the sender to be forwarded: ethtool -S enp175s0f0 | grep 'rx_xdp_redirect' rx_xdp_redirect: 321 ethtool stats for the interface that should forward the ICMP requests to the receiver on vlan id 1740: ethtool -S enp175s0f1 | grep 'tx_xdp' tx_xdp_xmit: 0 tx_xdp_full: 0 tx_xdp_err: 0 tx_xdp_cqes: 0 No frames tx-ed. And today, again after applying the patch for the page allocator, reached 64/64 Gbit/s with only 50-60% cpu load. You should see the cpu load drop considerably.
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 05.11.2018 o 21:17, Jesper Dangaard Brouer pisze: On Sun, 4 Nov 2018 01:24:03 +0100 Paweł Staszewski wrote: And today again after allpy patch for page allocator - reached again 64/64 Gbit/s with only 50-60% cpu load Great. today no slowpath hit for netwoking :) But again dropped pckt at 64GbitRX and 64TX And as it should not be pcie express limit -i think something more is Well, this does sounds like a PCIe bandwidth limit to me. See the PCIe BW here: https://en.wikipedia.org/wiki/PCI_Express You likely have PCIe v3, where 1-lane have 984.6 MBytes/s or 7.87 Gbit/s Thus, x16-lanes have 15.75 GBytes or 126 Gbit/s. It does say "in each direction", but you are also forwarding this RX->TX on both (dual) ports NIC that is sharing the same PCIe slot. Network controller changed from 2-port 100G connectx4 to 2 separate cards 100G connectx5 PerfTop: 92239 irqs/sec kernel:99.4% exact: 0.0% [4000Hz cycles], (all, 56 CPUs) --- 6.65% [kernel] [k] irq_entries_start 5.57% [kernel] [k] tasklet_action_common.isra.21 4.60% [kernel] [k] mlx5_eq_int 4.04% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear 3.66% [kernel] [k] _raw_spin_lock_irqsave 3.58% [kernel] [k] mlx5e_sq_xmit 2.66% [kernel] [k] fib_table_lookup 2.52% [kernel] [k] _raw_spin_lock 2.51% [kernel] [k] build_skb 2.50% [kernel] [k] _raw_spin_lock_irq 2.04% [kernel] [k] try_to_wake_up 1.83% [kernel] [k] queued_spin_lock_slowpath 1.81% [kernel] [k] mlx5e_poll_tx_cq 1.65% [kernel] [k] do_idle 1.50% [kernel] [k] mlx5e_poll_rx_cq 1.34% [kernel] [k] __sched_text_start 1.32% [kernel] [k] cmd_exec 1.30% [kernel] [k] cmd_work_handler 1.16% [kernel] [k] vlan_do_receive 1.15% [kernel] [k] memcpy_erms 1.15% [kernel] [k] __dev_queue_xmit 1.07% [kernel] [k] mlx5_cmd_comp_handler 1.06% [kernel] [k] sched_ttwu_pending 1.00% [kernel] [k] ipt_do_table 0.98% [kernel] [k] ip_finish_output2 0.92% [kernel] [k] pfifo_fast_dequeue 0.88% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq 0.78% [kernel] [k] dev_gro_receive 0.78% [kernel] [k] mlx5e_napi_poll 0.76% [kernel] [k] mlx5e_post_rx_mpwqes 0.70% [kernel] [k] process_one_work 0.67% [kernel] [k] __netif_receive_skb_core 0.65% [kernel] [k] __build_skb 0.63% [kernel] [k] llist_add_batch 0.62% [kernel] [k] tcp_gro_receive 0.60% [kernel] [k] inet_gro_receive 0.59% [kernel] [k] ip_route_input_rcu 0.59% [kernel] [k] rcu_irq_exit 0.56% [kernel] [k] napi_complete_done 0.52% [kernel] [k] kmem_cache_alloc 0.48% [kernel] [k] __softirqentry_text_start 0.48% [kernel] [k] mlx5e_xmit 0.47% [kernel] [k] __queue_work 0.46% [kernel] [k] memset_erms 0.46% [kernel] [k] dev_hard_start_xmit 0.45% [kernel] [k] insert_work 0.45% [kernel] [k] enqueue_task_fair 0.44% [kernel] [k] __wake_up_common 0.43% [kernel] [k] finish_task_switch 0.43% [kernel] [k] kmem_cache_free_bulk 0.42% [kernel] [k] ip_forward 0.42% [kernel] [k] worker_thread 0.41% [kernel] [k] schedule 0.41% [kernel] [k] _raw_spin_unlock_irqrestore 0.40% [kernel] [k] netif_skb_features 0.40% [kernel] [k] queue_work_on 0.40% [kernel] [k] pfifo_fast_enqueue 0.39% [kernel] [k] vlan_dev_hard_start_xmit 0.39% [kernel] [k] page_frag_free 0.36% [kernel] [k] swiotlb_map_page 0.36% [kernel] [k] update_cfs_rq_h_load 0.35% [kernel] [k] validate_xmit_skb.isra.142 0.35% [kernel] [k] dev_ifconf 0.35% [kernel] [k] check_preempt_curr 0.34% [kernel] [k] _raw_spin_trylock 0.34% [kernel] [k] rcu_idle_exit 0.33% [kernel] [k] ip_rcv_core.isra.20.constprop.25 0.33% [kernel] [k] __qdisc_run 0.33% [kernel] [k] skb_release_data 0.32% [kernel] [k] native_sched_clock 0.30% [kernel] [k] add_interrupt_randomness 0.29% 
[kernel] [k] interrupt_entry 0.28% [kernel] [k] skb_gro_receive 0.26% [kernel] [k] read_tsc 0.26% [kernel] [k] __get_xps_queue_idx 0.26% [kernel] [k] inet_gifconf 0.26% [kernel] [k] skb_segment 0.25% [kernel] [k] __tasklet_schedule_common 0
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 03.11.2018 at 18:32, David Ahern wrote: On 11/1/18 11:30 AM, Paweł Staszewski wrote: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/xdp_fwd_kern.c I can try some tests on the same hw but in a testlab configuration - will give it a try :) That version does not work with VLANs. I have patches for it but it needs a bit more work before sending out. Perhaps I can get back to it next week. Would be nice - next week I will be able to replace the network controller and install two separate 100Gbit NICs into two PCIe x16 slots - so I can test without hitting PCIe bandwidth limits. Does your setup have any other device types besides physical ports with VLANs (e.g., any macvlans or bonds)? No, just a phy(mlnx)->vlans only config. And today, again after applying the patch for the page allocator, reached 64/64 Gbit/s with only 50-60% cpu load. Today no slowpath hit for networking :) But again dropped packets at 64 Gbit RX and 64 Gbit TX. And as it should not be a PCIe limit, I think something more is going on there - and it is hard to catch, because perf top doesn't change, apart from the queued slowpath hit being gone now. I also ordered Intel cards to compare - but 3 weeks ETA. Faster - in 3 days - I will have Mellanox ConnectX-5, so I can separate the traffic onto two different x16 PCIe buses.
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 03.11.2018 o 16:23, Paweł Staszewski pisze: W dniu 03.11.2018 o 13:58, Jesper Dangaard Brouer pisze: On Sat, 3 Nov 2018 01:16:08 +0100 Paweł Staszewski wrote: W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze: W dniu 02.11.2018 o 15:20, Aaron Lu pisze: On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote: On Fri, 2 Nov 2018 13:23:56 +0800 Aaron Lu wrote: On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote: On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote: On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote: ... ... [...] TL;DR: this is order-0 pages (code-walk trough proof below) To Aaron, the network stack *can* call __free_pages_ok() with order-0 pages, via: [...] I think here is a problem - order 0 pages are freed directly to buddy, bypassing per-cpu-pages. This might be the reason lock contention appeared on free path. Can someone apply below diff and see if lock contention is gone? Will test it tonight Patch applied perf report: https://ufile.io/sytfh But i need to wait also with more traffic currently cpu's are sleeping Well, that would be the expected result, that the CPUs get more time to sleep, if the lock contention is gone... What is the measured bandwidth now? 30 RX /30 TX Gbit/s Notice, you might still be limited by the PCIe bandwidth, but then your CPUs might actually decide to sleep, as they are getting data fast enough. Yes - i will replace network controller to two separate nic's in two separate x16 pcie But after monday. But i dont think i hit pcie limit there - it looks like pcie x16 gen3 have 16GB/s RX and 16GB/s TX so bidirectional Was thinking that maybee memory limit - but also there is 4 channel DDR4 2666MHz - so total bandwidth for memory is bigger (48GB/s) than needed for 100Gbit ethernet [...] diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e2ef1c17942f..65c0ae13215a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); - if (unlikely(put_page_testzero(page))) - __free_pages_ok(page, compound_order(page)); + if (unlikely(put_page_testzero(page))) { + unsigned int order = compound_order(page); + + if (order == 0) + free_unref_page(page); + else + __free_pages_ok(page, order); + }
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 03.11.2018 o 13:58, Jesper Dangaard Brouer pisze: On Sat, 3 Nov 2018 01:16:08 +0100 Paweł Staszewski wrote: W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze: W dniu 02.11.2018 o 15:20, Aaron Lu pisze: On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote: On Fri, 2 Nov 2018 13:23:56 +0800 Aaron Lu wrote: On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote: On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote: On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote: ... ... [...] TL;DR: this is order-0 pages (code-walk trough proof below) To Aaron, the network stack *can* call __free_pages_ok() with order-0 pages, via: [...] I think here is a problem - order 0 pages are freed directly to buddy, bypassing per-cpu-pages. This might be the reason lock contention appeared on free path. Can someone apply below diff and see if lock contention is gone? Will test it tonight Patch applied perf report: https://ufile.io/sytfh But i need to wait also with more traffic currently cpu's are sleeping Well, that would be the expected result, that the CPUs get more time to sleep, if the lock contention is gone... What is the measured bandwidth now? 30 RX /30 TX Gbit/s Notice, you might still be limited by the PCIe bandwidth, but then your CPUs might actually decide to sleep, as they are getting data fast enough. Yes - i will replace network controller to two separate nic's in two separate x16 pcie But after monday. But i dont think i hit pcie limit there - it looks like pcie x16 gen3 have 16GB/s RX and 16GB/s TX so bidirectional [...] diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e2ef1c17942f..65c0ae13215a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); - if (unlikely(put_page_testzero(page))) - __free_pages_ok(page, compound_order(page)); + if (unlikely(put_page_testzero(page))) { + unsigned int order = compound_order(page); + + if (order == 0) + free_unref_page(page); + else + __free_pages_ok(page, order); + }
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 03.11.2018 o 01:16, Paweł Staszewski pisze: W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze: W dniu 02.11.2018 o 15:20, Aaron Lu pisze: On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote: On Fri, 2 Nov 2018 13:23:56 +0800 Aaron Lu wrote: On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote: On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote: On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote: ... ... Section copied out: mlx5e_poll_tx_cq | --16.34%--napi_consume_skb | |--12.65%--__free_pages_ok | | | --11.86%--free_one_page | | | |--10.10% --queued_spin_lock_slowpath | | | --0.65%--_raw_spin_lock This callchain looks like it is freeing higher order pages than order 0: __free_pages_ok is only called for pages whose order are bigger than 0. mlx5 rx uses only order 0 pages, so i don't know where these high order tx SKBs are coming from.. Perhaps here: __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and __napi_alloc_frag() will all call page_frag_alloc(), which will use __page_frag_cache_refill() to get an order 3 page if possible, or fall back to an order 0 page if order 3 page is not available. I'm not sure if your workload will use the above code path though. TL;DR: this is order-0 pages (code-walk trough proof below) To Aaron, the network stack *can* call __free_pages_ok() with order-0 pages, via: static void skb_free_head(struct sk_buff *skb) { unsigned char *head = skb->head; if (skb->head_frag) skb_free_frag(head); else kfree(head); } static inline void skb_free_frag(void *addr) { page_frag_free(addr); } /* * Frees a page fragment allocated out of either a compound or order 0 page. */ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); if (unlikely(put_page_testzero(page))) __free_pages_ok(page, compound_order(page)); } EXPORT_SYMBOL(page_frag_free); I think here is a problem - order 0 pages are freed directly to buddy, bypassing per-cpu-pages. This might be the reason lock contention appeared on free path. Can someone apply below diff and see if lock contention is gone? Will test it tonight Patch applied perf report: https://ufile.io/sytfh But i need to wait also with more traffic currently cpu's are sleeping before patch: | | | | |--13.55%--mlx5e_poll_tx_cq | | | | | | | | | | | --10.32%--napi_consume_skb | | | | | | | | | | | |--8.52%--__free_pages_ok | | | | | | | | | | | | | --7.67%--free_one_page | | | | | | | | | | | | | |--6.05%--queued_spin_lock_slowpath | | | | | | | | | | | | | --0.64%--_raw_spin_lock | | | | | | | | | | | |--0.77%--skb_release_data | | | | | | | | | | | --0.72%--page_frag_free after patch: | | | | | |--3.75%--mlx5e_poll_tx_cq | | | | | | | | | | | | | --1.53%--napi_consume_skb | | | | | | | | | | | | | --0.54%--skb_release_data | | | | | | | | | | | --3.09
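To make the "bypassing per-cpu-pages" point a bit more concrete, here is a tiny userspace model (my own sketch, not mm code) of the difference between the two free paths: the per-cpu-pages path only takes the shared lock once per batch, while the direct __free_pages_ok() -> free_one_page() path takes zone->lock for every single page, which is what shows up as queued_spin_lock_slowpath under napi_consume_skb in the profiles above.

/* Userspace model only (not kernel code): contrast the per-cpu-pages
 * free path (free_unref_page) with the direct buddy path
 * (__free_pages_ok -> free_one_page).  The shared mutex stands in for
 * zone->lock; the point is how often it has to be taken per freed page. */
#include <pthread.h>
#include <stdio.h>

#define PCP_BATCH 31        /* assumed batch size, for illustration only */

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long lock_acquisitions;

static void free_to_buddy(int nr_pages)     /* stands in for free_one_page */
{
        pthread_mutex_lock(&zone_lock);
        lock_acquisitions++;
        pthread_mutex_unlock(&zone_lock);
        (void)nr_pages;
}

static __thread int pcp_count;              /* per-CPU cached pages        */

static void free_unref_page_model(void)     /* per-cpu-pages path          */
{
        if (++pcp_count < PCP_BATCH)
                return;                     /* no shared lock taken        */
        free_to_buddy(pcp_count);           /* one acquisition per batch   */
        pcp_count = 0;
}

static void free_pages_ok_model(void)       /* direct buddy path           */
{
        free_to_buddy(1);                   /* lock taken for every page   */
}

int main(void)
{
        int i;

        for (i = 0; i < 1000000; i++)
                free_unref_page_model();
        printf("per-cpu path: %lu lock acquisitions / 1M frees\n",
               lock_acquisitions);

        lock_acquisitions = 0;
        for (i = 0; i < 1000000; i++)
                free_pages_ok_model();
        printf("direct path : %lu lock acquisitions / 1M frees\n",
               lock_acquisitions);
        return 0;
}

With dozens of CPUs each freeing order-0 skb heads at high packet rates, that difference in shared-lock traffic adds up quickly.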
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 01.11.2018 o 21:37, Saeed Mahameed pisze: On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote: W dniu 01.11.2018 o 10:50, Saeed Mahameed pisze: On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx TxTotal = = enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s --- --- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx TxTotal = = enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s --- --- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Where do you see 40 Gb/s ? you showed that both ports on the same NIC ( same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25 Gb/s which aligns with your pcie link limit, what am i missing ? hmm yes that was my concern also - cause cant find anywhere informations about that bandwidth is uni or bidirectional - so if 126Gbit for x16 8GT is unidir - then bidir will be 126/2 ~68Gbit - which will fit total bw on both ports i think it is bidir So yes - we are hitting there other problem i think pcie is most probabbly bidirectional max bw 126Gbit so RX 126Gbit and at same time TX should be 126Gbit This can explain maybee also why cpuload is rising rapidly from 120Gbit/s in total to 132Gbit (counters of bwmng are from /proc/net - so there can be some error in reading them when offloading (gro/gso/tso) on nic's is enabled that is why Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) Are you forwarding when using pktgen as well or you just testing the RX side pps ? Yes pktgen was tested on single port RX Can check also forwarding to eliminate pciex limits So this explains why you have more RX pps, since tx is idle and pcie will be free to do only rx. [...] 
ethtool -S enp175s0f1 NIC statistics: rx_packets: 173730800927 rx_bytes: 99827422751332 tx_packets: 142532009512 tx_bytes: 184633045911222 tx_tso_packets: 25989113891 tx_tso_bytes: 132933363384458 tx_tso_inner_packets: 0 tx_tso_inner_bytes: 0 tx_added_vlan_packets: 74630239613 tx_nop: 2029817748 rx_lro_packets: 0 rx_lro_bytes: 0 rx_ecn_mark: 0 rx_removed_vlan_packets: 173730800927 rx_csum_unnecessary: 0 rx_csum_none: 434357 rx_csum_complete: 173730366570 rx_csum_unnecessary_inner: 0 rx_xdp_drop: 0 rx_xdp_redirect: 0 rx_xdp_tx_xmit: 0 rx_xdp_tx_full: 0 rx_xdp_tx_err: 0 rx_xdp_tx_cqe: 0 tx_csum_none: 38260960853 tx_csum_partial: 36369278774 tx_csum_partial_inner: 0 tx_queue_stopped: 1 tx_queue_dropped: 0 tx_xmit_more: 748638099 tx_recover: 0 tx_cqes: 73881645031 tx_queue_wake: 1 tx_udp_seg_rem: 0 tx_cqe_err: 0 tx_xdp_xmit: 0 tx_xdp_full: 0 tx_xdp_err: 0 tx_xdp_cqes: 0 rx_wqe_err: 0 rx_mpwqe_filler_cqes: 0 rx_mpwqe_filler_strides: 0 rx_buff_alloc_err: 0 rx_cqe_compress_blks: 0 rx_cqe_compress_pkts: 0 If this is a pcie bottleneck it might be useful to enable CQE compression (to reduce PCIe completion descriptors transactions) you s
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 02.11.2018 o 20:02, Paweł Staszewski pisze: W dniu 02.11.2018 o 15:20, Aaron Lu pisze: On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote: On Fri, 2 Nov 2018 13:23:56 +0800 Aaron Lu wrote: On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote: On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote: On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote: ... ... Section copied out: mlx5e_poll_tx_cq | --16.34%--napi_consume_skb | |--12.65%--__free_pages_ok | | | --11.86%--free_one_page | | | |--10.10% --queued_spin_lock_slowpath | | | --0.65%--_raw_spin_lock This callchain looks like it is freeing higher order pages than order 0: __free_pages_ok is only called for pages whose order are bigger than 0. mlx5 rx uses only order 0 pages, so i don't know where these high order tx SKBs are coming from.. Perhaps here: __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and __napi_alloc_frag() will all call page_frag_alloc(), which will use __page_frag_cache_refill() to get an order 3 page if possible, or fall back to an order 0 page if order 3 page is not available. I'm not sure if your workload will use the above code path though. TL;DR: this is order-0 pages (code-walk trough proof below) To Aaron, the network stack *can* call __free_pages_ok() with order-0 pages, via: static void skb_free_head(struct sk_buff *skb) { unsigned char *head = skb->head; if (skb->head_frag) skb_free_frag(head); else kfree(head); } static inline void skb_free_frag(void *addr) { page_frag_free(addr); } /* * Frees a page fragment allocated out of either a compound or order 0 page. */ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); if (unlikely(put_page_testzero(page))) __free_pages_ok(page, compound_order(page)); } EXPORT_SYMBOL(page_frag_free); I think here is a problem - order 0 pages are freed directly to buddy, bypassing per-cpu-pages. This might be the reason lock contention appeared on free path. Can someone apply below diff and see if lock contention is gone? Will test it tonight Patch applied perf report: https://ufile.io/sytfh But i need to wait also with more traffic currently cpu's are sleeping diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e2ef1c17942f..65c0ae13215a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); - if (unlikely(put_page_testzero(page))) - __free_pages_ok(page, compound_order(page)); + if (unlikely(put_page_testzero(page))) { + unsigned int order = compound_order(page); + + if (order == 0) + free_unref_page(page); + else + __free_pages_ok(page, order); + } } EXPORT_SYMBOL(page_frag_free); Notice for the mlx5 driver it support several RX-memory models, so it can be hard to follow, but from the perf report output we can see that is uses mlx5e_skb_from_cqe_linear, which use build_skb. 
--13.63%--mlx5e_skb_from_cqe_linear | --5.02%--build_skb | --1.85%--__build_skb | --1.00%--kmem_cache_alloc /* build_skb() is wrapper over __build_skb(), that specifically * takes care of skb->head and skb->pfmemalloc * This means that if @frag_size is not zero, then @data must be backed * by a page fragment, not kmalloc() or vmalloc() */ struct sk_buff *build_skb(void *data, unsigned int frag_size) { struct sk_buff *skb = __build_skb(data, frag_size); if (skb && frag_size) { skb->head_frag = 1; if (page_is_pfmemalloc(virt_to_head_page(data))) skb->pfmemalloc = 1; } return skb; } EXPORT_SYMBOL(build_skb); It still doesn't prove, that the @data is backed by by a order-0 page. For the mlx5 driver is uses mlx5e_page_alloc_mapped -> page_pool_dev_alloc_pages(), and I can see perf report using __page_pool_alloc_pages_slow(). The setup for page_pool in mlx5 uses order=0. /* Create a page_pool and register it with rxq */ pp_params.order = 0; pp_params.flags = 0; /* No-internal DMA mapping in page_pool */ pp_params.pool_size = pool_size; pp_params.nid = cpu_to_node(c->cpu); pp_params.dev = c->pdev; pp_params.dma_dir = rq->buff.map_dir; /* page_pool can be used even when there is no rq->xdp_prog, * given page_pool does not handle DMA mapping there is no * required state to clear. And page_pool gracefully handle * elevated refcnt. */ rq->page_p
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 02.11.2018 o 15:20, Aaron Lu pisze: On Fri, Nov 02, 2018 at 12:40:37PM +0100, Jesper Dangaard Brouer wrote: On Fri, 2 Nov 2018 13:23:56 +0800 Aaron Lu wrote: On Thu, Nov 01, 2018 at 08:23:19PM +, Saeed Mahameed wrote: On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote: On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer wrote: ... ... Section copied out: mlx5e_poll_tx_cq | --16.34%--napi_consume_skb | |--12.65%--__free_pages_ok | | | --11.86%--free_one_page | | | |--10.10% --queued_spin_lock_slowpath | | | --0.65%--_raw_spin_lock This callchain looks like it is freeing higher order pages than order 0: __free_pages_ok is only called for pages whose order are bigger than 0. mlx5 rx uses only order 0 pages, so i don't know where these high order tx SKBs are coming from.. Perhaps here: __netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and __napi_alloc_frag() will all call page_frag_alloc(), which will use __page_frag_cache_refill() to get an order 3 page if possible, or fall back to an order 0 page if order 3 page is not available. I'm not sure if your workload will use the above code path though. TL;DR: this is order-0 pages (code-walk trough proof below) To Aaron, the network stack *can* call __free_pages_ok() with order-0 pages, via: static void skb_free_head(struct sk_buff *skb) { unsigned char *head = skb->head; if (skb->head_frag) skb_free_frag(head); else kfree(head); } static inline void skb_free_frag(void *addr) { page_frag_free(addr); } /* * Frees a page fragment allocated out of either a compound or order 0 page. */ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); if (unlikely(put_page_testzero(page))) __free_pages_ok(page, compound_order(page)); } EXPORT_SYMBOL(page_frag_free); I think here is a problem - order 0 pages are freed directly to buddy, bypassing per-cpu-pages. This might be the reason lock contention appeared on free path. Can someone apply below diff and see if lock contention is gone? Will test it tonight diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e2ef1c17942f..65c0ae13215a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4554,8 +4554,14 @@ void page_frag_free(void *addr) { struct page *page = virt_to_head_page(addr); - if (unlikely(put_page_testzero(page))) - __free_pages_ok(page, compound_order(page)); + if (unlikely(put_page_testzero(page))) { + unsigned int order = compound_order(page); + + if (order == 0) + free_unref_page(page); + else + __free_pages_ok(page, order); + } } EXPORT_SYMBOL(page_frag_free); Notice for the mlx5 driver it support several RX-memory models, so it can be hard to follow, but from the perf report output we can see that is uses mlx5e_skb_from_cqe_linear, which use build_skb. --13.63%--mlx5e_skb_from_cqe_linear | --5.02%--build_skb | --1.85%--__build_skb | --1.00%--kmem_cache_alloc /* build_skb() is wrapper over __build_skb(), that specifically * takes care of skb->head and skb->pfmemalloc * This means that if @frag_size is not zero, then @data must be backed * by a page fragment, not kmalloc() or vmalloc() */ struct sk_buff *build_skb(void *data, unsigned int frag_size) { struct sk_buff *skb = __build_skb(data, frag_size); if (skb && frag_size) { skb->head_frag = 1; if (page_is_pfmemalloc(virt_to_head_page(data))) skb->pfmemalloc = 1; } return skb; } EXPORT_SYMBOL(build_skb); It still doesn't prove, that the @data is backed by by a order-0 page. 
For the mlx5 driver is uses mlx5e_page_alloc_mapped -> page_pool_dev_alloc_pages(), and I can see perf report using __page_pool_alloc_pages_slow(). The setup for page_pool in mlx5 uses order=0. /* Create a page_pool and register it with rxq */ pp_params.order = 0; pp_params.flags = 0; /* No-internal DMA mapping in page_pool */ pp_params.pool_size = pool_size; pp_params.nid = cpu_to_node(c->cpu); pp_params.dev = c->pdev; pp_params.dma_dir = rq->buff.map_dir; /* page_pool can be used even when there is no rq->xdp_prog, * given page_pool does not handle DMA mapping there is no * required state to clear. And page_pool gracefully handle * elevated refcnt. */ rq->page_pool = page_pool_create(&pp_params
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 01.11.2018 o 22:24, Paweł Staszewski pisze: W dniu 01.11.2018 o 22:18, Paweł Staszewski pisze: W dniu 01.11.2018 o 21:37, Saeed Mahameed pisze: On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote: W dniu 01.11.2018 o 10:50, Saeed Mahameed pisze: On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total = = enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s --- --- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total = = enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s --- --- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Where do you see 40 Gb/s ? you showed that both ports on the same NIC ( same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25 Gb/s which aligns with your pcie link limit, what am i missing ? hmm yes that was my concern also - cause cant find anywhere informations about that bandwidth is uni or bidirectional - so if 126Gbit for x16 8GT is unidir - then bidir will be 126/2 ~68Gbit - which will fit total bw on both ports i think it is bidir This can explain maybee also why cpuload is rising rapidly from 120Gbit/s in total to 132Gbit (counters of bwmng are from /proc/net - so there can be some error in reading them when offloading (gro/gso/tso) on nic's is enabled that is why Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) Are you forwarding when using pktgen as well or you just testing the RX side pps ? Yes pktgen was tested on single port RX Can check also forwarding to eliminate pciex limits So this explains why you have more RX pps, since tx is idle and pcie will be free to do only rx. [...] 
ethtool -S enp175s0f1 NIC statistics: rx_packets: 173730800927 rx_bytes: 99827422751332 tx_packets: 142532009512 tx_bytes: 184633045911222 tx_tso_packets: 25989113891 tx_tso_bytes: 132933363384458 tx_tso_inner_packets: 0 tx_tso_inner_bytes: 0 tx_added_vlan_packets: 74630239613 tx_nop: 2029817748 rx_lro_packets: 0 rx_lro_bytes: 0 rx_ecn_mark: 0 rx_removed_vlan_packets: 173730800927 rx_csum_unnecessary: 0 rx_csum_none: 434357 rx_csum_complete: 173730366570 rx_csum_unnecessary_inner: 0 rx_xdp_drop: 0 rx_xdp_redirect: 0 rx_xdp_tx_xmit: 0 rx_xdp_tx_full: 0 rx_xdp_tx_err: 0 rx_xdp_tx_cqe: 0 tx_csum_none: 38260960853 tx_csum_partial: 36369278774 tx_csum_partial_inner: 0 tx_queue_stopped: 1 tx_queue_dropped: 0 tx_xmit_more: 748638099 tx_recover: 0 tx_cqes: 73881645031 tx_queue_wake: 1 tx_udp_seg_rem: 0 tx_cqe_err: 0 tx_xdp_xmit: 0 tx_xdp_full: 0 tx_xdp_err: 0 tx_xdp_cqes: 0 rx_wqe_err: 0 rx_mpwqe_filler_cqes: 0 rx_mpwqe_filler_strides: 0 rx_buff_alloc_err: 0 rx_cqe_compress_blks: 0 rx_cqe_compress_pkts: 0 If this is a pcie bottleneck it might be useful to enable CQE compression (to reduce PCIe completion descriptors transactions) you should see the above rx_cqe_compress_pkts increasing when enabled. $ ethtool --set-priv-flags enp175s0f1
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 01.11.2018 o 22:18, Paweł Staszewski pisze: W dniu 01.11.2018 o 21:37, Saeed Mahameed pisze: On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote: W dniu 01.11.2018 o 10:50, Saeed Mahameed pisze: On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total = = enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s --- --- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total = = enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s --- --- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Where do you see 40 Gb/s ? you showed that both ports on the same NIC ( same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25 Gb/s which aligns with your pcie link limit, what am i missing ? hmm yes that was my concern also - cause cant find anywhere informations about that bandwidth is uni or bidirectional - so if 126Gbit for x16 8GT is unidir - then bidir will be 126/2 ~68Gbit - which will fit total bw on both ports i think it is bidir This can explain maybee also why cpuload is rising rapidly from 120Gbit/s in total to 132Gbit (counters of bwmng are from /proc/net - so there can be some error in reading them when offloading (gro/gso/tso) on nic's is enabled that is why Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) Are you forwarding when using pktgen as well or you just testing the RX side pps ? Yes pktgen was tested on single port RX Can check also forwarding to eliminate pciex limits So this explains why you have more RX pps, since tx is idle and pcie will be free to do only rx. [...] 
ethtool -S enp175s0f1 NIC statistics: rx_packets: 173730800927 rx_bytes: 99827422751332 tx_packets: 142532009512 tx_bytes: 184633045911222 tx_tso_packets: 25989113891 tx_tso_bytes: 132933363384458 tx_tso_inner_packets: 0 tx_tso_inner_bytes: 0 tx_added_vlan_packets: 74630239613 tx_nop: 2029817748 rx_lro_packets: 0 rx_lro_bytes: 0 rx_ecn_mark: 0 rx_removed_vlan_packets: 173730800927 rx_csum_unnecessary: 0 rx_csum_none: 434357 rx_csum_complete: 173730366570 rx_csum_unnecessary_inner: 0 rx_xdp_drop: 0 rx_xdp_redirect: 0 rx_xdp_tx_xmit: 0 rx_xdp_tx_full: 0 rx_xdp_tx_err: 0 rx_xdp_tx_cqe: 0 tx_csum_none: 38260960853 tx_csum_partial: 36369278774 tx_csum_partial_inner: 0 tx_queue_stopped: 1 tx_queue_dropped: 0 tx_xmit_more: 748638099 tx_recover: 0 tx_cqes: 73881645031 tx_queue_wake: 1 tx_udp_seg_rem: 0 tx_cqe_err: 0 tx_xdp_xmit: 0 tx_xdp_full: 0 tx_xdp_err: 0 tx_xdp_cqes: 0 rx_wqe_err: 0 rx_mpwqe_filler_cqes: 0 rx_mpwqe_filler_strides: 0 rx_buff_alloc_err: 0 rx_cqe_compress_blks: 0 rx_cqe_compress_pkts: 0 If this is a pcie bottleneck it might be useful to enable CQE compression (to reduce PCIe completion descriptors transactions) you should see the above rx_cqe_compress_pkts increasing when enabled. $ ethtool --set-priv-flags enp175s0f1 rx_cqe
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 01.11.2018 at 18:23, David Ahern wrote:
On 11/1/18 7:52 AM, Paweł Staszewski wrote:
On 01.11.2018 at 11:55, Jesper Dangaard Brouer wrote:
On Wed, 31 Oct 2018 21:37:16 -0600 David Ahern wrote:

This is mainly a forwarding use case? Seems so based on the perf report. I suspect forwarding with XDP would show a pretty good improvement.

Yes, significant performance improvements. Notice David's talk: "Leveraging Kernel Tables with XDP" http://vger.kernel.org/lpc-networking2018.html#session-1

It will be really interesting.

It's pushing the exact use case you have: FRR manages the FIB, XDP programs get access to updates as they happen for fast-path forwarding.

Can't wait then :)

It looks like you are doing "pure" IP routing, without any iptables conntrack stuff (judging from your perf report data). That will actually be a really good use case for accelerating this with XDP.

Yes, pure IP routing; iptables is used only for some local input filtering.

I want you to understand the philosophy behind how David and I want people to leverage XDP. Think of XDP as a software offload layer for the kernel network stack. Set up and use the Linux kernel network stack, but accelerate parts of it with XDP, e.g. the route FIB lookup. Sample code available here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/xdp_fwd_kern.c

I can try some tests on the same hardware but with a testlab configuration - will give it a try :)

That version does not work with VLANs. I have patches for it but it needs a bit more work before sending out. Perhaps I can get back to it next week.

That will be nice - next week I will be able to replace the network controller and install two separate 100Gbit NICs into two PCIe x16 slots, so I can test without hitting PCIe bandwidth limits.
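For anyone who wants to try that sample, the rough workflow is sketched below; the build target and the attach syntax are assumptions that differ between kernel versions, so treat it as illustrative rather than exact:

  # Build the BPF samples in the kernel tree, then attach the FIB-based forwarder
  cd samples/bpf && make
  ./xdp_fwd enp175s0f0 enp175s0f1   # forwarding decisions come from the kernel FIB that FRR populates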
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 01.11.2018 o 12:09, Paweł Staszewski pisze: rx_cqe_compress_pkts: 0 If this is a pcie bottleneck it might be useful to enable CQE compression (to reduce PCIe completion descriptors transactions) you should see the above rx_cqe_compress_pkts increasing when enabled. $ ethtool --set-priv-flags enp175s0f1 rx_cqe_compress on $ ethtool --show-priv-flags enp175s0f1 Private flags for p6p1: rx_cqe_moder : on cqe_moder : off rx_cqe_compress : on ... try this on both interfaces. Done ethtool --show-priv-flags enp175s0f1 Private flags for enp175s0f1: rx_cqe_moder : on tx_cqe_moder : off rx_cqe_compress : on rx_striding_rq : off rx_no_csum_complete: off ethtool --show-priv-flags enp175s0f0 Private flags for enp175s0f0: rx_cqe_moder : on tx_cqe_moder : off rx_cqe_compress : on rx_striding_rq : off rx_no_csum_complete: off Enabling cqe compress changes nothing after reaching 64Gbit RX / 64Gbit/s TX on interfaces cpu's are saturated at 100% ethtool -S enp175s0f1 | grep rx_cqe_compress rx_cqe_compress_blks: 5657836379 rx_cqe_compress_pkts: 13153761080 ethtool -S enp175s0f0 | grep rx_cqe_compress rx_cqe_compress_blks: 5994612500 rx_cqe_compress_pkts: 13579014869 bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0f1: 27.03 Gb/s 37.09 Gb/s 64.12 Gb/s enp175s0f0: 36.84 Gb/s 26.82 Gb/s 63.66 Gb/s -- total: 63.85 Gb/s 63.87 Gb/s 127.72 Gb/s bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate / iface Rx Tx Total == enp175s0f1: 3.22 GB/s 4.26 GB/s 7.48 GB/s enp175s0f0: 4.24 GB/s 3.21 GB/s 7.45 GB/s -- total: 7.46 GB/s 7.47 GB/s 14.93 GB/s mpstat Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle Average: all 0.05 0.00 0.19 0.02 0.00 42.74 0.00 0.00 0.00 56.99 Average: 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 1 0.00 0.00 0.30 0.00 0.00 0.00 0.00 0.00 0.00 99.70 Average: 2 0.00 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 99.80 Average: 3 0.00 0.00 0.20 1.20 0.00 0.00 0.00 0.00 0.00 98.60 Average: 4 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.90 Average: 5 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 99.90 Average: 6 0.10 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 99.70 Average: 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 12 1.40 0.00 4.50 0.00 0.00 0.00 0.00 0.00 0.00 94.10 Average: 13 0.00 0.00 1.60 0.00 0.00 0.00 0.00 0.00 0.00 98.40 Average: 14 0.00 0.00 0.00 0.00 0.00 84.10 0.00 0.00 0.00 15.90 Average: 15 0.00 0.00 0.10 0.00 0.00 93.70 0.00 0.00 0.00 6.20 Average: 16 0.00 0.00 0.10 0.00 0.00 94.31 0.00 0.00 0.00 5.59 Average: 17 0.00 0.00 0.00 0.00 0.00 95.30 0.00 0.00 0.00 4.70 Average: 18 0.00 0.00 0.00 0.00 0.00 62.80 0.00 0.00 0.00 37.20 Average: 19 0.00 0.00 0.10 0.00 0.00 98.90 0.00 0.00 0.00 1.00 Average: 20 0.00 0.00 0.00 0.00 0.00 99.30 0.00 0.00 0.00 0.70 Average: 21 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 Average: 22 0.00 0.00 0.00 0.00 0.00 99.90 0.00 0.00 0.00 0.10 Average: 23 0.00 0.00 0.10 0.00 0.00 99.90 0.00 0.00 0.00 0.00 Average: 24 0.00 0.00 0.10 0.00 0.00 97.10 0.00 0.00 0.00 2.80 Average: 2
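A quick way to check whether the RSS spread really matches the saturated cores is to watch the per-channel counters; the rxN_packets naming below is the usual mlx5 convention, so adjust it if this driver version reports different names:

  # Per-queue packet counters; if only some of them grow, the hash is not spreading flows evenly
  watch -d -n1 "ethtool -S enp175s0f1 | grep -E '^ *rx[0-9]+_packets'"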
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
On 01.11.2018 at 11:55, Jesper Dangaard Brouer wrote:
On Wed, 31 Oct 2018 21:37:16 -0600 David Ahern wrote:

This is mainly a forwarding use case? Seems so based on the perf report. I suspect forwarding with XDP would show a pretty good improvement.

Yes, significant performance improvements. Notice David's talk: "Leveraging Kernel Tables with XDP" http://vger.kernel.org/lpc-networking2018.html#session-1

It will be really interesting.

It looks like you are doing "pure" IP routing, without any iptables conntrack stuff (judging from your perf report data). That will actually be a really good use case for accelerating this with XDP.

Yes, pure IP routing; iptables is used only for some local input filtering.

I want you to understand the philosophy behind how David and I want people to leverage XDP. Think of XDP as a software offload layer for the kernel network stack. Set up and use the Linux kernel network stack, but accelerate parts of it with XDP, e.g. the route FIB lookup. Sample code available here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/xdp_fwd_kern.c

I can try some tests on the same hardware but with a testlab configuration - will give it a try :)

(I do warn that we just found a bug/crash in setup+teardown for the mlx5 driver you are using, which we/Mellanox _will_ fix soon.)

Ok

You need the vlan changes I have queued up though. I know Yoel will be very interested in those changes too! I've convinced Yoel to write an XDP program for his Border Network Gateway (BNG) production system[1], and he is a heavy VLAN user. And the plan is to open-source this when he has something working.

[1] https://www.version2.dk/blog/software-router-del-5-linux-bng-1086060

Ok - for now I need to split the traffic onto two separate 100G ports placed in two different PCIe x16 slots, to check whether the problem is mainly caused by running out of PCIe x16 bandwidth.
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 01.11.2018 o 10:22, Jesper Dangaard Brouer pisze: On Wed, 31 Oct 2018 23:20:01 +0100 Paweł Staszewski wrote: W dniu 31.10.2018 o 23:09, Eric Dumazet pisze: On 10/31/2018 02:57 PM, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Pawel is this live production traffic? Yes moved server from testlab to production to check (risking a little - but this is traffic switched to backup router : ) ) I know Yoel (Cc) is very interested to know the real-life limitation of Linux as a router, especially with VLANs like you use. So yes this is real-life traffic , real users - normal mixed internet traffic forwarded (including ddos-es :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total == enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s -- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Actually rather impressive number for a Linux router. Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s -- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s Average packet size: (28.51*10^9/8)/5248589 = 678.99 bytes (38.07*10^9/8)/3557944 = 1337.49 bytes After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) And wondering if there is something that can be improved here. Some more informations / counters / stats and perf top below: Perf top flame graph: https://uploadfiles.io/7zo6u Thanks a lot for the flame graph! System configuration(long): cat /sys/devices/system/node/node1/cpulist 14-27,42-55 cat /sys/class/net/enp175s0f0/device/numa_node 1 cat /sys/class/net/enp175s0f1/device/numa_node 1 Hint grep can give you nicer output that cat: $ grep -H . /sys/class/net/*/device/numa_node Sure: grep -H . 
/sys/class/net/*/device/numa_node /sys/class/net/enp175s0f0/device/numa_node:1 /sys/class/net/enp175s0f1/device/numa_node:1 ip -s -d link ls dev enp175s0f0 6: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 184142375840858 141347715974 2 2806325 0 85050528 TX: bytes packets errors dropped carrier collsns 99270697277430 172227994003 0 0 0 0 ip -s -d link ls dev enp175s0f1 7: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 99686284170801 173507590134 61 669685 0 100304421 TX: bytes packets errors dropped carrier collsns 184435107970545 142383178304 0 0 0 0 You have increased the default (1000) qlen to 8192, why? Was checking if higher txq will change anything But no change for settings 1000,4096,8192 But yes i do not use there any traffic shaping like hfsc/hdb etc - just default qdisc mq 0: root pfifp_fast tc qdisc show dev enp175s0f1 qdisc mq 0: root qdisc pfifo_fast 0: parent :38 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 qdisc pfifo
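Since the RX side above shows a non-zero dropped counter, it is worth seeing which driver counter is actually incrementing; a rough filter follows (counter names vary with driver and firmware, so the pattern is deliberately broad):

  # Show only the drop/discard/error counters that are non-zero
  ethtool -S enp175s0f0 | grep -iE 'drop|discard|out_of_buffer|err' | grep -v ': 0$'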
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 31.10.2018 o 23:20, Paweł Staszewski pisze: W dniu 31.10.2018 o 23:09, Eric Dumazet pisze: On 10/31/2018 02:57 PM, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total == enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s -- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s -- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) And wondering if there is something that can be improved here. Some more informations / counters / stats and perf top below: Perf top flame graph: https://uploadfiles.io/7zo6u System configuration(long): cat /sys/devices/system/node/node1/cpulist 14-27,42-55 cat /sys/class/net/enp175s0f0/device/numa_node 1 cat /sys/class/net/enp175s0f1/device/numa_node 1 ip -s -d link ls dev enp175s0f0 6: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 184142375840858 141347715974 2 2806325 0 85050528 TX: bytes packets errors dropped carrier collsns 99270697277430 172227994003 0 0 0 0 ip -s -d link ls dev enp175s0f1 7: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 99686284170801 173507590134 61 669685 0 100304421 TX: bytes packets errors dropped carrier collsns 184435107970545 142383178304 0 0 0 0 ./softnet.sh cpu total dropped squeezed collision rps flow_limit PerfTop: 108490 irqs/sec kernel:99.6% exact: 0.0% [4000Hz cycles], (all, 56 CPUs) --- 26.78% [kernel] [k] queued_spin_lock_slowpath This is highly suspect. A call graph (perf record -a -g sleep 1; perf report --stdio) would tell what is going on. perf report: https://ufile.io/rqp0h With that many TX/RX queues, I would expect you to not use RPS/RFS, and have a 1/1 RX/TX mapping, so I do not know what could request a spinlock contention. And yes there is no RPF/RFS - just 1/1 RX/TX and affinity mapping on local cpu for the network controller for 28 RX+TX queues per nic .
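The 1:1 queue-to-CPU mapping mentioned above is normally achieved by pinning each completion IRQ to one NUMA-local core; a rough sketch follows, where the mlx5_comp match on /proc/interrupts is an assumption -- check how the vectors are actually named on this box:

  # Pin mlx5 completion vectors round-robin onto node 1 cores (14-27,42-55)
  cores=($(seq 14 27) $(seq 42 55)); i=0
  for irq in $(awk -F: '/mlx5_comp/ {print $1}' /proc/interrupts); do
      echo ${cores[$((i % ${#cores[@]}))]} > /proc/irq/$irq/smp_affinity_list
      i=$((i + 1))
  done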
Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
W dniu 31.10.2018 o 23:09, Eric Dumazet pisze: On 10/31/2018 02:57 PM, Paweł Staszewski wrote: Hi So maybee someone will be interested how linux kernel handles normal traffic (not pktgen :) ) Server HW configuration: CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz NIC's: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT) Server software: FRR - as routing daemon enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS binded to local numa node) enp175s0f1 (100G) - 343 vlans to clients (28 RSS binded to local numa node) Maximum traffic that server can handle: Bandwidth bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate \ iface Rx Tx Total == enp175s0f1: 28.51 Gb/s 37.24 Gb/s 65.74 Gb/s enp175s0f0: 38.07 Gb/s 28.44 Gb/s 66.51 Gb/s -- total: 66.58 Gb/s 65.67 Gb/s 132.25 Gb/s Packets per second: bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total == enp175s0f1: 5248589.00 P/s 3486617.75 P/s 8735207.00 P/s enp175s0f0: 3557944.25 P/s 5232516.00 P/s 8790460.00 P/s -- total: 8806533.00 P/s 8719134.00 P/s 17525668.00 P/s After reaching that limits nics on the upstream side (more RX traffic) start to drop packets I just dont understand that server can't handle more bandwidth (~40Gbit/s is limit where all cpu's are 100% util) - where pps on RX side are increasing. Was thinking that maybee reached some pcie x16 limit - but x16 8GT is 126Gbit - and also when testing with pktgen i can reach more bw and pps (like 4x more comparing to normal internet traffic) And wondering if there is something that can be improved here. Some more informations / counters / stats and perf top below: Perf top flame graph: https://uploadfiles.io/7zo6u System configuration(long): cat /sys/devices/system/node/node1/cpulist 14-27,42-55 cat /sys/class/net/enp175s0f0/device/numa_node 1 cat /sys/class/net/enp175s0f1/device/numa_node 1 ip -s -d link ls dev enp175s0f0 6: enp175s0f0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1c brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 184142375840858 141347715974 2 2806325 0 85050528 TX: bytes packets errors dropped carrier collsns 99270697277430 172227994003 0 0 0 0 ip -s -d link ls dev enp175s0f1 7: enp175s0f1: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 8192 link/ether 0c:c4:7a:d8:5d:1d brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 448 numrxqueues 56 gso_max_size 65536 gso_max_segs 65535 RX: bytes packets errors dropped overrun mcast 99686284170801 173507590134 61 669685 0 100304421 TX: bytes packets errors dropped carrier collsns 184435107970545 142383178304 0 0 0 0 ./softnet.sh cpu total dropped squeezed collision rps flow_limit PerfTop: 108490 irqs/sec kernel:99.6% exact: 0.0% [4000Hz cycles], (all, 56 CPUs) --- 26.78% [kernel] [k] queued_spin_lock_slowpath This is highly suspect. A call graph (perf record -a -g sleep 1; perf report --stdio) would tell what is going on. perf report: https://ufile.io/rqp0h With that many TX/RX queues, I would expect you to not use RPS/RFS, and have a 1/1 RX/TX mapping, so I do not know what could request a spinlock contention.
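The softnet.sh referenced above presumably parses /proc/net/softnet_stat; a minimal stand-in that prints the same per-CPU counters (assuming GNU awk for strtonum) looks like this:

  # Columns of /proc/net/softnet_stat are hex: packets processed, dropped, time_squeeze
  awk '{ printf "cpu%-3d processed=%d dropped=%d squeezed=%d\n",
         NR-1, strtonum("0x" $1), strtonum("0x" $2), strtonum("0x" $3) }' /proc/net/softnet_stat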
Re: Latest net-next kernel 4.19.0+
On 30.10.2018 at 15:16, Eric Dumazet wrote:
On 10/30/2018 01:09 AM, Paweł Staszewski wrote:
On 30.10.2018 at 08:29, Eric Dumazet wrote:
On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:

Indeed this is a bug. I would expect it to produce frequent errors though, as many odd-length packets would trigger it. Do you have RXFCS? Regardless, how frequently do you see the problem?

Old kernels (before 88078d98d1bb) were simply resetting ip_summed to CHECKSUM_NONE. And before your fix (commit d55bef5059dd057bd), the mlx5 bug was canceling the bug you fixed. So we now need to also fix mlx5. And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned earlier, plus __get_unaligned_cpu32() as you hinted.

No RXFCS. And this trace shows up really frequently, like once every 3/4 seconds, like below:

[28965.776864] vlan1490: hw csum failure

Might be vlan related. Can you first check this:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 94224c22ecc310a87b6715051e335446f29bec03..6f4bfebf0d9a3ae7567062abb3ea6532b3aaf3d6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -789,13 +789,8 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
 		skb->ip_summed = CHECKSUM_COMPLETE;
 		skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
 		if (network_depth > ETH_HLEN)
-			/* CQE csum is calculated from the IP header and does
-			 * not cover VLAN headers (if present). This will add
-			 * the checksum manually.
-			 */
-			skb->csum = csum_partial(skb->data + ETH_HLEN,
-						 network_depth - ETH_HLEN,
-						 skb->csum);
+			/* Temporary debugging */
+			skb->ip_summed = CHECKSUM_NONE;
 		if (unlikely(netdev->features & NETIF_F_RXFCS))
 			skb->csum = csum_add(skb->csum,
 					     (__force __wsum)mlx5e_get_fcs(skb));

Ok thanks - will try it.
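While that debug patch is in place, the RX checksum counters already quoted in this thread give a quick way to see whether the failures stop while CHECKSUM_COMPLETE traffic keeps flowing:

  # How the receive-checksum path is being taken, and how many failures have hit the log so far
  ethtool -S enp175s0f0 | grep -E 'rx_csum_(complete|none|unnecessary)'
  dmesg | grep -c 'hw csum failure'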
Re: Latest net-next kernel 4.19.0+
W dniu 31.10.2018 o 22:05, Saeed Mahameed pisze: On Tue, 2018-10-30 at 10:32 -0700, Cong Wang wrote: On Tue, Oct 30, 2018 at 7:16 AM Eric Dumazet wrote: On 10/30/2018 01:09 AM, Paweł Staszewski wrote: W dniu 30.10.2018 o 08:29, Eric Dumazet pisze: On 10/29/2018 11:09 PM, Dimitris Michailidis wrote: Indeed this is a bug. I would expect it to produce frequent errors though as many odd-length packets would trigger it. Do you have RXFCS? Regardless, how frequently do you see the problem? Old kernels (before 88078d98d1bb) were simply resetting ip_summed to CHECKSUM_NONE And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug you fixed. So we now need to also fix mlx5. And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned earlier, plus __get_unaligned_cpu32() as you hinted. No RXFCS Same with Pawel, RXFCS is disabled by default. And this trace is rly frequently like once per 3/4 seconds like below: [28965.776864] vlan1490: hw csum failure Might be vlan related. Hi Pawel, is the vlan stripping offload disabled or enabled in your case ? To verify: ethtool -k | grep rx-vlan-offload rx-vlan-offload: on To set: ethtool -K rxvlan on/off Enabled: ethtool -k enp175s0f0 Features for enp175s0f0: rx-checksumming: on tx-checksumming: on tx-checksum-ipv4: on tx-checksum-ip-generic: off [fixed] tx-checksum-ipv6: on tx-checksum-fcoe-crc: off [fixed] tx-checksum-sctp: off [fixed] scatter-gather: on tx-scatter-gather: on tx-scatter-gather-fraglist: off [fixed] tcp-segmentation-offload: on tx-tcp-segmentation: on tx-tcp-ecn-segmentation: off [fixed] tx-tcp-mangleid-segmentation: off tx-tcp6-segmentation: on udp-fragmentation-offload: off generic-segmentation-offload: on generic-receive-offload: on large-receive-offload: off [fixed] rx-vlan-offload: on tx-vlan-offload: on ntuple-filters: off receive-hashing: on highdma: on [fixed] rx-vlan-filter: on vlan-challenged: off [fixed] tx-lockless: off [fixed] netns-local: off [fixed] tx-gso-robust: off [fixed] tx-fcoe-segmentation: off [fixed] tx-gre-segmentation: on tx-gre-csum-segmentation: on tx-ipxip4-segmentation: off [fixed] tx-ipxip6-segmentation: off [fixed] tx-udp_tnl-segmentation: on tx-udp_tnl-csum-segmentation: on tx-gso-partial: on tx-sctp-segmentation: off [fixed] tx-esp-segmentation: off [fixed] tx-udp-segmentation: on fcoe-mtu: off [fixed] tx-nocache-copy: off loopback: off [fixed] rx-fcs: off rx-all: off tx-vlan-stag-hw-insert: on rx-vlan-stag-hw-parse: off [fixed] rx-vlan-stag-filter: on [fixed] l2-fwd-offload: off [fixed] hw-tc-offload: on esp-hw-offload: off [fixed] esp-tx-csum-hw-offload: off [fixed] rx-udp_tunnel-port-offload: on tls-hw-tx-offload: off [fixed] tls-hw-rx-offload: off [fixed] rx-gro-hw: off [fixed] tls-hw-record: off [fixed] if the vlan offload is off then it will trigger the mlx5e vlan csum adjustment code pointed out by Eric. Anyhow, it should work in both cases, but i am trying to narrow down the possibilities. Also could it be a double tagged packet ? no double tagged packets there Unlike Pawel's case, we don't use vlan at all, maybe this is why we see it much less frequently than Pawel. Also, it is probably not specific to mlx5, as there is another report which is probably a non-mlx5 driver. Cong, How often does this happen ? can you some how verify if the problematic packet has extra end padding after the ip payload ? 
It would be cool if we had a feature in the kernel to store such an SKB in memory when an issue like this occurs, and let the user dump it later (via tcpdump) and send the dump to the vendor for debugging, so we could just replay it and see what happens. Thanks.
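Until something like that exists, an ordinary capture on the affected vlan is the closest approximation: grab full frames and look offline for the ones whose TCP checksum really is wrong (interface name taken from the traces above):

  # Capture full packets on the vlan that reports "hw csum failure", then inspect offline
  tcpdump -i vlan1490 -s 0 -w csum-fail.pcap
  tcpdump -r csum-fail.pcap -vv 2>/dev/null | grep -i 'cksum.*incorrect'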
Re: Latest net-next kernel 4.19.0+
W dniu 30.10.2018 o 08:29, Eric Dumazet pisze: On 10/29/2018 11:09 PM, Dimitris Michailidis wrote: Indeed this is a bug. I would expect it to produce frequent errors though as many odd-length packets would trigger it. Do you have RXFCS? Regardless, how frequently do you see the problem? Old kernels (before 88078d98d1bb) were simply resetting ip_summed to CHECKSUM_NONE And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug you fixed. So we now need to also fix mlx5. And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned earlier, plus __get_unaligned_cpu32() as you hinted. No RXFCS And this trace is rly frequently like once per 3/4 seconds like below: [28965.776864] vlan1490: hw csum failure [28965.776867] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1 [28965.776868] Call Trace: [28965.776870] [28965.776876] dump_stack+0x46/0x5b [28965.776879] __skb_checksum_complete+0x9a/0xa0 [28965.776882] tcp_v4_rcv+0xef/0x960 [28965.776884] ip_local_deliver_finish+0x49/0xd0 [28965.776886] ip_local_deliver+0x5e/0xe0 [28965.776888] ? ip_sublist_rcv_finish+0x50/0x50 [28965.776889] ip_rcv+0x41/0xc0 [28965.776891] __netif_receive_skb_one_core+0x4b/0x70 [28965.776893] netif_receive_skb_internal+0x2f/0xd0 [28965.776894] napi_gro_receive+0xb7/0xe0 [28965.776897] mlx5e_handle_rx_cqe+0x7a/0xd0 [28965.776899] mlx5e_poll_rx_cq+0xc6/0x930 [28965.776900] mlx5e_napi_poll+0xab/0xc90 [28965.776904] ? kmem_cache_free_bulk+0x1e4/0x280 [28965.776905] net_rx_action+0x1f1/0x320 [28965.776909] __do_softirq+0xec/0x2b7 [28965.776912] irq_exit+0x7b/0x80 [28965.776913] do_IRQ+0x45/0xc0 [28965.776915] common_interrupt+0xf/0xf [28965.776916] [28965.776918] RIP: 0010:mwait_idle+0x5f/0x1b0 [28965.776919] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [28965.776920] RSP: 0018:82203e98 EFLAGS: 0246 ORIG_RAX: ffd3 [28965.776921] RAX: RBX: RCX: [28965.776922] RDX: RSI: RDI: [28965.776922] RBP: R08: 00aa R09: 88046f81fbc0 [28965.776923] R10: R11: 0001006d5985 R12: 8220f780 [28965.776924] R13: 8220f780 R14: R15: [28965.776927] do_idle+0x1a3/0x1c0 [28965.776929] cpu_startup_entry+0x14/0x20 [28965.776932] start_kernel+0x488/0x4a8 [28965.776935] secondary_startup_64+0xa4/0xb0 [28965.981529] vlan1490: hw csum failure [28965.981531] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1 [28965.981532] Call Trace: [28965.981534] [28965.981539] dump_stack+0x46/0x5b [28965.981543] __skb_checksum_complete+0x9a/0xa0 [28965.981545] tcp_v4_rcv+0xef/0x960 [28965.981548] ip_local_deliver_finish+0x49/0xd0 [28965.981550] ip_local_deliver+0x5e/0xe0 [28965.981551] ? ip_sublist_rcv_finish+0x50/0x50 [28965.981552] ip_rcv+0x41/0xc0 [28965.981555] __netif_receive_skb_one_core+0x4b/0x70 [28965.981556] netif_receive_skb_internal+0x2f/0xd0 [28965.981558] napi_gro_receive+0xb7/0xe0 [28965.981560] mlx5e_handle_rx_cqe+0x7a/0xd0 [28965.981562] mlx5e_poll_rx_cq+0xc6/0x930 [28965.981563] mlx5e_napi_poll+0xab/0xc90 [28965.981567] ? 
kmem_cache_free_bulk+0x1e4/0x280 [28965.981568] net_rx_action+0x1f1/0x320 [28965.981571] __do_softirq+0xec/0x2b7 [28965.981575] irq_exit+0x7b/0x80 [28965.981576] do_IRQ+0x45/0xc0 [28965.981578] common_interrupt+0xf/0xf [28965.981579] [28965.981580] RIP: 0010:mwait_idle+0x5f/0x1b0 [28965.981582] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [28965.981583] RSP: 0018:82203e98 EFLAGS: 0246 ORIG_RAX: ffd3 [28965.981584] RAX: RBX: RCX: [28965.981585] RDX: RSI: RDI: [28965.981586] RBP: R08: 0383 R09: 88046f81fbc0 [28965.981586] R10: R11: 0001006d59b8 R12: 8220f780 [28965.981587] R13: 8220f780 R14: R15: [28965.981591] do_idle+0x1a3/0x1c0 [28965.981592] cpu_startup_entry+0x14/0x20 [28965.981596] start_kernel+0x488/0x4a8 [28965.981600] secondary_startup_64+0xa4/0xb0 [28966.511782] vlan1490: hw csum failure [28966.511785] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1 [28966.511785] Call Trace: [28966.511787] [28966.511793] dump_stack+0x46/0x5b [28966.511797] __skb_checksum_complete+0x9a/0xa0 [28966.511799] tcp_v4_rcv+0xef/0x960 [28966.511802] ip_local_deliver_finish+0x49/0xd0 [28966.511804] ip_local_deliver+0x5e/0xe0 [28966.511806] ? ip_sublist_rcv_finish+0x50/0x50 [
Re: Latest net-next kernel 4.19.0+
W dniu 30.10.2018 o 01:11, Paweł Staszewski pisze: Sorry not complete - followed by hw csum: [ 342.190831] vlan1490: hw csum failure [ 342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1 [ 342.190836] Call Trace: [ 342.190839] [ 342.190849] dump_stack+0x46/0x5b [ 342.190856] __skb_checksum_complete+0x9a/0xa0 [ 342.190859] tcp_v4_rcv+0xef/0x960 [ 342.190864] ip_local_deliver_finish+0x49/0xd0 [ 342.190866] ip_local_deliver+0x5e/0xe0 [ 342.190869] ? ip_sublist_rcv_finish+0x50/0x50 [ 342.190870] ip_rcv+0x41/0xc0 [ 342.190874] __netif_receive_skb_one_core+0x4b/0x70 [ 342.190877] netif_receive_skb_internal+0x2f/0xd0 [ 342.190879] napi_gro_receive+0xb7/0xe0 [ 342.190884] mlx5e_handle_rx_cqe+0x7a/0xd0 [ 342.190886] mlx5e_poll_rx_cq+0xc6/0x930 [ 342.190888] mlx5e_napi_poll+0xab/0xc90 [ 342.190893] ? kmem_cache_free_bulk+0x1e4/0x280 [ 342.190895] net_rx_action+0x1f1/0x320 [ 342.190901] __do_softirq+0xec/0x2b7 [ 342.190908] irq_exit+0x7b/0x80 [ 342.190910] do_IRQ+0x45/0xc0 [ 342.190912] common_interrupt+0xf/0xf [ 342.190914] [ 342.190916] RIP: 0010:mwait_idle+0x5f/0x1b0 [ 342.190917] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [ 342.190918] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffdd [ 342.190920] RAX: RBX: 0034 RCX: [ 342.190921] RDX: RSI: RDI: [ 342.190922] RBP: 0034 R08: 0057 R09: 88086fa1fbc0 [ 342.190923] R10: R11: 000128cc R12: 88086d18 [ 342.190923] R13: 88086d18 R14: R15: [ 342.190929] do_idle+0x1a3/0x1c0 [ 342.190931] cpu_startup_entry+0x14/0x20 [ 342.190934] start_secondary+0x165/0x190 [ 342.190939] secondary_startup_64+0xa4/0xb0 W dniu 30.10.2018 o 01:10, Paweł Staszewski pisze: Hi Just checked in test lab latest kernel and have weird traces: [ 219.888673] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1 [ 219.888674] Call Trace: [ 219.888676] [ 219.888685] dump_stack+0x46/0x5b [ 219.888691] __skb_checksum_complete+0x9a/0xa0 [ 219.888694] tcp_v4_rcv+0xef/0x960 [ 219.888698] ip_local_deliver_finish+0x49/0xd0 [ 219.888700] ip_local_deliver+0x5e/0xe0 [ 219.888702] ? ip_sublist_rcv_finish+0x50/0x50 [ 219.888703] ip_rcv+0x41/0xc0 [ 219.888706] __netif_receive_skb_one_core+0x4b/0x70 [ 219.888708] netif_receive_skb_internal+0x2f/0xd0 [ 219.888710] napi_gro_receive+0xb7/0xe0 [ 219.888714] mlx5e_handle_rx_cqe+0x7a/0xd0 [ 219.888716] mlx5e_poll_rx_cq+0xc6/0x930 [ 219.888717] mlx5e_napi_poll+0xab/0xc90 [ 219.888722] ? enqueue_task_fair+0x286/0xc40 [ 219.888723] ? 
enqueue_task_fair+0x1d6/0xc40 [ 219.888725] net_rx_action+0x1f1/0x320 [ 219.888730] __do_softirq+0xec/0x2b7 [ 219.888736] irq_exit+0x7b/0x80 [ 219.888737] do_IRQ+0x45/0xc0 [ 219.888740] common_interrupt+0xf/0xf [ 219.888742] [ 219.888743] RIP: 0010:mwait_idle+0x5f/0x1b0 [ 219.888745] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [ 219.888746] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffde [ 219.888749] RAX: RBX: 0034 RCX: [ 219.888749] RDX: RSI: RDI: [ 219.888750] RBP: 0034 R08: 003b R09: 88086fa1fbc0 [ 219.888751] R10: R11: b15d R12: 88086d18 [ 219.888752] R13: 88086d18 R14: R15: [ 219.888754] do_idle+0x1a3/0x1c0 [ 219.888757] cpu_startup_entry+0x14/0x20 [ 219.888760] start_secondary+0x165/0x190 Also some perf top attacked to this - 14G rx traffic on vlans (pktgen generated random destination ip's and forwarded by test server) PerfTop: 45296 irqs/sec kernel:99.3% exact: 0.0% [4000Hz cycles], (all, 56 CPUs) --- 7.43% [kernel] [k] mlx5e_skb_from_cqe_linear 5.17% [kernel] [k] mlx5e_sq_xmit 3.83% [kernel] [k] fib_table_lookup 3.41% [kernel] [k] irq_entries_start 2.91% [kernel] [k] build_skb 2.50% [kernel] [k] mlx5_eq_int 2.29% [kernel] [k] _raw_spin_lock 2.27% [kernel] [k] tasklet_action_common.isra.21 1.99% [kernel] [k] _raw_spin_lock_irqsave 1.91% [kernel] [k] memcpy_erms
Re: Latest net-next kernel 4.19.0+
Sorry not complete - followed by hw csum: [ 342.190831] vlan1490: hw csum failure [ 342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1 [ 342.190836] Call Trace: [ 342.190839] [ 342.190849] dump_stack+0x46/0x5b [ 342.190856] __skb_checksum_complete+0x9a/0xa0 [ 342.190859] tcp_v4_rcv+0xef/0x960 [ 342.190864] ip_local_deliver_finish+0x49/0xd0 [ 342.190866] ip_local_deliver+0x5e/0xe0 [ 342.190869] ? ip_sublist_rcv_finish+0x50/0x50 [ 342.190870] ip_rcv+0x41/0xc0 [ 342.190874] __netif_receive_skb_one_core+0x4b/0x70 [ 342.190877] netif_receive_skb_internal+0x2f/0xd0 [ 342.190879] napi_gro_receive+0xb7/0xe0 [ 342.190884] mlx5e_handle_rx_cqe+0x7a/0xd0 [ 342.190886] mlx5e_poll_rx_cq+0xc6/0x930 [ 342.190888] mlx5e_napi_poll+0xab/0xc90 [ 342.190893] ? kmem_cache_free_bulk+0x1e4/0x280 [ 342.190895] net_rx_action+0x1f1/0x320 [ 342.190901] __do_softirq+0xec/0x2b7 [ 342.190908] irq_exit+0x7b/0x80 [ 342.190910] do_IRQ+0x45/0xc0 [ 342.190912] common_interrupt+0xf/0xf [ 342.190914] [ 342.190916] RIP: 0010:mwait_idle+0x5f/0x1b0 [ 342.190917] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [ 342.190918] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffdd [ 342.190920] RAX: RBX: 0034 RCX: [ 342.190921] RDX: RSI: RDI: [ 342.190922] RBP: 0034 R08: 0057 R09: 88086fa1fbc0 [ 342.190923] R10: R11: 000128cc R12: 88086d18 [ 342.190923] R13: 88086d18 R14: R15: [ 342.190929] do_idle+0x1a3/0x1c0 [ 342.190931] cpu_startup_entry+0x14/0x20 [ 342.190934] start_secondary+0x165/0x190 [ 342.190939] secondary_startup_64+0xa4/0xb0 W dniu 30.10.2018 o 01:10, Paweł Staszewski pisze: Hi Just checked in test lab latest kernel and have weird traces: [ 219.888673] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1 [ 219.888674] Call Trace: [ 219.888676] [ 219.888685] dump_stack+0x46/0x5b [ 219.888691] __skb_checksum_complete+0x9a/0xa0 [ 219.888694] tcp_v4_rcv+0xef/0x960 [ 219.888698] ip_local_deliver_finish+0x49/0xd0 [ 219.888700] ip_local_deliver+0x5e/0xe0 [ 219.888702] ? ip_sublist_rcv_finish+0x50/0x50 [ 219.888703] ip_rcv+0x41/0xc0 [ 219.888706] __netif_receive_skb_one_core+0x4b/0x70 [ 219.888708] netif_receive_skb_internal+0x2f/0xd0 [ 219.888710] napi_gro_receive+0xb7/0xe0 [ 219.888714] mlx5e_handle_rx_cqe+0x7a/0xd0 [ 219.888716] mlx5e_poll_rx_cq+0xc6/0x930 [ 219.888717] mlx5e_napi_poll+0xab/0xc90 [ 219.888722] ? enqueue_task_fair+0x286/0xc40 [ 219.888723] ? enqueue_task_fair+0x1d6/0xc40 [ 219.888725] net_rx_action+0x1f1/0x320 [ 219.888730] __do_softirq+0xec/0x2b7 [ 219.888736] irq_exit+0x7b/0x80 [ 219.888737] do_IRQ+0x45/0xc0 [ 219.888740] common_interrupt+0xf/0xf [ 219.888742] [ 219.888743] RIP: 0010:mwait_idle+0x5f/0x1b0 [ 219.888745] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [ 219.888746] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffde [ 219.888749] RAX: RBX: 0034 RCX: [ 219.888749] RDX: RSI: RDI: [ 219.888750] RBP: 0034 R08: 003b R09: 88086fa1fbc0 [ 219.888751] R10: R11: b15d R12: 88086d18 [ 219.888752] R13: 88086d18 R14: R15: [ 219.888754] do_idle+0x1a3/0x1c0 [ 219.888757] cpu_startup_entry+0x14/0x20 [ 219.888760] start_secondary+0x165/0x190
Latest net-next kernel 4.19.0+
Hi Just checked in test lab latest kernel and have weird traces: [ 219.888673] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1 [ 219.888674] Call Trace: [ 219.888676] [ 219.888685] dump_stack+0x46/0x5b [ 219.888691] __skb_checksum_complete+0x9a/0xa0 [ 219.888694] tcp_v4_rcv+0xef/0x960 [ 219.888698] ip_local_deliver_finish+0x49/0xd0 [ 219.888700] ip_local_deliver+0x5e/0xe0 [ 219.888702] ? ip_sublist_rcv_finish+0x50/0x50 [ 219.888703] ip_rcv+0x41/0xc0 [ 219.888706] __netif_receive_skb_one_core+0x4b/0x70 [ 219.888708] netif_receive_skb_internal+0x2f/0xd0 [ 219.888710] napi_gro_receive+0xb7/0xe0 [ 219.888714] mlx5e_handle_rx_cqe+0x7a/0xd0 [ 219.888716] mlx5e_poll_rx_cq+0xc6/0x930 [ 219.888717] mlx5e_napi_poll+0xab/0xc90 [ 219.888722] ? enqueue_task_fair+0x286/0xc40 [ 219.888723] ? enqueue_task_fair+0x1d6/0xc40 [ 219.888725] net_rx_action+0x1f1/0x320 [ 219.888730] __do_softirq+0xec/0x2b7 [ 219.888736] irq_exit+0x7b/0x80 [ 219.888737] do_IRQ+0x45/0xc0 [ 219.888740] common_interrupt+0xf/0xf [ 219.888742] [ 219.888743] RIP: 0010:mwait_idle+0x5f/0x1b0 [ 219.888745] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [ 219.888746] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffde [ 219.888749] RAX: RBX: 0034 RCX: [ 219.888749] RDX: RSI: RDI: [ 219.888750] RBP: 0034 R08: 003b R09: 88086fa1fbc0 [ 219.888751] R10: R11: b15d R12: 88086d18 [ 219.888752] R13: 88086d18 R14: R15: [ 219.888754] do_idle+0x1a3/0x1c0 [ 219.888757] cpu_startup_entry+0x14/0x20 [ 219.888760] start_secondary+0x165/0x190
Re: after adding > 200vlans to mlx nic no traffic
W dniu 31.01.2018 o 13:19, Gal Pressman pisze: On 30-Jan-18 17:57, Paweł Staszewski wrote: W dniu 30.01.2018 o 15:57, Gal Pressman pisze: On 30-Jan-18 02:29, Paweł Staszewski wrote: Weird thing with mellanox mlx5 (connectx-4) kernel 4.15-rc9 - from net-next davem tree after: ip link add link enp175s0f1 name vlan1538 type vlan id 1538 ip link set up dev vlan1538 traffic on vlan is working But after VID="1160 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 150 0 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1534 1535 1394 1393 1550 1500 1526 1536 1537 1538 1539 1540 1542 1541 1543 1544 1801 1546 1547 1548 1 549 1735 3132 3143 3104 3125 3103 3115 3134 3105 3113 3141 4009 3144 3130 1803 3146 3148 3109 1551 1552 1553 1554 1555 1556 1558 1559 1560 1561 1562 1563 1564 1565 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1591 1592 1593 1594 1595 1596 1597 1598 1599 1557 1545 2001 250 4043 1806 1600 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1625 1626 1627 1628 1629 1630 1631 1632 1634 1635 1636 1640 1641 164 2 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1601 1666 1667 1668 1669 1670 1671 1672 1673 1674 1676 1677 1678 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1696 1 697 1698 1712 1817 1869 1810 1814 1818 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1885 1890 1891 1892 1893 1894 1895 1898 1881 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2290" for i in $VID do ip link add link enp175s0f1 name vlan$i type vlan id $i done And setting vlan 1538 up - there is no received traffic on this vlan. So searching for broken things (last time same problem was with ixgbe) ethtool -K enp175s0f1 rx-vlan-filter off And all vlans attached to this device start working Hi Pawel, I tried to reproduce the issue in our local setups without success. Can you please provide more information? are there any errors in dmesg? did you configure anything else that might be relevant to this issue? Do you know if this is a new degradation to 4.15-rc9? previous kernel used was 4.13.2 - without this problem. current kernel is net-next 4.15.0-rc9+ https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git Try to send traffic over the vlans and sample the ethtool counters (ethtool -S enp175s0f1) of the receiver mlx5 interface over time, this might help us trace where the packets drop. Yes traffic is going out from interface - bot there is nothing on RX - tcpdump shows no packets arriving to interface I am running 4.15.0-rc9+ from Dave's tree, currently on commit 91e6dd828425 ("ipmr: Fix ptrdiff_t print formatting"). Tested with the commands you provided and same configuration, the issue does not reproduce on our setups. Did you see any errors in dmesg? anything coming from mlx5 driver? No errors in dmesg Which firmware version are you using? Please provide your .config file, perhaps it is making the difference. 
Ok maybee I will add also ethtool configuration that is started before ip link vlan is added: ifc='enp175s0f0 enp175s0f1' for i in $ifc do ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 4096 tx 4096 ip link set $i txqueuelen 1000 ethtool -L $i combined 28 ethtool -N $i rx-flow-hash udp4 sdfn ethtool -C $i adaptive-rx off rx-usecs 256 rx-frames 128 done There are two interfaces enp175s0f0 enp175s0f1 First one have also some vlans: Below full list: cat /proc/net/vlan/config VLAN Dev name | VLAN ID Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD vlan1538 | 1538 | enp175s0f1 vlan1160 | 1160 | enp175s0f1 vlan1450 | 1450 | enp175s0f1 vlan1451 | 1451 | enp175s0f1 vlan1452 | 1452 | enp175s0f1 vlan1453 | 1453 | enp175s0f1 vlan1454 | 1454 | enp175s0f1 vlan1455 | 1455 | enp175s0f1 vlan1456 | 1456 | enp175s0f1 vlan1457 | 1457 | enp175s0f1 vlan1458 | 1458 | enp175s0f1 vlan1459 | 1459 | enp175s0f1 vlan1460 | 1460 | enp175s0f1 vlan1461 | 1461 | enp175s0f1 vlan1462 | 1462 | enp175s0f1 vlan1463 | 1463 | enp175s0f1 vlan1464 | 1464 | enp175s0f1 vlan1465 | 1465 | enp175s0f1 vlan1466 | 1466 | enp175s0f1 vlan1467 | 1467 | enp175s0f1
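To answer the firmware question, and to double-check how many vlan devices the port is actually carrying, the following is enough (the /proc path is the standard 8021q one already shown above):

  # Driver and firmware version of the mlx5 port
  ethtool -i enp175s0f1
  # Number of vlan interfaces currently stacked on that port
  grep -c 'enp175s0f1$' /proc/net/vlan/config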
Re: after adding > 200vlans to mlx nic no traffic
W dniu 30.01.2018 o 15:57, Gal Pressman pisze: On 30-Jan-18 02:29, Paweł Staszewski wrote: Weird thing with mellanox mlx5 (connectx-4) kernel 4.15-rc9 - from net-next davem tree after: ip link add link enp175s0f1 name vlan1538 type vlan id 1538 ip link set up dev vlan1538 traffic on vlan is working But after VID="1160 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 150 0 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1534 1535 1394 1393 1550 1500 1526 1536 1537 1538 1539 1540 1542 1541 1543 1544 1801 1546 1547 1548 1 549 1735 3132 3143 3104 3125 3103 3115 3134 3105 3113 3141 4009 3144 3130 1803 3146 3148 3109 1551 1552 1553 1554 1555 1556 1558 1559 1560 1561 1562 1563 1564 1565 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1591 1592 1593 1594 1595 1596 1597 1598 1599 1557 1545 2001 250 4043 1806 1600 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1625 1626 1627 1628 1629 1630 1631 1632 1634 1635 1636 1640 1641 164 2 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1601 1666 1667 1668 1669 1670 1671 1672 1673 1674 1676 1677 1678 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1696 1 697 1698 1712 1817 1869 1810 1814 1818 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1885 1890 1891 1892 1893 1894 1895 1898 1881 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2290" for i in $VID do ip link add link enp175s0f1 name vlan$i type vlan id $i done And setting vlan 1538 up - there is no received traffic on this vlan. So searching for broken things (last time same problem was with ixgbe) ethtool -K enp175s0f1 rx-vlan-filter off And all vlans attached to this device start working Hi Pawel, I tried to reproduce the issue in our local setups without success. Can you please provide more information? are there any errors in dmesg? did you configure anything else that might be relevant to this issue? Do you know if this is a new degradation to 4.15-rc9? previous kernel used was 4.13.2 - without this problem. current kernel is net-next 4.15.0-rc9+ https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git Try to send traffic over the vlans and sample the ethtool counters (ethtool -S enp175s0f1) of the receiver mlx5 interface over time, this might help us trace where the packets drop. 
Yes traffic is going out from interface - bot there is nothing on RX - tcpdump shows no packets arriving to interface Thank you for reporting this, Gal Interface settings: (working case with rx vlan filter turned off) ethtool -k enp175s0f1 Features for enp175s0f1: rx-checksumming: on tx-checksumming: on tx-checksum-ipv4: on tx-checksum-ip-generic: off [fixed] tx-checksum-ipv6: on tx-checksum-fcoe-crc: off [fixed] tx-checksum-sctp: off [fixed] scatter-gather: on tx-scatter-gather: on tx-scatter-gather-fraglist: off [fixed] tcp-segmentation-offload: on tx-tcp-segmentation: on tx-tcp-ecn-segmentation: off [fixed] tx-tcp-mangleid-segmentation: off tx-tcp6-segmentation: on udp-fragmentation-offload: off generic-segmentation-offload: on generic-receive-offload: on large-receive-offload: off rx-vlan-offload: on tx-vlan-offload: on ntuple-filters: off receive-hashing: on highdma: on [fixed] rx-vlan-filter: off vlan-challenged: off [fixed] tx-lockless: off [fixed] netns-local: off [fixed] tx-gso-robust: off [fixed] tx-fcoe-segmentation: off [fixed] tx-gre-segmentation: on tx-gre-csum-segmentation: on tx-ipxip4-segmentation: off [fixed] tx-ipxip6-segmentation: off [fixed] tx-udp_tnl-segmentation: on tx-udp_tnl-csum-segmentation: on tx-gso-partial: on tx-sctp-segmentation: off [fixed] tx-esp-segmentation: off [fixed] fcoe-mtu: off [fixed] tx-nocache-copy: off loopback: off [fixed] rx-fcs: off rx-all: off tx-vlan-stag-hw-insert: on rx-vlan-stag-hw-parse: off [fixed] rx-vlan-stag-filter: on [fixed] l2-fwd-offload: off [fixed] hw-tc-offload: off esp-hw-offload: off [fixed] esp-tx-csum-hw-offload: off [fixed] rx-udp_tunnel-port-offload: on rx-gro-hw: off [fixed] Coalesce parameters for enp175s0f1: Adaptive RX: off TX: off stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 dmac: 32571 rx-usecs: 256 rx-frames: 128 rx-usecs-irq: 0 rx-frames-irq: 0 tx-usecs: 16 tx-frames: 32 tx-usecs-irq: 0 tx-frames-irq: 0 rx
after adding > 200vlans to mlx nic no traffic
Weird thing with mellanox mlx5 (connectx-4) kernel 4.15-rc9 - from net-next davem tree after: ip link add link enp175s0f1 name vlan1538 type vlan id 1538 ip link set up dev vlan1538 traffic on vlan is working But after VID="1160 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 150 0 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1534 1535 1394 1393 1550 1500 1526 1536 1537 1538 1539 1540 1542 1541 1543 1544 1801 1546 1547 1548 1 549 1735 3132 3143 3104 3125 3103 3115 3134 3105 3113 3141 4009 3144 3130 1803 3146 3148 3109 1551 1552 1553 1554 1555 1556 1558 1559 1560 1561 1562 1563 1564 1565 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1591 1592 1593 1594 1595 1596 1597 1598 1599 1557 1545 2001 250 4043 1806 1600 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1625 1626 1627 1628 1629 1630 1631 1632 1634 1635 1636 1640 1641 164 2 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1601 1666 1667 1668 1669 1670 1671 1672 1673 1674 1676 1677 1678 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1696 1 697 1698 1712 1817 1869 1810 1814 1818 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1885 1890 1891 1892 1893 1894 1895 1898 1881 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2290" for i in $VID do ip link add link enp175s0f1 name vlan$i type vlan id $i done And setting vlan 1538 up - there is no received traffic on this vlan. So searching for broken things (last time same problem was with ixgbe) ethtool -K enp175s0f1 rx-vlan-filter off And all vlans attached to this device start working
xdp_router_ipv4 mellanox problem
Hi Want to do some tests with xdp_router on two 100G physical interfaces but: Jan 29 17:00:40 HOST kernel: mlx5_core :af:00.0: MLX5E: StrdRq(0) RqSz(1024) StrdSz(1) RxCqeCmprss(0) Jan 29 17:00:40 HOST kernel: mlx5_core :af:00.0 enp175s0f0: Link up Jan 29 17:00:41 HOST kernel: mlx5_core :af:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(1) RxCqeCmprss(0) Jan 29 17:00:41 HOST kernel: mlx5_core :af:00.1 enp175s0f1: Link up Jan 29 17:00:41 HOST kernel: [ cut here ] Jan 29 17:00:41 HOST kernel: Driver unsupported XDP return value 4, expect packet loss! Jan 29 17:00:41 HOST kernel: WARNING: CPU: 43 PID: 0 at net/core/filter.c:3901 bpf_warn_invalid_xdp_action+0x34/0x40 Jan 29 17:00:41 HOST kernel: Modules linked in: x86_pkg_temp_thermal ipmi_si Jan 29 17:00:41 HOST kernel: CPU: 43 PID: 0 Comm: swapper/43 Not tainted 4.15.0-rc9+ #1 Jan 29 17:00:41 HOST kernel: RIP: 0010:bpf_warn_invalid_xdp_action+0x34/0x40 Jan 29 17:00:41 HOST kernel: RSP: 0018:88087f9c3dc8 EFLAGS: 00010296 Jan 29 17:00:41 HOST kernel: RAX: 003a RBX: 88081ea38000 RCX: 0006 Jan 29 17:00:41 HOST kernel: RDX: 0007 RSI: 0092 RDI: 88087f9d53d0 Jan 29 17:00:41 HOST kernel: RBP: 88087f9c3e58 R08: 0001 R09: 0536 Jan 29 17:00:41 HOST kernel: R10: 0004 R11: 0536 R12: 8808304d3000 Jan 29 17:00:41 HOST kernel: R13: 02c0 R14: 88081e53c000 R15: c907d000 Jan 29 17:00:41 HOST kernel: FS: () GS:88087f9c() knlGS: Jan 29 17:00:41 HOST kernel: CS: 0010 DS: ES: CR0: 80050033 Jan 29 17:00:41 HOST kernel: CR2: 02038648 CR3: 0220a002 CR4: 007606e0 Jan 29 17:00:41 HOST kernel: DR0: DR1: DR2: Jan 29 17:00:41 HOST kernel: DR3: DR6: fffe0ff0 DR7: 0400 Jan 29 17:00:41 HOST kernel: PKRU: 5554 Jan 29 17:00:41 HOST kernel: Call Trace: Jan 29 17:00:41 HOST kernel: Jan 29 17:00:41 HOST kernel: mlx5e_handle_rx_cqe+0x279/0x900 Jan 29 17:00:41 HOST kernel: mlx5e_poll_rx_cq+0xb3/0x860 Jan 29 17:00:41 HOST kernel: mlx5e_napi_poll+0x81/0x6f0 Jan 29 17:00:41 HOST kernel: ? mlx5_cq_completion+0x4d/0xb0 Jan 29 17:00:41 HOST kernel: net_rx_action+0x1cd/0x2f0 Jan 29 17:00:41 HOST kernel: __do_softirq+0xe4/0x275 Jan 29 17:00:41 HOST kernel: irq_exit+0x6b/0x70 Jan 29 17:00:41 HOST kernel: do_IRQ+0x45/0xc0 Jan 29 17:00:41 HOST kernel: common_interrupt+0x95/0x95 Jan 29 17:00:41 HOST kernel: Jan 29 17:00:41 HOST kernel: RIP: 0010:mwait_idle+0x59/0x160 Jan 29 17:00:41 HOST kernel: RSP: 0018:c90003497ef8 EFLAGS: 0246 ORIG_RAX: ffdd Jan 29 17:00:41 HOST kernel: RAX: RBX: 002b RCX: Jan 29 17:00:41 HOST kernel: RDX: RSI: RDI: Jan 29 17:00:41 HOST kernel: RBP: 002b R08: 1000 R09: Jan 29 17:00:41 HOST kernel: R10: R11: 000100130e40 R12: 88086d165000 Jan 29 17:00:41 HOST kernel: R13: 88086d165000 R14: R15: Jan 29 17:00:41 HOST kernel: do_idle+0x14e/0x160 Jan 29 17:00:41 HOST kernel: cpu_startup_entry+0x14/0x20 Jan 29 17:00:41 HOST kernel: secondary_startup_64+0xa5/0xb0 Jan 29 17:00:41 HOST kernel: Code: c3 83 ff 04 48 c7 c0 1a cf 10 82 89 fa c6 05 9a df b4 00 01 48 c7 c6 22 cf 10 82 48 c7 c7 38 cf 10 82 48 0f 47 f0 e8 ec 19 8b ff <0f> ff c3 66 0f 1f 84 00 00 00 00 00 81 fe ff ff 00 00 55 48 89 Jan 29 17:00:41 HOST kernel: ---[ end trace 2b255fac8d0824de ]--- I can attach xdp_router_ipv4 to any vlan interface without crash ./xdp_router_ipv4 vlan4032 **loading bpf file* Attached to 8 ***ROUTE TABLE* NEW Route entry Destination Gateway Genmask Metric Iface 192.168.32.0 0 24 0 vlan4032 ***ARP TABLE*** Address HwAddress 7920a8c0 8da6fb902500 120a8c0 44fc9e0c5e4c But after attaching to physical interface there is "above trace". Thanks Paweł
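For context, return value 4 in that warning is XDP_REDIRECT, which is what the sample returns when it forwards a packet, so the warning simply means this driver/kernel combination does not yet handle redirect on the physical mlx5 interface. The value can be confirmed from the UAPI header:

  # XDP_ABORTED=0, XDP_DROP=1, XDP_PASS=2, XDP_TX=3, XDP_REDIRECT=4
  grep -n -A6 'enum xdp_action' include/uapi/linux/bpf.h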
Re: kernel 4.15.0-rc9+ (net-next) high cpu load at 50Gbit/s - about 6Mpps
W dniu 27.01.2018 o 23:23, Paweł Staszewski pisze: Hi Today I made some real life traffic tests with kernel 4.15.0-rc9 but when traffic reach 50Gbit/s and about 6Mpps cpou load rises fast from 48% to 100% for all cpu cores. Here is some graph that presenting how cpu load rises when there was more pps. https://ibb.co/mhD5ob here is perf record from that time: https://pastebin.com/3zqG1rvE There is 8x 10G ixgbe 82599 interfaces teamed with teamd. No traffic queueing - only pfifo fast on all interfaces. No NAT or iptables forles other than INPUT (about 30rules) All nic's have same ethtool settings: ethtool -k eth0 Features for eth0: Cannot get device udp-fragmentation-offload settings: Operation not supported rx-checksumming: on tx-checksumming: on tx-checksum-ipv4: off [fixed] tx-checksum-ip-generic: on tx-checksum-ipv6: off [fixed] tx-checksum-fcoe-crc: off [fixed] tx-checksum-sctp: on scatter-gather: on tx-scatter-gather: on tx-scatter-gather-fraglist: off [fixed] tcp-segmentation-offload: on tx-tcp-segmentation: on tx-tcp-ecn-segmentation: off [fixed] tx-tcp-mangleid-segmentation: off tx-tcp6-segmentation: on udp-fragmentation-offload: off generic-segmentation-offload: on generic-receive-offload: on large-receive-offload: off rx-vlan-offload: on tx-vlan-offload: on ntuple-filters: on receive-hashing: on highdma: on [fixed] rx-vlan-filter: on vlan-challenged: off [fixed] tx-lockless: off [fixed] netns-local: off [fixed] tx-gso-robust: off [fixed] tx-fcoe-segmentation: off [fixed] tx-gre-segmentation: on tx-gre-csum-segmentation: on tx-ipxip4-segmentation: on tx-ipxip6-segmentation: on tx-udp_tnl-segmentation: on tx-udp_tnl-csum-segmentation: on tx-gso-partial: on tx-sctp-segmentation: off [fixed] tx-esp-segmentation: off [fixed] fcoe-mtu: off [fixed] tx-nocache-copy: off loopback: off [fixed] rx-fcs: off [fixed] rx-all: off tx-vlan-stag-hw-insert: off [fixed] rx-vlan-stag-hw-parse: off [fixed] rx-vlan-stag-filter: off [fixed] l2-fwd-offload: off hw-tc-offload: off esp-hw-offload: off [fixed] esp-tx-csum-hw-offload: off [fixed] rx-udp_tunnel-port-offload: on ethtool -g eth0 Ring parameters for eth0: Pre-set maximums: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 4096 Current hardware settings: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 2048 ethtool -c eth0 Coalesce parameters for eth0: Adaptive RX: off TX: off stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 rx-usecs: 512 rx-frames: 0 rx-usecs-irq: 0 rx-frames-irq: 0 tx-usecs: 0 tx-frames: 0 tx-usecs-irq: 0 tx-frames-irq: 0 rx-usecs-low: 0 rx-frame-low: 0 tx-usecs-low: 0 tx-frame-low: 0 rx-usecs-high: 0 rx-frame-high: 0 tx-usecs-high: 0 tx-frame-high: 0 Peft top for kernel 4.15.0-rc9 below (all 40 cores 100% cpu load with 6.3Mpps) 20.96% [kernel] [k] queued_spin_lock_slowpath 5.51% [kernel] [k] ixgbe_poll 5.49% [kernel] [k] ixgbe_xmit_frame_ring 4.39% [kernel] [k] do_raw_spin_lock 4.29% [kernel] [k] sch_direct_xmit 4.11% [kernel] [k] fib_table_lookup 3.11% [team_mode_roundrobin] [k] rr_transmit 2.71% [kernel] [k] __dev_queue_xmit 2.62% [kernel] [k] __ptr_ring_peek 2.39% [kernel] [k] skb_release_data 2.18% [kernel] [k] dev_gro_receive 1.75% [kernel] [k] __qdisc_run 1.67% [kernel] [k] pfifo_fast_enqueue 1.57% [kernel] [k] netdev_pick_tx 1.56% [kernel] [k] page_frag_free 1.48% [kernel] [k] ip_finish_output2 1.38% [kernel] [k] __slab_free 1.36% [kernel] [k] skb_unref 1.34% [kernel] [k] ixgbe_maybe_stop_tx 1.30% [kernel] [k] vlan_do_receive 1.28% [kernel] [k] pfifo_fast_dequeue 1.23% [kernel] [k] virt_to_head_page Same configuration 
kernel 4.15.0-rc3 (50% cpu load on all 40 cores with 6.3Mpps) 7.81% [kernel] [k] ixgbe_xmit_frame_ring 7.61% [kernel] [k] ixgbe_poll 7.09% [kernel] [k] do_raw_spin_lock 5.63% [kernel] [k] fib_table_lookup 5.19% [kernel] [k] __dev_queue_xmit 4.38% [team_mode_roundrobin] [k] rr_transmit 3.10% [kernel] [k] netdev_pick_tx 2.79% [kernel] [k] skb_release_data 2.34% [kernel] [k] dev_gro_receive 1.99% [kernel] [k] page_frag_free 1.96% [kernel] [k] skb_unref 1.92% [kernel] [k] virt_to_head_page 1.90% [kernel] [k] ixgbe_maybe_st
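In the 4.15.0-rc9 profile above, queued_spin_lock_slowpath together with pfifo_fast_enqueue/dequeue and sch_direct_xmit points at contention on the root qdisc locks while many CPUs transmit through the team slaves. A hedged sketch of one common mitigation, not something tested in this thread: replace the single pfifo_fast root with mq, which attaches one child qdisc per hardware TX queue so transmit CPUs are not serialized on one lock per device. The interface names below are placeholders:

# Sketch only: give each hardware TX queue its own qdisc instead of one
# shared root pfifo_fast per device.
for dev in eth0 eth1 eth2 eth3 eth4 eth5 eth6 eth7; do    # placeholder names
    tc qdisc replace dev "$dev" root handle 1: mq
    tc qdisc show dev "$dev"
done

Whether this helps depends on whether the slaves currently funnel all TX queues through a single root qdisc; it is only worth trying where the perf profile shows the qdisc spinlock dominating, as it does here.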
kernel 4.15.0-rc9+ (net-next) high cpu load at 50Gbit/s - about 6Mpps
Hi Today I made some real life traffic tests with kernel 4.15.0-rc9 but when traffic reach 50Gbit/s and about 6Mpps cpou load rises fast from 48% to 100% for all cpu cores. Here is some graph that presenting how cpu load rises when there was more pps. https://ibb.co/mhD5ob here is perf record from that time: https://pastebin.com/3zqG1rvE There is 8x 10G ixgbe 82599 interfaces teamed with teamd. No traffic queueing - only pfifo fast on all interfaces. No NAT or iptables forles other than INPUT (about 30rules) All nic's have same ethtool settings: ethtool -k eth0 Features for eth0: Cannot get device udp-fragmentation-offload settings: Operation not supported rx-checksumming: on tx-checksumming: on tx-checksum-ipv4: off [fixed] tx-checksum-ip-generic: on tx-checksum-ipv6: off [fixed] tx-checksum-fcoe-crc: off [fixed] tx-checksum-sctp: on scatter-gather: on tx-scatter-gather: on tx-scatter-gather-fraglist: off [fixed] tcp-segmentation-offload: on tx-tcp-segmentation: on tx-tcp-ecn-segmentation: off [fixed] tx-tcp-mangleid-segmentation: off tx-tcp6-segmentation: on udp-fragmentation-offload: off generic-segmentation-offload: on generic-receive-offload: on large-receive-offload: off rx-vlan-offload: on tx-vlan-offload: on ntuple-filters: on receive-hashing: on highdma: on [fixed] rx-vlan-filter: on vlan-challenged: off [fixed] tx-lockless: off [fixed] netns-local: off [fixed] tx-gso-robust: off [fixed] tx-fcoe-segmentation: off [fixed] tx-gre-segmentation: on tx-gre-csum-segmentation: on tx-ipxip4-segmentation: on tx-ipxip6-segmentation: on tx-udp_tnl-segmentation: on tx-udp_tnl-csum-segmentation: on tx-gso-partial: on tx-sctp-segmentation: off [fixed] tx-esp-segmentation: off [fixed] fcoe-mtu: off [fixed] tx-nocache-copy: off loopback: off [fixed] rx-fcs: off [fixed] rx-all: off tx-vlan-stag-hw-insert: off [fixed] rx-vlan-stag-hw-parse: off [fixed] rx-vlan-stag-filter: off [fixed] l2-fwd-offload: off hw-tc-offload: off esp-hw-offload: off [fixed] esp-tx-csum-hw-offload: off [fixed] rx-udp_tunnel-port-offload: on ethtool -g eth0 Ring parameters for eth0: Pre-set maximums: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 4096 Current hardware settings: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 2048 ethtool -c eth0 Coalesce parameters for eth0: Adaptive RX: off TX: off stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 rx-usecs: 512 rx-frames: 0 rx-usecs-irq: 0 rx-frames-irq: 0 tx-usecs: 0 tx-frames: 0 tx-usecs-irq: 0 tx-frames-irq: 0 rx-usecs-low: 0 rx-frame-low: 0 tx-usecs-low: 0 tx-frame-low: 0 rx-usecs-high: 0 rx-frame-high: 0 tx-usecs-high: 0 tx-frame-high: 0
Re: Huge memory leak with 4.15.0-rc2+
W dniu 2017-12-11 o 23:27, Paweł Staszewski pisze: W dniu 2017-12-11 o 23:15, John Fastabend pisze: On 12/11/2017 01:48 PM, Paweł Staszewski wrote: W dniu 2017-12-11 o 22:23, Paweł Staszewski pisze: Hi I just upgraded some testing host to 4.15.0-rc2+ kernel And after some time of traffic processing - when traffic on all ports reach about 3Mpps - memleak started. [...] Some observations - when i disable tso on all cards there is more memleak. When traffic starts to drop - there is less and less memleak below link to memory usage graph: https://ibb.co/hU97kG And there is rising slab_unrecl - Amount of unreclaimable memory used for slab kernel allocations Forgot to add that im using hfsc and qdiscs like pfifo on classes. Maybe some error case I missed in the qdisc patches I'm looking into it. Thanks, John This is how it looks like when corelated on graph - traffic vs mem https://ibb.co/njpkqG Typical hfsc class + qdisc: ### Client interface vlan1616 tc qdisc del dev vlan1616 root tc qdisc add dev vlan1616 handle 1: root hfsc default 100 tc class add dev vlan1616 parent 1: classid 1:100 hfsc ls m2 200Mbit ul m2 200Mbit tc qdisc add dev vlan1616 parent 1:100 handle 100: pfifo limit 128 ### End TM for client interface tc qdisc del dev vlan1616 ingress tc qdisc add dev vlan1616 handle : ingress tc filter add dev vlan1616 parent : protocol ip prio 50 u32 match ip src 0.0.0.0/0 police rate 200Mbit burst 200M mtu 32k drop flowid 1:1 And this is same for about 450 vlan interfaces Good thing is that compared to 4.14.3 i have about 5% less cpu load on 4.15.0-rc2+ When hfsc will be lockless or tbf - then it will be really huge difference in cpu load on x86 when using traffic shaping - so really good job John. Yestarday changed kernel from https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=v4.15-rc3 And there is no memleak. So yes probabbly lockless qdisc patches
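Since the leak in this thread shows up as growing unreclaimable slab, a small monitoring sketch to correlate traffic with slab growth; it only records what /proc/meminfo and slabtop already expose, and the 10-second interval is arbitrary:

# Sketch only: log unreclaimable slab growth every 10 seconds.
while true; do
    ts=$(date +%s)
    sunreclaim=$(awk '/^SUnreclaim:/ {print $2}' /proc/meminfo)   # value in kB
    echo "$ts SUnreclaim_kB=$sunreclaim"
    sleep 10
done

# Top slab caches by size (skb and page-frag caches are the usual suspects
# for driver RX leaks):
slabtop -o -s c | head -20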
Re: Huge memory leak with 4.15.0-rc2+
W dniu 2017-12-11 o 23:15, John Fastabend pisze: On 12/11/2017 01:48 PM, Paweł Staszewski wrote: W dniu 2017-12-11 o 22:23, Paweł Staszewski pisze: Hi I just upgraded some testing host to 4.15.0-rc2+ kernel And after some time of traffic processing - when traffic on all ports reach about 3Mpps - memleak started. [...] Some observations - when i disable tso on all cards there is more memleak. When traffic starts to drop - there is less and less memleak below link to memory usage graph: https://ibb.co/hU97kG And there is rising slab_unrecl - Amount of unreclaimable memory used for slab kernel allocations Forgot to add that im using hfsc and qdiscs like pfifo on classes. Maybe some error case I missed in the qdisc patches I'm looking into it. Thanks, John This is how it looks like when corelated on graph - traffic vs mem https://ibb.co/njpkqG Typical hfsc class + qdisc: ### Client interface vlan1616 tc qdisc del dev vlan1616 root tc qdisc add dev vlan1616 handle 1: root hfsc default 100 tc class add dev vlan1616 parent 1: classid 1:100 hfsc ls m2 200Mbit ul m2 200Mbit tc qdisc add dev vlan1616 parent 1:100 handle 100: pfifo limit 128 ### End TM for client interface tc qdisc del dev vlan1616 ingress tc qdisc add dev vlan1616 handle : ingress tc filter add dev vlan1616 parent : protocol ip prio 50 u32 match ip src 0.0.0.0/0 police rate 200Mbit burst 200M mtu 32k drop flowid 1:1 And this is same for about 450 vlan interfaces Good thing is that compared to 4.14.3 i have about 5% less cpu load on 4.15.0-rc2+ When hfsc will be lockless or tbf - then it will be really huge difference in cpu load on x86 when using traffic shaping - so really good job John.
Re: Huge memory leak with 4.15.0-rc2+
W dniu 2017-12-11 o 22:23, Paweł Staszewski pisze: Hi I just upgraded some testing host to 4.15.0-rc2+ kernel And after some time of traffic processing - when traffic on all ports reach about 3Mpps - memleak started. Graph attached from memory usage: https://ibb.co/idK4zb HW config: Intel E5 8x Intel 82599 (used ixgbe driver from kernel) Interfaces with vlans attached All 8 ethernet ports are in one LAG group configured by team. With current settings (this host is acting as a router - and bgpd process is eating same amount of memory from the beginning about 5.2GB) cat /proc/meminfo MemTotal: 32770588 kB MemFree: 11342492 kB MemAvailable: 10982752 kB Buffers: 84704 kB Cached: 83180 kB SwapCached: 0 kB Active: 5105320 kB Inactive: 46252 kB Active(anon): 4985448 kB Inactive(anon): 1096 kB Active(file): 119872 kB Inactive(file): 45156 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 4005280 kB SwapFree: 4005280 kB Dirty: 236 kB Writeback: 0 kB AnonPages: 4983752 kB Mapped: 13556 kB Shmem: 2852 kB Slab: 1013124 kB SReclaimable: 45876 kB SUnreclaim: 967248 kB KernelStack: 7152 kB PageTables: 12164 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 20390572 kB Committed_AS: 396568 kB VmallocTotal: 34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 0 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB CmaTotal: 0 kB CmaFree: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB DirectMap4k: 1407572 kB DirectMap2M: 20504576 kB DirectMap1G: 13631488 kB ps aux --sort -rss USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 6758 1.8 14.9 5044996 4886964 ? Sl 01:22 23:21 /usr/local/sbin/bgpd -d -u root -g root -I --ignore_warnings root 6752 0.0 0.1 86272 61920 ? Ss 01:22 0:16 /usr/local/sbin/zebra -d -u root -g root -I --ignore_warnings root 6766 12.6 0.0 51592 29196 ? S 01:22 157:48 /usr/sbin/snmpd -p /var/run/snmpd.pid -Ln root 7494 0.0 0.0 708976 5896 ? Ssl 01:22 0:09 /opt/collectd/sbin/collectd root 15531 0.0 0.0 67864 5056 ? Ss 21:57 0:00 sshd: paol [priv] root 4915 0.0 0.0 271912 4904 ? Ss 01:21 0:25 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng/syslog-ng.persist --cfgfile /etc/syslog-ng/syslog-ng.conf --pidfile /run/syslog-ng.pid root 4278 0.0 0.0 37220 4164 ? Ss 01:21 0:00 /lib/systemd/systemd-udevd --daemon root 5147 0.0 0.0 32072 3232 ? Ss 01:21 0:00 /usr/sbin/sshd root 5203 0.0 0.0 28876 2436 ? S 01:21 0:00 teamd -d -f /etc/teamd.conf root 17372 0.0 0.0 17924 2388 pts/2 R+ 22:13 0:00 ps aux --sort -rss root 4789 0.0 0.0 5032 2176 ? Ss 01:21 0:00 mdadm --monitor --scan --daemonise --pid-file /var/run/mdadm.pid --syslog root 7511 0.0 0.0 12676 1920 tty4 Ss+ 01:22 0:00 /sbin/agetty 38400 tty4 linux root 7510 0.0 0.0 12676 1896 tty3 Ss+ 01:22 0:00 /sbin/agetty 38400 tty3 linux root 7512 0.0 0.0 12676 1860 tty5 Ss+ 01:22 0:00 /sbin/agetty 38400 tty5 linux root 7513 0.0 0.0 12676 1836 tty6 Ss+ 01:22 0:00 /sbin/agetty 38400 tty6 linux root 7509 0.0 0.0 12676 1832 tty2 Ss+ 01:22 0:00 /sbin/agetty 38400 tty2 linux And latest kernel that everything was working is: 4.14.3 Some observations - when i disable tso on all cards there is more memleak. When traffic starts to drop - there is less and less memleak below link to memory usage graph: https://ibb.co/hU97kG And there is rising slab_unrecl - Amount of unreclaimable memory used for slab kernel allocations Forgot to add that im using hfsc and qdiscs like pfifo on classes.
Huge memory leak with 4.15.0-rc2+
Hi I just upgraded some testing host to 4.15.0-rc2+ kernel And after some time of traffic processing - when traffic on all ports reach about 3Mpps - memleak started. Graph attached from memory usage: https://ibb.co/idK4zb HW config: Intel E5 8x Intel 82599 (used ixgbe driver from kernel) Interfaces with vlans attached All 8 ethernet ports are in one LAG group configured by team. With current settings (this host is acting as a router - and bgpd process is eating same amount of memory from the beginning about 5.2GB) cat /proc/meminfo MemTotal: 32770588 kB MemFree: 11342492 kB MemAvailable: 10982752 kB Buffers: 84704 kB Cached: 83180 kB SwapCached: 0 kB Active: 5105320 kB Inactive: 46252 kB Active(anon): 4985448 kB Inactive(anon): 1096 kB Active(file): 119872 kB Inactive(file): 45156 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 4005280 kB SwapFree: 4005280 kB Dirty: 236 kB Writeback: 0 kB AnonPages: 4983752 kB Mapped: 13556 kB Shmem: 2852 kB Slab: 1013124 kB SReclaimable: 45876 kB SUnreclaim: 967248 kB KernelStack: 7152 kB PageTables: 12164 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 20390572 kB Committed_AS: 396568 kB VmallocTotal: 34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 0 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB CmaTotal: 0 kB CmaFree: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB DirectMap4k: 1407572 kB DirectMap2M: 20504576 kB DirectMap1G: 13631488 kB ps aux --sort -rss USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 6758 1.8 14.9 5044996 4886964 ? Sl 01:22 23:21 /usr/local/sbin/bgpd -d -u root -g root -I --ignore_warnings root 6752 0.0 0.1 86272 61920 ? Ss 01:22 0:16 /usr/local/sbin/zebra -d -u root -g root -I --ignore_warnings root 6766 12.6 0.0 51592 29196 ? S 01:22 157:48 /usr/sbin/snmpd -p /var/run/snmpd.pid -Ln root 7494 0.0 0.0 708976 5896 ? Ssl 01:22 0:09 /opt/collectd/sbin/collectd root 15531 0.0 0.0 67864 5056 ? Ss 21:57 0:00 sshd: paol [priv] root 4915 0.0 0.0 271912 4904 ? Ss 01:21 0:25 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng/syslog-ng.persist --cfgfile /etc/syslog-ng/syslog-ng.conf --pidfile /run/syslog-ng.pid root 4278 0.0 0.0 37220 4164 ? Ss 01:21 0:00 /lib/systemd/systemd-udevd --daemon root 5147 0.0 0.0 32072 3232 ? Ss 01:21 0:00 /usr/sbin/sshd root 5203 0.0 0.0 28876 2436 ? S 01:21 0:00 teamd -d -f /etc/teamd.conf root 17372 0.0 0.0 17924 2388 pts/2 R+ 22:13 0:00 ps aux --sort -rss root 4789 0.0 0.0 5032 2176 ? Ss 01:21 0:00 mdadm --monitor --scan --daemonise --pid-file /var/run/mdadm.pid --syslog root 7511 0.0 0.0 12676 1920 tty4 Ss+ 01:22 0:00 /sbin/agetty 38400 tty4 linux root 7510 0.0 0.0 12676 1896 tty3 Ss+ 01:22 0:00 /sbin/agetty 38400 tty3 linux root 7512 0.0 0.0 12676 1860 tty5 Ss+ 01:22 0:00 /sbin/agetty 38400 tty5 linux root 7513 0.0 0.0 12676 1836 tty6 Ss+ 01:22 0:00 /sbin/agetty 38400 tty6 linux root 7509 0.0 0.0 12676 1832 tty2 Ss+ 01:22 0:00 /sbin/agetty 38400 tty2 linux And latest kernel that everything was working is: 4.14.3 Some observations - when i disable tso on all cards there is more memleak.
Re: [Intel-wired-lan] Instability of i40e driver on 4.9 kernel
The e1000 SourceForge tracker is a bad place to get anywhere with your problems. I just checked it again now to see if anything has changed :) But when I posted a reply to a bug where I was hitting the same problem, somebody closed the ticket and deleted my message :) So really :) On 2017-10-25 at 23:49, Pavlos Parissis wrote: On 21/10/2017 02:07 πμ, Fujinaka, Todd wrote: You picked a bunch of places to post this, and you really should've used a different place: e1000-de...@lists.sourceforge.net Just subscribed to that ML and mailed about it. Also, since you flagged the "communities" post as "answered", you're not likely to get any follow-up. The Intel communities are also not monitored as much by the wired networking people at Intel. I don't see it as "answered" when I visit the page, maybe the fact I have replied with extra information is confusing something. Anyway, it isn't that important since the "communities" posts aren't monitored by Intel people, which makes sense as it is very time consuming to monitor mailing lists and web forums at the same time. Please let us know if you have any specific issues, and please provide exact reproduction steps so we can investigate your issues, and please use e1000-devel. I hope the information I provided in my mail to e1000-devel is enough. Thanks, Pavlos
Re: intel i40e buggy driver question
W dniu 2017-10-28 o 00:34, Paweł Staszewski pisze: Hi I have many problems with 40e driver memleaks , kernel panics , stack traces , tx hungx , tx timeouts and many many others :) But the main problem that can't be resolved in linux is resolved in freebsd problem in freebsd with this: [2501243.181829] i40e :01:00.1 eno2: VSI_seid 390, Hung TX queue 17, tx_pending_hw: 1, NTC:0x16b, HWB: 0x16b, NTU: 0x16c, TAIL: 0x16c [2501243.181835] i40e :01:00.1 eno2: VSI_seid 390, Issuing force_wb for TX queue 17, Interrupt Reg: 0x0 Was solved by this: " change this piece in ixl_tso_detect_sparse() in ixl_txrx.c: if (mss < 1) { if (num > IXL_SPARSE_CHAIN) return (true); num = (mss == 0) ? 0 : 1; mss += mp->m_pkthdr.tso_segsz; } to if (num > IXL_SPARSE_CHAIN) return (true); if (mss < 1) { num = (mss == 0) ? 0 : 1; mss += mp->m_pkthdr.tso_segsz; } Intel FreeBSD Team: This will definitely prevent MDDs on the buffers you sent me. " An I have a question - how to do the same in linux ? :) Cause i have same problem in Linux with this i40e buggy driver: [224051.287277] WARNING: CPU: 3 PID: 25031 at drivers/net/ethernet/intel/i40e/i40e_txrx.c:1248 i40e_setup_rx_descriptors+0x15/0xa9 [224051.287278] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si [224051.287327] CPU: 3 PID: 25031 Comm: ip Tainted: G W 4.12.14 #2 [224051.287330] task: 880859e09880 task.stack: c900036ec000 [224051.287332] RIP: 0010:i40e_setup_rx_descriptors+0x15/0xa9 [224051.287332] RSP: 0018:c900036ef6e8 EFLAGS: 00010286 [224051.287333] RAX: 8808595eda00 RBX: 880856d36d00 RCX: 014000c1 [224051.287334] RDX: 0001 RSI: 880844418000 RDI: 880856d36d00 [224051.287334] RBP: c900036ef6f8 R08: 0001ccc3 R09: ea0021110620 [224051.287335] R10: R11: 88087effae90 R12: 8808590300a0 [224051.287335] R13: 0002 R14: fff0 R15: 0001 [224051.287336] FS: 7f1e4658b740() GS:88085e2c() knlGS: [224051.287337] CS: 0010 DS: ES: CR0: 80050033 [224051.287337] CR2: 7ffd7479 CR3: 00059e2f4000 CR4: 001406e0 [224051.287338] Call Trace: [224051.287339] i40e_vsi_open+0x7d/0x1e7 [224051.287341] i40e_open+0x4d/0xc3 [224051.287342] __dev_open+0x8b/0xcd [224051.287344] __dev_change_flags+0xa2/0x13d [224051.287346] dev_change_flags+0x20/0x53 [224051.287347] do_setlink+0x2d0/0xad6 [224051.287349] ? zone_statistics+0x5a/0x61 [224051.287350] ? get_page_from_freelist+0x4c8/0x627 [224051.287352] rtnl_newlink+0x391/0x6d6 [224051.287353] ? netdev_master_upper_dev_get+0xd/0x57 [224051.287354] ? rtnl_newlink+0x106/0x6d6 [224051.287356] ? alloc_pages_vma+0x8c/0x17a [224051.287357] ? pagevec_lru_move_fn+0x20/0xc1 [224051.287359] ? lru_cache_add_active_or_unevictable+0x27/0x7a [224051.287360] ? __handle_mm_fault+0x4c1/0x8ae [224051.287362] rtnetlink_rcv_msg+0x166/0x173 [224051.287363] ? __kmalloc_node_track_caller+0x11f/0x12f [224051.287365] ? __alloc_skb+0x89/0x175 [224051.287366] ? rtnl_newlink+0x6d6/0x6d6 [224051.287367] netlink_rcv_skb+0x57/0xa0 [224051.287369] rtnetlink_rcv+0x1e/0x25 [224051.287371] netlink_unicast+0x103/0x187 [224051.287372] netlink_sendmsg+0x28d/0x2ad [224051.287374] sock_sendmsg_nosec+0x12/0x1d [224051.287375] ___sys_sendmsg+0x19d/0x217 [224051.287377] ? kmem_cache_free+0x4b/0xf3 [224051.287492] ? alloc_pages_vma+0x147/0x17a [224051.287494] ? __page_set_anon_rmap+0x24/0x65 [224051.287495] ? get_page+0x9/0xf [224051.287496] ? __lru_cache_add+0x18/0x47 [224051.287498] ? __handle_mm_fault+0x4c1/0x8ae [224051.287499] __sys_sendmsg+0x40/0x5e [224051.287564] ? 
__sys_sendmsg+0x40/0x5e [224051.287566] SyS_sendmsg+0xd/0x17 [224051.287567] entry_SYSCALL_64_fastpath+0x13/0x94 [224051.287568] RIP: 0033:0x7f1e45cac620 [224051.287569] RSP: 002b:7ffd7478b4d8 EFLAGS: 0246 ORIG_RAX: 002e [224051.287570] RAX: ffda RBX: RCX: 7f1e45cac620 [224051.287571] RDX: RSI: 7ffd7478b520 RDI: 0003 [224051.287572] RBP: 7ffd7478b520 R08: 0001 R09: fefefeff77686d74 [224051.287572] R10: 05e6 R11: 0246 R12: 7ffd7478b560 [224051.287573] R13: 006724c0 R14: 7ffd747935e0 R15: [224051.287574] Code: 00 00 48 8b 7b 10 e8 41 f2 ff ff 48 c7 43 08 00 00 00 00 5b 5d c3 55 48 89 e5 41 54 53 48 83 7f 20 00 48 89 fb 4c 8b 67 10 74 02 <0f> ff 0f b7 7b 44 48 6b ff 18 e8 65 f5 ff ff 48 85 c0 48 89 43 [224051.287597] ---[ end trace a9810da52af61a5a ]--- [224051.287607] [ cut here ] [224051.287609] WAR
intel i40e buggy driver question
Hi I have many problems with 40e driver memleaks , kernel panics , stack traces , tx hungx , tx timeouts and many many others :) But the main problem that can't be resolved in linux is resolved in freebsd problem in freebsd with this: [2501243.181829] i40e :01:00.1 eno2: VSI_seid 390, Hung TX queue 17, tx_pending_hw: 1, NTC:0x16b, HWB: 0x16b, NTU: 0x16c, TAIL: 0x16c [2501243.181835] i40e :01:00.1 eno2: VSI_seid 390, Issuing force_wb for TX queue 17, Interrupt Reg: 0x0 Was solved by this: " change this piece in ixl_tso_detect_sparse() in ixl_txrx.c: if (mss < 1) { if (num > IXL_SPARSE_CHAIN) return (true); num = (mss == 0) ? 0 : 1; mss += mp->m_pkthdr.tso_segsz; } to if (num > IXL_SPARSE_CHAIN) return (true); if (mss < 1) { num = (mss == 0) ? 0 : 1; mss += mp->m_pkthdr.tso_segsz; } Intel FreeBSD Team: This will definitely prevent MDDs on the buffers you sent me. " An I have a question - how to do the same in linux ? :) Cause i have same problem in Linux with this i40e buggy driver: [224051.287277] WARNING: CPU: 3 PID: 25031 at drivers/net/ethernet/intel/i40e/i40e_txrx.c:1248 i40e_setup_rx_descriptors+0x15/0xa9 [224051.287278] Modules linked in: team_mode_roundrobin team x86_pkg_temp_thermal ipmi_si [224051.287327] CPU: 3 PID: 25031 Comm: ip Tainted: G W 4.12.14 #2 [224051.287330] task: 880859e09880 task.stack: c900036ec000 [224051.287332] RIP: 0010:i40e_setup_rx_descriptors+0x15/0xa9 [224051.287332] RSP: 0018:c900036ef6e8 EFLAGS: 00010286 [224051.287333] RAX: 8808595eda00 RBX: 880856d36d00 RCX: 014000c1 [224051.287334] RDX: 0001 RSI: 880844418000 RDI: 880856d36d00 [224051.287334] RBP: c900036ef6f8 R08: 0001ccc3 R09: ea0021110620 [224051.287335] R10: R11: 88087effae90 R12: 8808590300a0 [224051.287335] R13: 0002 R14: fff0 R15: 0001 [224051.287336] FS: 7f1e4658b740() GS:88085e2c() knlGS: [224051.287337] CS: 0010 DS: ES: CR0: 80050033 [224051.287337] CR2: 7ffd7479 CR3: 00059e2f4000 CR4: 001406e0 [224051.287338] Call Trace: [224051.287339] i40e_vsi_open+0x7d/0x1e7 [224051.287341] i40e_open+0x4d/0xc3 [224051.287342] __dev_open+0x8b/0xcd [224051.287344] __dev_change_flags+0xa2/0x13d [224051.287346] dev_change_flags+0x20/0x53 [224051.287347] do_setlink+0x2d0/0xad6 [224051.287349] ? zone_statistics+0x5a/0x61 [224051.287350] ? get_page_from_freelist+0x4c8/0x627 [224051.287352] rtnl_newlink+0x391/0x6d6 [224051.287353] ? netdev_master_upper_dev_get+0xd/0x57 [224051.287354] ? rtnl_newlink+0x106/0x6d6 [224051.287356] ? alloc_pages_vma+0x8c/0x17a [224051.287357] ? pagevec_lru_move_fn+0x20/0xc1 [224051.287359] ? lru_cache_add_active_or_unevictable+0x27/0x7a [224051.287360] ? __handle_mm_fault+0x4c1/0x8ae [224051.287362] rtnetlink_rcv_msg+0x166/0x173 [224051.287363] ? __kmalloc_node_track_caller+0x11f/0x12f [224051.287365] ? __alloc_skb+0x89/0x175 [224051.287366] ? rtnl_newlink+0x6d6/0x6d6 [224051.287367] netlink_rcv_skb+0x57/0xa0 [224051.287369] rtnetlink_rcv+0x1e/0x25 [224051.287371] netlink_unicast+0x103/0x187 [224051.287372] netlink_sendmsg+0x28d/0x2ad [224051.287374] sock_sendmsg_nosec+0x12/0x1d [224051.287375] ___sys_sendmsg+0x19d/0x217 [224051.287377] ? kmem_cache_free+0x4b/0xf3 [224051.287492] ? alloc_pages_vma+0x147/0x17a [224051.287494] ? __page_set_anon_rmap+0x24/0x65 [224051.287495] ? get_page+0x9/0xf [224051.287496] ? __lru_cache_add+0x18/0x47 [224051.287498] ? __handle_mm_fault+0x4c1/0x8ae [224051.287499] __sys_sendmsg+0x40/0x5e [224051.287564] ? 
__sys_sendmsg+0x40/0x5e [224051.287566] SyS_sendmsg+0xd/0x17 [224051.287567] entry_SYSCALL_64_fastpath+0x13/0x94 [224051.287568] RIP: 0033:0x7f1e45cac620 [224051.287569] RSP: 002b:7ffd7478b4d8 EFLAGS: 0246 ORIG_RAX: 002e [224051.287570] RAX: ffda RBX: RCX: 7f1e45cac620 [224051.287571] RDX: RSI: 7ffd7478b520 RDI: 0003 [224051.287572] RBP: 7ffd7478b520 R08: 0001 R09: fefefeff77686d74 [224051.287572] R10: 05e6 R11: 0246 R12: 7ffd7478b560 [224051.287573] R13: 006724c0 R14: 7ffd747935e0 R15: [224051.287574] Code: 00 00 48 8b 7b 10 e8 41 f2 ff ff 48 c7 43 08 00 00 00 00 5b 5d c3 55 48 89 e5 41 54 53 48 83 7f 20 00 48 89 fb 4c 8b 67 10 74 02 <0f> ff 0f b7 7b 44 48 6b ff 18 e8 65 f5 ff ff 48 85 c0 48 89 43 [224051.287597] ---[ end trace a9810da52af61a5a ]--- [224051.287607] [ cut here ] [224051.287609] WARNING: CPU: 3 PID: 25031 at drivers/net/ethernet/intel/i40e/i40e_txrx.c:1248 i40e_setup_rx
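Until the descriptor-handling issue is fixed in the Linux i40e driver itself, a hedged interim mitigation, in line with the offload toggling already used elsewhere in these reports, is to keep TSO (and optionally GSO/GRO) disabled on the affected ports so the driver does not have to handle sparse TSO chains at all. This trades CPU for stability and is only a sketch; the interface names are placeholders:

# Sketch only: disable segmentation offloads on the i40e ports as a workaround
# for the hung-TX / MDD symptoms described above.
for dev in eno1 eno2; do           # placeholder i40e interface names
    ethtool -K "$dev" tso off gso off gro off
    ethtool -k "$dev" | egrep 'tcp-segmentation-offload|generic-(segmentation|receive)-offload'
done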
Re: [Intel-wired-lan] [jkirsher/net-queue PATCH] i40e: Add programming descriptors to cleaned_count
As of today it is in net.git: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/log/?qt=grep&q=i40e It will reach net-next later. Also, can you please tell me what firmware you are using with your NICs? Are those X710? Thanks Paweł On 2017-10-27 at 23:20, Pavlos Parissis wrote: On 23 October 2017 at 01:15, Paweł Staszewski wrote: Yes, I can confirm that after adding the patch [jkirsher/net-queue PATCH] i40e: Add programming descriptors to cleaned_count there is no memleak. Somehow this patch isn't present in the current net-next repo. Shouldn't it be there? Cheers, Pavlos
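To answer the firmware question in a reproducible way, ethtool reports the driver and NVM/firmware version per port; a small sketch (interface names are placeholders):

# Sketch only: dump driver, version and firmware for each i40e port.
for dev in enp2s0f0 enp2s0f1 enp3s0f0 enp3s0f1; do
    echo "== $dev =="
    ethtool -i "$dev"      # prints driver, version, firmware-version, bus-info
done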
Re: [jkirsher/net-queue PATCH] i40e: Add programming descriptors to cleaned_count
Yes can confirm that after adding patch: [jkirsher/net-queue PATCH] i40e: Add programming descriptors to cleaned_count There is no memleak. W dniu 2017-10-22 o 20:01, Anders K. Pedersen | Cohaesio pisze: On lør, 2017-10-21 at 18:12 -0700, Alexander Duyck wrote: From: Alexander Duyck This patch updates the i40e driver to include programming descriptors in the cleaned_count. Without this change it becomes possible for us to leak memory as we don't trigger a large enough allocation when the time comes to allocate new buffers and we end up overwriting a number of rx_buffers equal to the number of programming descriptors we encountered. Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming status") Signed-off-by: Alexander Duyck This patch solves the remaining memory leak we've seen, so Tested-by: Anders K. Pedersen Regards, Anders
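A quick way to check whether the cleaned_count fix has already reached the tree you are building from; the commit subject is taken from this thread, and the grep-based check is just a sketch:

# Sketch only: verify the fix is present in the currently checked-out kernel tree.
git log --oneline --grep='i40e: Add programming descriptors to cleaned_count' \
    drivers/net/ethernet/intel/i40e/

# Or, once the commit id is known from net.git:
# git merge-base --is-ancestor <commit-id> HEAD && echo "fix is included"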
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-19 o 01:56, Paweł Staszewski pisze: W dniu 2017-10-19 o 01:51, Paweł Staszewski pisze: W dniu 2017-10-19 o 01:37, Alexander Duyck pisze: On Wed, Oct 18, 2017 at 4:22 PM, Paweł Staszewski wrote: W dniu 2017-10-19 o 00:58, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:50, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:20, Paweł Staszewski pisze: W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 e
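Alexander's bisect request is more tractable than it might look: git bisect needs roughly log2 of the number of commits in the window, typically around 13 to 15 build-and-test cycles for one release window. A sketch of the workflow on a test router, using v4.11 as the known-good baseline (4.11.12 was reported leak-free) and v4.12 as the first bad release; the traffic test itself is up to the operator:

# Sketch only: bisect the i40e memory leak between the last good and first bad kernels.
git bisect start
git bisect bad  v4.12           # first mainline release where the leak was seen
git bisect good v4.11           # known-good baseline from this thread
# For each step: build, boot, run traffic for ~1 hour, watch /proc/meminfo, then
#   git bisect good     # if memory stays flat
#   git bisect bad      # if it keeps growing
# Repeat until git prints the first bad commit, then save the record:
git bisect log > i40e-memleak-bisect.log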
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-19 o 01:51, Paweł Staszewski pisze: W dniu 2017-10-19 o 01:37, Alexander Duyck pisze: On Wed, Oct 18, 2017 at 4:22 PM, Paweł Staszewski wrote: W dniu 2017-10-19 o 00:58, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:50, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:20, Paweł Staszewski pisze: W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-19 o 01:37, Alexander Duyck pisze: On Wed, Oct 18, 2017 at 4:22 PM, Paweł Staszewski wrote: W dniu 2017-10-19 o 00:58, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:50, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:20, Paweł Staszewski pisze: W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 12
Re: Linux 4.12+ memory leak on router with i40e NICs
On 2017-10-19 at 01:29, Alexander Duyck wrote: On Mon, Oct 16, 2017 at 10:51 PM, Vitezslav Samel wrote: On Tue, Oct 17, 2017 at 01:34:29AM +0200, Paweł Staszewski wrote: On 2017-10-16 at 18:26, Paweł Staszewski wrote: On 2017-10-16 at 13:20, Pavlos Parissis wrote: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify, is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree, how long ago did you pull it, since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests tonight with the "net" git tree where this patch is included, starting from 0:00 CET :) Upgraded, and it looks like the problem is not solved with that patch. Currently running the system with the https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel, still about 0.5GB of memory is leaking somewhere. I can also confirm that the latest kernel where memory is not leaking (with the i40e driver and Intel 710 cards) is 4.11.12. With kernel 4.11.12 - after an hour, no change in memory usage. I also checked ixgbe instead of i40e with the same net.git kernel and there is no memleak - after an hour, the same memory usage - so this is 100% an i40e driver problem. I have (probably) the same problem here but with X520 cards: booting 4.12.x gives me an oops after circa 20 minutes of our workload. Booting 4.9.y is OK. This machine is in production so any testing is very limited. The machine was stable for >2 months (on the desk, before it got to production) with 4.12.8, but with no traffic on the X520 cards. Cheers, Vita Sorry, but it can't be the same issue since we are discussing a different driver (i40e) running different hardware (X710 or XL170). You might want to start a new thread for your issue, and/or if possible file a bug on e1000.sf.net. Thanks. - Alex Sorry, but bugs reported on e1000.sf.net are handled with long delays - some only after 6 or more months. When I reported my first bug there, I got a reply after a year, about no activity :):) haha - and the bug reported there is still active :) It is better for me now to change the NICs (certainly cheaper from the clients' perspective :) ) to Mellanox, or just to replace them and use ixgbe, which does not have this bug (Mellanox and ixgbe show no such bug - I have many servers with them with the same configuration, and only the one with i40e, with the same configuration, has the memleak). If nobody from Intel wants to reproduce this - cool - then it is not my problem but Intel's :) - there are many good NICs to use now, like Mellanox, or one can just stick with many 10G ports based on ixgbe, which is a really good driver. But really? The Intel guys have no XL710 cards? I don't want to buy more buggy cards just to do kernel bisects, sorry. To bisect this bug properly you would need maybe 200/300 bisect steps, and maybe 30 minutes to confirm each one - so count how much time that is; more than 100 Mellanox cards in price, maybe :) so imagine what I will do :) Thanks Paweł
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-19 o 00:58, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:50, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:20, Paweł Staszewski pisze: W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same l
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-19 o 00:50, Paweł Staszewski pisze: W dniu 2017-10-19 o 00:20, Paweł Staszewski pisze: W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-19 o 00:20, Paweł Staszewski pisze: W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change fro
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-10-18 at 23:54, Eric Dumazet wrote:
On Wed, 2017-10-18 at 23:49 +0200, Paweł Staszewski wrote:
How far is it from being applied to the kernel? So far I'm using this on all my servers, for about 3 months now, without problems.
It is a hack, and does not properly support bonding/team. (If the real_dev->priv_flags IFF_XMIT_DST_RELEASE bit changes, we want to update all the vlans at the same time.)
We need something more sophisticated, and I have had no time to spend on this topic recently.
OK
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-18 o 17:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 e
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
W dniu 2017-09-21 o 23:41, Florian Fainelli pisze: On 09/21/2017 02:26 PM, Paweł Staszewski wrote: W dniu 2017-08-15 o 11:11, Paweł Staszewski pisze: diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644 --- a/net/8021q/vlan_netlink.c +++ b/net/8021q/vlan_netlink.c @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev, vlan->vlan_proto = proto; vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]); vlan->real_dev = real_dev; +dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE); vlan->flags = VLAN_FLAG_REORDER_HDR; err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id); Any plans for this patch to go normal into the kernel ? Would not this apply to pretty much any stacked device setup though? It seems like any network device that just queues up its packet on another physical device for actual transmission may need that (e.g: DSA, bond, team, more.?) How far it is from applying this to the kernel ? So far im using this on all my servers from about 3 months now without problems
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 16:08, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 13:52, Paweł Staszewski pisze: W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off a
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 13:05, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 1024 tx-usecs 128
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 12:59, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 1024 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 12:51, Paweł Staszewski pisze: W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 1024 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done MEMLEAK: 4 MB/10sec 3 MB/10sec 4 MB/
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 12:20, Paweł Staszewski pisze: W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. 
- Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel With settings like this: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro on ethtool -K $i tso on done Server is leaking about 4-6MB per each 10 seconds MEMLEAK: 5 MB/10sec 6 MB/10sec 4 MB/10sec 4 MB/10sec Other settings TSO/GRO off ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done Same leak about 5MB per 10 seconds MEMLEAK: 5 MB/10sec 5 MB/10sec 5 MB/10sec Other settings rx-usecs change from 512 to 1024: ifc='enp2s0f0 enp2s0f1 enp2s0f2 enp2s0f3 enp3s0f0 enp3s0f1 enp3s0f2 enp3s0f3' for i in $ifc do ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 1024 tx-usecs 128 ethtool -K $i gro off ethtool -K $i tso off done MEMLEAK: 4 MB/10sec 3 MB/10sec 4 MB/10sec 4 MB/10sec So memleak have something to do with
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 11:48, Paweł Staszewski pisze: W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. - Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off Also after TSO/GRO on there is memory usage change - and leaking faster Below image from memory usage before change with TSO/GRO OFF and after enabling TSO/GRO https://ibb.co/dTqBY6 Thanks Pawel
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 02:44, Paweł Staszewski pisze: W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. - Alex Also forgoto to add errors for i40e when driver initialize: [ 15.760569] i40e :02:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.365587] i40e :03:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.367686] i40e :02:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.368816] i40e :03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.369877] i40e :03:00.2: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.370941] i40e :02:00.3: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.372005] i40e :02:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on [ 16.373029] i40e :03:00.1: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on some params that are set for this nic's ip link set up dev $i ethtool -A $i autoneg off rx off tx off ethtool -G $i rx 1024 tx 2048 ip link set $i txqueuelen 1000 ethtool -C $i adaptive-rx off adaptive-tx off rx-usecs 512 tx-usecs 128 ethtool -L $i combined 6 #ethtool -N $i rx-flow-hash udp4 sdfn ethtool -K $i ntuple on ethtool -K $i gro off ethtool -K $i tso off
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-17 o 01:56, Alexander Duyck pisze: On Mon, Oct 16, 2017 at 4:34 PM, Paweł Staszewski wrote: W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem. So how long was the run to get the .5GB of memory leaking? 1 hour Also is there any chance of you being able to bisect to determine where the memory leak was introduced since as you pointed out it didn't exist in 4.11.12 so odds are it was introduced somewhere between 4.11 and the latest kernel release. Can be hard cause currently need to back to 4.11.12 - this is production host/router Will try to find some free/test router for tests/bicects with i40e driver (intel 710 cards) Thanks. - Alex
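The bisection Alex asks for here is the standard kernel workflow between the last known-good and first known-bad releases. A minimal sketch against the mainline tags discussed in the thread (v4.11 good, v4.12 bad); kernel config, install and boot steps depend on the box and are omitted:

git bisect start
git bisect bad v4.12
git bisect good v4.11
make olddefconfig && make -j"$(nproc)" bzImage modules
# install the test kernel, reboot, push production-like traffic for ~1 hour
# while watching memory usage, then mark the result:
git bisect good      # or: git bisect bad
# repeat until git prints the first bad commit, then clean up with: git bisect reset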
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-16 o 18:26, Paweł Staszewski pisze: W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :) Upgraded and looks like problem is not solved with that patch Currently running system with https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/ kernel Still about 0.5GB of memory is leaking somewhere Also can confirm that the latest kernel where memory is not leaking (with use i40e driver intel 710 cards) is 4.11.12 With kernel 4.11.12 - after hour no change in memory usage. also checked that with ixgbe instead of i40e with same net.git kernel there is no memleak - after hour same memory usage - so for 100% this is i40e driver problem.
Re: Linux 4.12+ memory leak on router with i40e NICs
W dniu 2017-10-16 o 13:20, Pavlos Parissis pisze: On 15/10/2017 02:58 πμ, Alexander Duyck wrote: Hi Pawel, To clarify is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree how long ago was it you pulled it since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 Do you know when it is going to be available on net-next and linux-stable repos? Cheers, Pavlos I will make some tests today night with "net" git tree where this patch is included. Starting from 0:00 CET :)
Re: Linux 4.12+ memory leak on router with i40e NICs
16:53 0:00 [kworker/4:3]
root 20108 0.0 0.0 0 0 ? I 16:54 0:00 [kworker/3:2]
root 20109 0.0 0.0 0 0 ? I 16:54 0:00 [kworker/3:3]
root 20110 0.0 0.0 0 0 ? I 16:54 0:00 [kworker/0:6]
root 20217 0.0 0.0 0 0 ? I 16:55 0:00 [kworker/1:0]
root 20219 0.0 0.0 0 0 ? I 16:56 0:00 [kworker/9:1]
root 20222 0.0 0.0 0 0 ? I 16:56 0:00 [kworker/9:3]
root 20354 0.0 0.0 0 0 ? I 16:57 0:00 [kworker/5:0]
root 20355 0.0 0.0 0 0 ? I 16:57 0:00 [kworker/5:3]
root 20814 0.0 0.0 0 0 ? I 16:57 0:00 [kworker/u24:2]
root 26845 0.0 0.0 0 0 ? I 15:40 0:00 [kworker/7:2]
root 26979 0.0 0.0 0 0 ? I 15:43 0:00 [kworker/0:3]
root 27375 0.0 0.0 0 0 ? I 15:48 0:00 [kworker/0:2]

but free -m:

              total        used        free      shared  buff/cache   available
Mem:          32113       18345       13598           0         169       13419
Swap:          3911           0        3911

shows less and less free - about 0.5 MB per hour.

It looks like this commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972 is not included in: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

OK, I will upgrade tomorrow and check with that fix.

On 2017-10-15 at 02:58, Alexander Duyck wrote:
Hi Pawel,
To clarify, is that Dave Miller's tree or Linus's that you are talking about? If it is Dave's tree, how long ago did you pull it, since I think the fix was just pushed by Jeff Kirsher a few days ago. The issue should be fixed in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972
Thanks.
- Alex

On Sat, Oct 14, 2017 at 3:03 PM, Paweł Staszewski wrote:
Forgot to add - these graphs were tested with kernel 4.14-rc4-next.

On 2017-10-15 at 00:00, Paweł Staszewski pisze:
Same problem here. The only difference is changing Intel 82599 to X710, and there is a memleak.
Memory with the ixgbe driver over time - same config, same kernel.
Changed the NICs to X710 (i40e driver) - this is the only change - and memory over time:
There is no process that is eating the memory - it looks like there is some problem with the i40e driver, but that is not a surprise :) This driver is really buggy in many ways - most tickets I opened on the e1000e SourceForge tracker have had no reply for a year or more, or if somebody replies after a year, they close the ticket after one day with a note about no activity :)

On 2017-10-05 at 07:19, Anders K. Pedersen | Cohaesio wrote:
On Wed, 2017-10-04 at 08:32 -0700, Alexander Duyck wrote:
On Wed, Oct 4, 2017 at 5:56 AM, Anders K. Pedersen | Cohaesio wrote:
Hello,
After updating one of our Linux based routers to kernel 4.13 it began leaking memory quite fast (about 1 GB every half hour). To narrow it down we tried various kernel versions and found that 4.11.12 is okay, while 4.12 also leaks, so we did a bisection between 4.11 and 4.12. The first bisection ended at "[6964e53f55837b0c49ed60d36656d2e0ee4fc27b] i40e: fix handling of HW ATR eviction", which fixes some flag handling that was broken by 47994c119a36 "i40e: remove hw_disabled_flags in favor of using separate flag bits", so I did a second bisection, where I added 6964e53f5583 "i40e: fix handling of HW ATR eviction" to the steps that had 47994c119a36 "i40e: remove hw_disabled_flags in favor of using separate flag bits" in them. The second bisection ended at "[0e626ff7ccbfc43c6cc4aeea611c40b899682382] i40e: Fix support for flow director programming status", where I don't see any obvious problems, so I'm hoping for some assistance.
The router is a PowerEdge R730 server (Haswell based) with three Intel NICs (all using the i40e driver):
X710 quad port 10 GbE SFP+: eth0 eth1 eth2 eth3
X710 quad port 10 GbE SFP+: eth4 eth5 eth6 eth7
XL710 dual port 40 GbE QSFP+: eth8 eth9

The NICs are aggregated with LACP with the team driver:
team0: eth9 (40 GbE selected primary), and eth3, eth7 (10 GbE non-selected backups)
team1: eth0, eth1, eth4, eth5 (all 10 GbE selected)

team0 is used for internal networks and has one untagged and four tagged VLAN interfaces, while team1 has an external uplink connection without any VLANs. The router runs an eBGP session on team1 to one of our uplinks, and iBGP via team0 to our other border routers. It also runs OSPF on the internal VLANs on team0. One thing I've noticed is that when OSPF is not announcing a default gateway to the internal networks, so there is almost no traffic coming in on team0 and out on team1, but still
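Earlier in this message Pawel wonders whether the i40e fix commit is already included in linux-next. One quick way to check that from a local kernel clone; the sha is taken from the commit URL quoted above, while the remote names are just placeholders:

git remote add net https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
git remote add linux-next https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
git fetch net && git fetch linux-next
git branch -r --contains 2b9478ffc550    # lists the remote branches that already carry the fix

If linux-next/master does not show up in the output, the fix has not been merged there yet.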
Re: Latest kernel net-next - 4.14-rc1+ / WARNING: CPU: 16 PID: 0 at net/sched/sch_hfsc.c:1385 hfsc_dequeue+0x241/0x269
A little more in trace: [49519.600903] [ cut here ] [49519.600908] WARNING: CPU: 7 PID: 31426 at net/sched/sch_hfsc.c:1385 hfsc_dequeue+0x241/0x269 [49519.600909] Modules linked in: ipmi_si x86_pkg_temp_thermal [49519.600914] CPU: 7 PID: 31426 Comm: syslog-ng Tainted: G W 4.14.0-rc1+ #10 [49519.600915] task: 88086d07c100 task.stack: c90006d54000 [49519.600917] RIP: 0010:hfsc_dequeue+0x241/0x269 [49519.600918] RSP: 0018:88087fa439f8 EFLAGS: 00010246 [49519.600919] RAX: RBX: 88085b8af148 RCX: 0018 [49519.600920] RDX: RSI: RDI: 88085b8af440 [49519.600922] RBP: 88087fa43a20 R08: 1600 R09: [49519.600923] R10: 88087fa43960 R11: 880859d50a00 R12: 88085b8af000 [49519.600924] R13: 00b4266fab9b R14: 0001 R15: 88085b8af440 [49519.600925] FS: 7fad63a35700() GS:88087fa4() knlGS: [49519.600926] CS: 0010 DS: ES: CR0: 80050033 [49519.600927] CR2: 7f10d4a90098 CR3: 00046bb1c005 CR4: 001606e0 [49519.600928] Call Trace: [49519.600929] [49519.600932] __qdisc_run+0xed/0x293 [49519.600935] __dev_queue_xmit+0x2d2/0x4b3 [49519.600936] ? eth_header+0x27/0xab [49519.600938] dev_queue_xmit+0xb/0xd [49519.600939] ? dev_queue_xmit+0xb/0xd [49519.600943] neigh_connected_output+0x9b/0xb2 [49519.600948] ip_finish_output2+0x24b/0x28f [49519.600952] ? statistic_mt+0x30/0x72 [49519.600954] ip_finish_output+0x101/0x10d [49519.600957] ip_output+0x56/0xa9 [49519.600959] ip_forward_finish+0x53/0x58 [49519.600961] ip_forward+0x2b2/0x308 [49519.600962] ? ip_frag_mem+0xf/0xf [49519.600964] ip_rcv_finish+0x27c/0x287 [49519.600965] ip_rcv+0x2b0/0x300 [49519.600968] ? vlan_do_receive+0x49/0x294 [49519.600970] __netif_receive_skb_core+0x312/0x496 [49519.600972] ? tk_clock_read+0xc/0xe [49519.600973] __netif_receive_skb+0x18/0x57 [49519.600974] ? __netif_receive_skb+0x18/0x57 [49519.600975] netif_receive_skb_internal+0x4b/0xa1 [49519.600977] napi_gro_complete+0x7a/0x7d [49519.600977] napi_gro_flush+0x3b/0x66 [49519.600979] napi_complete_done+0x4b/0xa8 [49519.600983] ixgbe_poll+0x90c/0xeaa [49519.600985] net_rx_action+0xd3/0x22d [49519.600988] __do_softirq+0xe4/0x23a [49519.600991] irq_exit+0x4d/0x5b [49519.600992] do_IRQ+0x96/0xae [49519.600996] common_interrupt+0x90/0x90 [49519.600997] [49519.601001] RIP: 0010:do_con_write+0x2d0/0x1b13 [49519.601002] RSP: 0018:c90006d57bc0 EFLAGS: 0282 ORIG_RAX: ffa2 [49519.601004] RAX: 004b RBX: 004b RCX: fffd [49519.601005] RDX: 004d RSI: 0003 RDI: 81e751b8 [49519.601006] RBP: c90006d57c68 R08: R09: [49519.601007] R10: 88086b64f807 R11: 88086d07c100 R12: fffd [49519.601008] R13: 88046c7bd400 R14: 88086b64f800 R15: 004b [49519.601011] ? _raw_spin_lock+0x9/0xb [49519.601013] con_write+0xe/0x20 [49519.601018] n_tty_write+0x101/0x3f5 [49519.601021] ? init_wait_entry+0x29/0x29 [49519.601024] tty_write+0x1a9/0x228 [49519.601026] ? n_tty_flush_buffer+0x4c/0x4c [49519.601029] do_loop_readv_writev+0x6f/0xa1 [49519.601031] do_iter_write+0x8e/0xb8 [49519.601032] vfs_writev+0x77/0xad [49519.601034] ? __vfs_write+0x21/0xa0 [49519.601037] ? __fget+0x25/0x56 [49519.601038] ? __fget_light+0x3b/0x46 [49519.601039] ? __fdget+0xe/0x10 [49519.601040] do_writev+0x4f/0xa1 [49519.601041] ? 
do_writev+0x4f/0xa1 [49519.601043] SyS_writev+0xb/0xd [49519.601044] entry_SYSCALL_64_fastpath+0x13/0x94 [49519.601046] RIP: 0033:0x7fad66233da9 [49519.601046] RSP: 002b:7fad63a32ac0 EFLAGS: 0293 ORIG_RAX: 0014 [49519.601047] RAX: ffda RBX: 01eddd08 RCX: 7fad66233da9 [49519.601048] RDX: 0001 RSI: 01edeab0 RDI: 0011 [49519.601049] RBP: 01eddd08 R08: R09: 7fad662842d0 [49519.601049] R10: R11: 0293 R12: 7fad63a326f0 [49519.601050] R13: 7fad4c0045c0 R14: 7fad4c0045c0 R15: 01eddcb0 [49519.601051] Code: f6 48 3d 90 00 00 00 74 04 48 8b 70 70 49 8b 84 24 68 02 00 00 48 85 c0 74 0c 48 39 f0 72 24 48 85 f6 75 09 eb 1d 48 85 f6 75 02 <0f> ff 49 8d bc 24 48 04 00 00 48 c1 e6 06 e8 a9 62 ff ff e9 eb [49519.601065] ---[ end trace 8558fb6f1ca3beb0 ]--- W dniu 2017-09-26 o 14:00, Paweł Staszewski pisze: [50102.787542] [ cut here ] [50102.787545] WARNING: CPU: 16 PID: 0 at net/sched/sch_hfsc.c:1385 hfsc_dequeue+0x241/0x269 [50102.787545] Modules linked in: ipmi_si x86_pkg_temp_thermal [50102.787547] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G W 4.14.0-rc1+ #10 [50102.787548] task: 88046d44 task.stack: c900032e [50102.787549] RIP: 0010:hfsc_dequeue+0x241/0x269 [50102.78755
Latest kernel net-next - 4.14-rc1+ / WARNING: CPU: 16 PID: 0 at net/sched/sch_hfsc.c:1385 hfsc_dequeue+0x241/0x269
[50102.787542] [ cut here ] [50102.787545] WARNING: CPU: 16 PID: 0 at net/sched/sch_hfsc.c:1385 hfsc_dequeue+0x241/0x269 [50102.787545] Modules linked in: ipmi_si x86_pkg_temp_thermal [50102.787547] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G W 4.14.0-rc1+ #10 [50102.787548] task: 88046d44 task.stack: c900032e [50102.787549] RIP: 0010:hfsc_dequeue+0x241/0x269 [50102.787550] RSP: 0018:88046fc83eb0 EFLAGS: 00010246 [50102.787551] RAX: RBX: 880456309948 RCX: 0018 [50102.787551] RDX: RSI: RDI: 880456309c40 [50102.787552] RBP: 88046fc83ed8 R08: 0001c000 R09: 0100 [50102.787553] R10: 88046fc83e98 R11: 0003 R12: 880456309800 [50102.787553] R13: 00b6459156dd R14: 0001 R15: 880456309c40 [50102.787554] FS: () GS:88046fc8() knlGS: [50102.787555] CS: 0010 DS: ES: CR0: 80050033 [50102.787556] CR2: 7fc764f21090 CR3: 00085844a000 CR4: 001606e0 [50102.787556] Call Trace: [50102.787557] [50102.787558] __qdisc_run+0xed/0x293 [50102.787560] net_tx_action+0xeb/0x18b [50102.787562] __do_softirq+0xe4/0x23a [50102.787564] irq_exit+0x4d/0x5b [50102.787565] smp_apic_timer_interrupt+0xc0/0xfa [50102.787566] apic_timer_interrupt+0x90/0xa0 [50102.787566] [50102.787568] RIP: 0010:cpuidle_enter_state+0x134/0x189 [50102.787569] RSP: 0018:c900032e3ea0 EFLAGS: 0246 ORIG_RAX: ff10 [50102.787570] RAX: 2d9176d7d9f4 RBX: 0002 RCX: 001f [50102.787570] RDX: RSI: 0010 RDI: [50102.787571] RBP: c900032e3ed0 R08: ffd8 R09: 0003 [50102.787572] R10: c900032e3e70 R11: 88046fc98e50 R12: 88046c234400 [50102.787572] R13: 2d9176d7d9f4 R14: 0002 R15: 2d9176d6e845 [50102.787575] cpuidle_enter+0x12/0x14 [50102.787576] do_idle+0x113/0x16b [50102.787578] cpu_startup_entry+0x1a/0x1f [50102.787580] start_secondary+0xea/0xed [50102.787581] secondary_startup_64+0xa5/0xa5 [50102.787582] Code: f6 48 3d 90 00 00 00 74 04 48 8b 70 70 49 8b 84 24 68 02 00 00 48 85 c0 74 0c 48 39 f0 72 24 48 85 f6 75 09 eb 1d 48 85 f6 75 02 <0f> ff 49 8d bc 24 48 04 00 00 48 c1 e6 06 e8 a9 62 ff ff e9 eb [50102.787602] ---[ end trace 8558fb6f1ca3beb2 ]---
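The warning fires inside hfsc_dequeue(), so an HFSC root qdisc sits on the transmit path of the affected interface. The actual tc configuration is not shown anywhere in this thread; the following is only a generic illustration of the kind of setup that exercises this code path, with the interface name and rates as placeholders:

DEV=eth0
tc qdisc add dev $DEV root handle 1: hfsc default 10
tc class add dev $DEV parent 1: classid 1:10 hfsc sc rate 500mbit ul rate 1gbit
tc qdisc add dev $DEV parent 1:10 handle 10: sfq perturb 10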
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-09-21 at 23:41, Florian Fainelli wrote: On 09/21/2017 02:26 PM, Paweł Staszewski wrote: On 2017-08-15 at 11:11, Paweł Staszewski wrote: diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644 --- a/net/8021q/vlan_netlink.c +++ b/net/8021q/vlan_netlink.c @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev, vlan->vlan_proto = proto; vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]); vlan->real_dev = real_dev; +dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE); vlan->flags = VLAN_FLAG_REORDER_HDR; err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id); Any plans for this patch to go into the kernel properly? Wouldn't this apply to pretty much any stacked device setup, though? It seems like any network device that just queues up its packets on another physical device for actual transmission may need that (e.g. DSA, bond, team, more?). Some devices like bond have it. Maybe vlans were just not taken into account when the first patch was done. I did not check them all :) But I know Eric will :)
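For context, the reason propagating IFF_XMIT_DST_RELEASE to the vlan device helps is visible in the transmit path. The sketch below paraphrases the dst handling in __dev_queue_xmit() (net/core/dev.c) of that era; the function name xmit_dst_release_sketch and the exact wording are illustrative, not the kernel's actual code. When the flag is set, the skb's dst is dropped while it is still hot in the CPU cache; otherwise every packet sent through the device takes an extra atomic reference on the dst, which is the per-packet cost the one-line vlan change above avoids whenever the real device does not need the dst either.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/dst.h>

/* Paraphrase of the dst handling in __dev_queue_xmit(); a sketch, not the
 * actual kernel function.
 */
static void xmit_dst_release_sketch(struct net_device *dev, struct sk_buff *skb)
{
	/* If the device/qdisc does not need skb->dst, release it right now
	 * while it is still hot in this CPU's cache.
	 */
	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
		skb_dst_drop(skb);
	else
		skb_dst_force(skb);	/* one atomic op on dst->__refcnt per packet */
}

This also answers the stacked-device question in spirit: any virtual device that only hands packets to a lower device can keep the flag set when all of its lower devices have it, which is presumably what bonding already does and what the vlan patch adds.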
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-09-21 at 23:34, Eric Dumazet wrote: On Thu, 2017-09-21 at 23:26 +0200, Paweł Staszewski wrote: On 2017-08-15 at 11:11, Paweł Staszewski wrote: diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644 --- a/net/8021q/vlan_netlink.c +++ b/net/8021q/vlan_netlink.c @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev, vlan->vlan_proto = proto; vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]); vlan->real_dev = real_dev; +dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE); vlan->flags = VLAN_FLAG_REORDER_HDR; err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id); Any plans for this patch to go into the kernel properly? So far I have been using it for about 3 weeks on all my Linux based routers - and no problems. Yes, I was about to submit it, as I mentioned to you a few hours ago ;) Yes, I saw your point 2) in the previous emails :) But there was no patch for it in the previous reply, so I was thinking that maybe there were too many things to do and you forgot about it :) Thanks Paweł
Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on
On 2017-08-15 at 11:11, Paweł Staszewski wrote: diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644 --- a/net/8021q/vlan_netlink.c +++ b/net/8021q/vlan_netlink.c @@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev, vlan->vlan_proto = proto; vlan->vlan_id = nla_get_u16(data[IFLA_VLAN_ID]); vlan->real_dev = real_dev; + dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE); vlan->flags = VLAN_FLAG_REORDER_HDR; err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id); Any plans for this patch to go into the kernel properly? So far I have been using it for about 3 weeks on all my Linux based routers - and no problems.
Re: Latest net-next from GIT panic
W dniu 2017-09-21 o 13:31, Paweł Staszewski pisze: W dniu 2017-09-21 o 13:03, Eric Dumazet pisze: OK we have two problems here 1) We need to unify skb_dst_force() ( for net tree ) 2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from lower device. This will considerably help your performance. For 1), this is what I had in mind, can you try it ? Thanks a lot ! diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } @@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb __skb_dst_copy(nskb, oskb->_skb_refdst); } -/** - * skb_dst_force - makes sure skb dst is refcounted - * @skb: buffer - * - * If dst is not yet refcounted, let's do it - */ -static inline void skb_dst_force(struct sk_buff *skb) -{ - if (skb_dst_is_noref(skb)) { - WARN_ON(!rcu_read_lock_held()); - skb->_skb_refdst &= ~SKB_DST_NOREF; - dst_clone(skb_dst(skb)); - } -} - /** * dst_hold_safe - Take a reference on a dst if possible * @dst: pointer to dst entry @@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb) } } +/** + * skb_dst_force - makes sure skb dst is refcounted + * @skb: buffer + * + * If dst is not yet refcounted, let's do it + */ +static inline void skb_dst_force(struct sk_buff *skb) +{ + if (skb_dst_is_noref(skb)) { + struct dst_entry *dst = skb_dst(skb); + + WARN_ON(!rcu_read_lock_held()); + if (!dst_hold_safe(dst)) + dst = NULL; + skb->_skb_refdst = (unsigned long)dst; + } +} /** * __skb_tunnel_rx - prepare skb for rx reinsert Patch applied - soo far no problems - and no warnings in dmesg ok after adding patch all is working from now for about 1 hour of normal traffic witc all bgp sessions connected and about 600k prefixes in kernel.
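For context, the unified skb_dst_force() above relies entirely on dst_hold_safe(). The snippet below approximates what that helper looked like in include/net/dst.h after the DST_NOCACHE removal; the _sketch suffix is added here and this is an illustration, not a verbatim copy. With the old dst garbage collector gone, a dst whose refcount has already reached zero is about to be freed, so the only safe way to take a reference is one that refuses the 0 -> 1 transition.

#include <net/dst.h>

/* Approximation of dst_hold_safe(): take a reference only if the refcount
 * is still non-zero, i.e. never resurrect a dying dst.
 */
static inline bool dst_hold_safe_sketch(struct dst_entry *dst)
{
	return atomic_inc_not_zero(&dst->__refcnt);
}

That is also why the rewritten skb_dst_force() clears the skb's dst when the hold fails: the packet then simply carries no dst instead of pointing at memory that is being torn down.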
Re: Latest net-next from GIT panic
W dniu 2017-09-21 o 13:03, Eric Dumazet pisze: OK we have two problems here 1) We need to unify skb_dst_force() ( for net tree ) 2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from lower device. This will considerably help your performance. For 1), this is what I had in mind, can you try it ? Thanks a lot ! diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } @@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb __skb_dst_copy(nskb, oskb->_skb_refdst); } -/** - * skb_dst_force - makes sure skb dst is refcounted - * @skb: buffer - * - * If dst is not yet refcounted, let's do it - */ -static inline void skb_dst_force(struct sk_buff *skb) -{ - if (skb_dst_is_noref(skb)) { - WARN_ON(!rcu_read_lock_held()); - skb->_skb_refdst &= ~SKB_DST_NOREF; - dst_clone(skb_dst(skb)); - } -} - /** * dst_hold_safe - Take a reference on a dst if possible * @dst: pointer to dst entry @@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb) } } +/** + * skb_dst_force - makes sure skb dst is refcounted + * @skb: buffer + * + * If dst is not yet refcounted, let's do it + */ +static inline void skb_dst_force(struct sk_buff *skb) +{ + if (skb_dst_is_noref(skb)) { + struct dst_entry *dst = skb_dst(skb); + + WARN_ON(!rcu_read_lock_held()); + if (!dst_hold_safe(dst)) + dst = NULL; + skb->_skb_refdst = (unsigned long)dst; + } +} /** *__skb_tunnel_rx - prepare skb for rx reinsert Patch applied - soo far no problems - and no warnings in dmesg
Re: Latest net-next from GIT panic
W dniu 2017-09-21 o 13:12, Paweł Staszewski pisze: W dniu 2017-09-21 o 13:03, Eric Dumazet pisze: On Thu, 2017-09-21 at 11:06 +0200, Paweł Staszewski wrote: W dniu 2017-09-21 o 03:17, Eric Dumazet pisze: On Wed, 2017-09-20 at 18:09 -0700, Wei Wang wrote: Thanks very much Pawel for the feedback. I was looking into the code (specifically IPv4 part) and found that in free_fib_info_rcu(), we call free_nh_exceptions() without holding the fnhe_lock. I am wondering if that could cause some race condition on fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the same dst could be happening. But as we call free_fib_info_rcu() only after the grace period, and the lookup code which could potentially modify fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems fine... Hi Pawel, Could you try the following debug patch on top of net-next branch and reproduce the issue check if there are warning msg showing? diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a352..82aff41c6f63 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } Thanks. Wei Yes, we believe skb_dst_force() and skb_dst_force_safe() should be unified (to the 'safe' version) We no longer have gc to protect from 0 -> 1 transition of dst refcount. After adding patch from Wei https://bugzilla.kernel.org/show_bug.cgi?id=197005#c14 OK we have two problems here 1) We need to unify skb_dst_force() ( for net tree ) 2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from lower device. This will considerably help your performance. For 1), this is what I had in mind, can you try it ? Thanks a lot ! 
diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } @@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb __skb_dst_copy(nskb, oskb->_skb_refdst); } -/** - * skb_dst_force - makes sure skb dst is refcounted - * @skb: buffer - * - * If dst is not yet refcounted, let's do it - */ -static inline void skb_dst_force(struct sk_buff *skb) -{ - if (skb_dst_is_noref(skb)) { - WARN_ON(!rcu_read_lock_held()); - skb->_skb_refdst &= ~SKB_DST_NOREF; - dst_clone(skb_dst(skb)); - } -} - /** * dst_hold_safe - Take a reference on a dst if possible * @dst: pointer to dst entry @@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb) } } +/** + * skb_dst_force - makes sure skb dst is refcounted + * @skb: buffer + * + * If dst is not yet refcounted, let's do it + */ +static inline void skb_dst_force(struct sk_buff *skb) +{ + if (skb_dst_is_noref(skb)) { + struct dst_entry *dst = skb_dst(skb); + + WARN_ON(!rcu_read_lock_held()); + if (!dst_hold_safe(dst)) + dst = NULL; + skb->_skb_refdst = (unsigned long)dst; + } +} /** * __skb_tunnel_rx - prepare skb for rx reinsert Thanks What is weird i have this part in my net-next from git: /** * skb_dst_force_safe - makes sure skb dst is refcounted * @skb: buffer * * If dst is not yet refcounted and not destroyed, grab a ref on it. */ static inline void skb_dst_force_safe(struct sk_buff *skb) { if (skb_dst_is_noref(skb)) { struct dst_entry *dst = skb_dst(skb); if (!dst_hold_safe(dst)) dst = NULL; skb->_skb_refdst = (unsigned long)dst; } } ok the difference is skb_dst_force_safe not skb_dst_force
Re: Latest net-next from GIT panic
W dniu 2017-09-21 o 13:03, Eric Dumazet pisze: On Thu, 2017-09-21 at 11:06 +0200, Paweł Staszewski wrote: W dniu 2017-09-21 o 03:17, Eric Dumazet pisze: On Wed, 2017-09-20 at 18:09 -0700, Wei Wang wrote: Thanks very much Pawel for the feedback. I was looking into the code (specifically IPv4 part) and found that in free_fib_info_rcu(), we call free_nh_exceptions() without holding the fnhe_lock. I am wondering if that could cause some race condition on fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the same dst could be happening. But as we call free_fib_info_rcu() only after the grace period, and the lookup code which could potentially modify fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems fine... Hi Pawel, Could you try the following debug patch on top of net-next branch and reproduce the issue check if there are warning msg showing? diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a352..82aff41c6f63 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } Thanks. Wei Yes, we believe skb_dst_force() and skb_dst_force_safe() should be unified (to the 'safe' version) We no longer have gc to protect from 0 -> 1 transition of dst refcount. After adding patch from Wei https://bugzilla.kernel.org/show_bug.cgi?id=197005#c14 OK we have two problems here 1) We need to unify skb_dst_force() ( for net tree ) 2) Vlan devices should try to correctly handle IFF_XMIT_DST_RELEASE from lower device. This will considerably help your performance. For 1), this is what I had in mind, can you try it ? Thanks a lot ! diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a3520bb7402f04d90cf04ac99c81cfbe..f23851eeaad917e8dafc06b58d23a2575405c894 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } @@ -311,21 +311,6 @@ static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb __skb_dst_copy(nskb, oskb->_skb_refdst); } -/** - * skb_dst_force - makes sure skb dst is refcounted - * @skb: buffer - * - * If dst is not yet refcounted, let's do it - */ -static inline void skb_dst_force(struct sk_buff *skb) -{ - if (skb_dst_is_noref(skb)) { - WARN_ON(!rcu_read_lock_held()); - skb->_skb_refdst &= ~SKB_DST_NOREF; - dst_clone(skb_dst(skb)); - } -} - /** * dst_hold_safe - Take a reference on a dst if possible * @dst: pointer to dst entry @@ -356,6 +341,23 @@ static inline void skb_dst_force_safe(struct sk_buff *skb) } } +/** + * skb_dst_force - makes sure skb dst is refcounted + * @skb: buffer + * + * If dst is not yet refcounted, let's do it + */ +static inline void skb_dst_force(struct sk_buff *skb) +{ + if (skb_dst_is_noref(skb)) { + struct dst_entry *dst = skb_dst(skb); + + WARN_ON(!rcu_read_lock_held()); + if (!dst_hold_safe(dst)) + dst = NULL; + skb->_skb_refdst = (unsigned long)dst; + } +} /** *__skb_tunnel_rx - prepare skb for rx reinsert Thanks What is weird i have this part in my net-next from git: /** * skb_dst_force_safe - makes sure skb dst is refcounted * @skb: buffer * * If dst is not yet refcounted and not destroyed, grab a ref on it. 
 */
static inline void skb_dst_force_safe(struct sk_buff *skb)
{
	if (skb_dst_is_noref(skb)) {
		struct dst_entry *dst = skb_dst(skb);

		if (!dst_hold_safe(dst))
			dst = NULL;

		skb->_skb_refdst = (unsigned long)dst;
	}
}
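As background for why both helpers test skb_dst_is_noref() first: skb->_skb_refdst encodes the dst pointer together with a low "noref" bit. The sketch below paraphrases the accessors from the kernel headers of that era; the _SKETCH/_sketch names are added here and the snippet is illustrative, not verbatim. A noref dst is only valid under rcu_read_lock(), which is why it must be converted into a real reference (skb_dst_force / skb_dst_force_safe) before the skb is parked somewhere that outlives the RCU section, such as a qdisc queue or a socket backlog.

#include <linux/skbuff.h>
#include <net/dst.h>

#define SKB_DST_NOREF_SKETCH	1UL
#define SKB_DST_PTRMASK_SKETCH	(~(SKB_DST_NOREF_SKETCH))

/* Mask off the low bit to recover the dst pointer. */
static inline struct dst_entry *skb_dst_sketch(const struct sk_buff *skb)
{
	return (struct dst_entry *)(skb->_skb_refdst & SKB_DST_PTRMASK_SKETCH);
}

/* Low bit set => the skb only borrows the dst under RCU, no refcount held. */
static inline bool skb_dst_is_noref_sketch(const struct sk_buff *skb)
{
	return (skb->_skb_refdst & SKB_DST_NOREF_SKETCH) && skb_dst_sketch(skb);
}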
Re: Latest net-next from GIT panic
W dniu 2017-09-21 o 03:17, Eric Dumazet pisze: On Wed, 2017-09-20 at 18:09 -0700, Wei Wang wrote: Thanks very much Pawel for the feedback. I was looking into the code (specifically IPv4 part) and found that in free_fib_info_rcu(), we call free_nh_exceptions() without holding the fnhe_lock. I am wondering if that could cause some race condition on fnhe->fnhe_rth_input/output so a double call on dst_dev_put() on the same dst could be happening. But as we call free_fib_info_rcu() only after the grace period, and the lookup code which could potentially modify fnhe->fnhe_rth_input/output all holds rcu_read_lock(), it seems fine... Hi Pawel, Could you try the following debug patch on top of net-next branch and reproduce the issue check if there are warning msg showing? diff --git a/include/net/dst.h b/include/net/dst.h index 93568bd0a352..82aff41c6f63 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -271,7 +271,7 @@ static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) static inline struct dst_entry *dst_clone(struct dst_entry *dst) { if (dst) - atomic_inc(&dst->__refcnt); + dst_hold(dst); return dst; } Thanks. Wei Yes, we believe skb_dst_force() and skb_dst_force_safe() should be unified (to the 'safe' version) We no longer have gc to protect from 0 -> 1 transition of dst refcount. After adding patch from Wei https://bugzilla.kernel.org/show_bug.cgi?id=197005#c14
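A note on why Wei's one-line debug change (atomic_inc -> dst_hold in dst_clone()) is useful for this hunt: around this kernel version dst_hold() carried a debug check roughly like the sketch below (an approximation with a _sketch suffix, not the exact source). With the gc removed, any code path that bumps a refcount which has already dropped to zero performs exactly the 0 -> 1 resurrection Eric mentions, and the WARN turns that otherwise silent use-after-free into a visible stack trace pointing at the offender.

#include <net/dst.h>

/* Approximation of dst_hold() with the debug check of that period. */
static inline void dst_hold_sketch(struct dst_entry *dst)
{
	/* atomic_inc_not_zero() returns 0 when the refcount was already 0,
	 * i.e. someone still holds a pointer to a dst that is being freed.
	 */
	WARN_ON(atomic_inc_not_zero(&dst->__refcnt) == 0);
}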
Re: Latest net-next from GIT panic
W dniu 2017-09-20 o 23:25, Paweł Staszewski pisze: W dniu 2017-09-20 o 23:24, Paweł Staszewski pisze: W dniu 2017-09-20 o 23:10, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze: W dniu 2017-09-20 o 20:36, Cong Wang pisze: On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet wrote: On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote: but dmesg at this time shows nothing about interfaces or flaps. This is very odd. We only free netdevice in free_netdev() and it is only called when we unregister a netdevice. Otherwise pcpu_refcnt is impossible to be NULL. If there is a missing dev_hold() or one dev_put() in excess, this would allow the netdev to be freed too soon. -> Use after free. memory holding netdev could be reallocated-cleared by some other kernel user. Sure, but only unregister could trigger a free. If there is no unregister, like what Pawel claims, then there is no free, the refcnt just goes to 0 but the memory is still there. About possible mistake from my side with bisect - i can judge too early that some bisect was good the road was: git bisect start # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075 # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31 # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024. git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure' git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20 # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2' git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230 # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow' git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70 # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2 # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put() And currently have this running for about 4 hours without problems. 
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36 # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag Here for sure - panic git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2 # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112 # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree im not 100% sure tor last two Will test them again starting from [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855 # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911 # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() What i can say more I can reproduce this on any server with similar configuration the difference can be teamd instead of bonding ixgbe or i40e and mlx5 Same problems vlans - more or less prefixes learned from bgp -> zebra -> netlink -> kernel But normally in lab when using only plain routing no bgpd and about 128 vlans - with 128 routes - cant reproduce this - this apperas only with bgp - minimum where i can reproduce this was about 130k prefixes with about 286 nexthops bisected again and same result: b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit commit b838d5e1c5b6e57b10ec8af2268824041e3ea911 Author: Wei Wang Date: Sat Jun 17 10:42:32 2017 -0700 ipv4: mark DST_NOGC and remove the operation of dst_free() With the previous preparation patches, we are ready to get rid of the dst gc operation in ipv4 code and release dst based on refcnt only. So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls to dst_free(). At this point, all dst created in ipv4 code do not use the dst gc anymore and will be destroyed at the point when refcnt drops to 0. Signed
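To make the bisect result easier to interpret: the commit identified as first bad moves IPv4 dsts to a purely refcount-driven lifetime. The sketch below paraphrases the release side of that model (names such as dst_destroy_rcu come from net/core/dst.c, but the function here is an illustration, not the exact code). With no garbage collector standing between "refcount hit zero" and "memory freed", a single missing dst_hold() anywhere in the forwarding path becomes a use-after-free or a refcount underflow instead of a harmless leak, which is consistent with the problem only reproducing once a large number of BGP prefixes is loaded.

#include <linux/rcupdate.h>
#include <linux/printk.h>
#include <net/dst.h>

/* Paraphrase of the refcount-only release model introduced by the series. */
void dst_release_sketch(struct dst_entry *dst)
{
	if (dst) {
		int newrefcnt = atomic_dec_return(&dst->__refcnt);

		if (unlikely(newrefcnt < 0))
			pr_warn("dst_release: dst %p refcnt %d\n", dst, newrefcnt);
		if (!newrefcnt)
			call_rcu(&dst->rcu_head, dst_destroy_rcu);	/* freed after an RCU grace period */
	}
}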
Re: Latest net-next from GIT panic
W dniu 2017-09-20 o 23:24, Paweł Staszewski pisze: W dniu 2017-09-20 o 23:10, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze: W dniu 2017-09-20 o 20:36, Cong Wang pisze: On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet wrote: On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote: but dmesg at this time shows nothing about interfaces or flaps. This is very odd. We only free netdevice in free_netdev() and it is only called when we unregister a netdevice. Otherwise pcpu_refcnt is impossible to be NULL. If there is a missing dev_hold() or one dev_put() in excess, this would allow the netdev to be freed too soon. -> Use after free. memory holding netdev could be reallocated-cleared by some other kernel user. Sure, but only unregister could trigger a free. If there is no unregister, like what Pawel claims, then there is no free, the refcnt just goes to 0 but the memory is still there. About possible mistake from my side with bisect - i can judge too early that some bisect was good the road was: git bisect start # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075 # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31 # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024. git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure' git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20 # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2' git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230 # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow' git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70 # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2 # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put() And currently have this running for about 4 hours without problems. 
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36 # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag Here for sure - panic git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2 # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112 # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree im not 100% sure tor last two Will test them again starting from [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855 # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911 # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() What i can say more I can reproduce this on any server with similar configuration the difference can be teamd instead of bonding ixgbe or i40e and mlx5 Same problems vlans - more or less prefixes learned from bgp -> zebra -> netlink -> kernel But normally in lab when using only plain routing no bgpd and about 128 vlans - with 128 routes - cant reproduce this - this apperas only with bgp - minimum where i can reproduce this was about 130k prefixes with about 286 nexthops bisected again and same result: b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit commit b838d5e1c5b6e57b10ec8af2268824041e3ea911 Author: Wei Wang Date: Sat Jun 17 10:42:32 2017 -0700 ipv4: mark DST_NOGC and remove the operation of dst_free() With the previous preparation patches, we are ready to get rid of the dst gc operation in ipv4 code and release dst based on refcnt only. So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls to dst_free(). At this point, all dst created in ipv4 code do not use the dst gc anymore and will be destroyed at the point when refcnt drops to 0. Signed-off-by: Wei Wang Acked-by: Mart
Re: Latest net-next from GIT panic
W dniu 2017-09-20 o 23:10, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze: W dniu 2017-09-20 o 20:36, Cong Wang pisze: On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet wrote: On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote: but dmesg at this time shows nothing about interfaces or flaps. This is very odd. We only free netdevice in free_netdev() and it is only called when we unregister a netdevice. Otherwise pcpu_refcnt is impossible to be NULL. If there is a missing dev_hold() or one dev_put() in excess, this would allow the netdev to be freed too soon. -> Use after free. memory holding netdev could be reallocated-cleared by some other kernel user. Sure, but only unregister could trigger a free. If there is no unregister, like what Pawel claims, then there is no free, the refcnt just goes to 0 but the memory is still there. About possible mistake from my side with bisect - i can judge too early that some bisect was good the road was: git bisect start # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075 # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31 # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024. git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure' git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20 # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2' git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230 # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow' git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70 # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2 # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put() And currently have this running for about 4 hours without problems. 
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36 # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag Here for sure - panic git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2 # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112 # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree im not 100% sure tor last two Will test them again starting from [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855 # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911 # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() What i can say more I can reproduce this on any server with similar configuration the difference can be teamd instead of bonding ixgbe or i40e and mlx5 Same problems vlans - more or less prefixes learned from bgp -> zebra -> netlink -> kernel But normally in lab when using only plain routing no bgpd and about 128 vlans - with 128 routes - cant reproduce this - this apperas only with bgp - minimum where i can reproduce this was about 130k prefixes with about 286 nexthops bisected again and same result: b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit commit b838d5e1c5b6e57b10ec8af2268824041e3ea911 Author: Wei Wang Date: Sat Jun 17 10:42:32 2017 -0700 ipv4: mark DST_NOGC and remove the operation of dst_free() With the previous preparation patches, we are ready to get rid of the dst gc operation in ipv4 code and release dst based on refcnt only. So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls to dst_free(). At this point, all dst created in ipv4 code do not use the dst gc anymore and will be destroyed at the point when refcnt drops to 0. Signed-off-by: Wei Wang Acked-by: Martin KaFai Lau Signed-off-by: David S.
Re: Latest net-next from GIT panic
W dniu 2017-09-20 o 21:23, Paweł Staszewski pisze: W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze: W dniu 2017-09-20 o 20:36, Cong Wang pisze: On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet wrote: On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote: but dmesg at this time shows nothing about interfaces or flaps. This is very odd. We only free netdevice in free_netdev() and it is only called when we unregister a netdevice. Otherwise pcpu_refcnt is impossible to be NULL. If there is a missing dev_hold() or one dev_put() in excess, this would allow the netdev to be freed too soon. -> Use after free. memory holding netdev could be reallocated-cleared by some other kernel user. Sure, but only unregister could trigger a free. If there is no unregister, like what Pawel claims, then there is no free, the refcnt just goes to 0 but the memory is still there. About possible mistake from my side with bisect - i can judge too early that some bisect was good the road was: git bisect start # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075 # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31 # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024. git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure' git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20 # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2' git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230 # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow' git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70 # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2 # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put() And currently have this running for about 4 hours without problems. 
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36 # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag Here for sure - panic git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2 # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112 # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree im not 100% sure tor last two Will test them again starting from [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855 # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911 # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() What i can say more I can reproduce this on any server with similar configuration the difference can be teamd instead of bonding ixgbe or i40e and mlx5 Same problems vlans - more or less prefixes learned from bgp -> zebra -> netlink -> kernel But normally in lab when using only plain routing no bgpd and about 128 vlans - with 128 routes - cant reproduce this - this apperas only with bgp - minimum where i can reproduce this was about 130k prefixes with about 286 nexthops bisected again and same result: b838d5e1c5b6e57b10ec8af2268824041e3ea911 is the first bad commit commit b838d5e1c5b6e57b10ec8af2268824041e3ea911 Author: Wei Wang Date: Sat Jun 17 10:42:32 2017 -0700 ipv4: mark DST_NOGC and remove the operation of dst_free() With the previous preparation patches, we are ready to get rid of the dst gc operation in ipv4 code and release dst based on refcnt only. So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls to dst_free(). At this point, all dst created in ipv4 code do not use the dst gc anymore and will be destroyed at the point when refcnt drops to 0. Signed-off-by: Wei Wang Acked-by: Martin KaFai Lau Signed-off-by: David S. Miller :04 04 9b7e7fb641de6531fc7887473ca47ef7cb6a11da 831a73b71d3df1755f3e
Re: Latest net-next from GIT panic
W dniu 2017-09-20 o 21:13, Paweł Staszewski pisze: W dniu 2017-09-20 o 20:36, Cong Wang pisze: On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet wrote: On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote: but dmesg at this time shows nothing about interfaces or flaps. This is very odd. We only free netdevice in free_netdev() and it is only called when we unregister a netdevice. Otherwise pcpu_refcnt is impossible to be NULL. If there is a missing dev_hold() or one dev_put() in excess, this would allow the netdev to be freed too soon. -> Use after free. memory holding netdev could be reallocated-cleared by some other kernel user. Sure, but only unregister could trigger a free. If there is no unregister, like what Pawel claims, then there is no free, the refcnt just goes to 0 but the memory is still there. About possible mistake from my side with bisect - i can judge too early that some bisect was good the road was: git bisect start # bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 'pinctrl-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075 # good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31 # bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid using stack larger than 1024. git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f # good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 'udp-reduce-cache-pressure' git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20 # bad: [8abd5599a520e9f188a750f1bde9dde5fb856230] Merge branch 's390-net-updates-part-2' git bisect bad 8abd5599a520e9f188a750f1bde9dde5fb856230 # good: [2fae5d0e647c6470d206e72b5fc24972bb900f70] Merge branch 'bpf-ctx-narrow' git bisect good 2fae5d0e647c6470d206e72b5fc24972bb900f70 # good: [41500c3e2a19ffcf40a7158fce1774de08e26ba2] rds: tcp: remove cp_outgoing git bisect good 41500c3e2a19ffcf40a7158fce1774de08e26ba2 # bad: [8917a777be3ba566377be05117f71b93a5fd909d] tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix git bisect bad 8917a777be3ba566377be05117f71b93a5fd909d # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put() And currently have this running for about 4 hours without problems. 
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36 # bad: [a4c2fd7f78915a0d7c5275e7612e7793157a01f2] net: remove DST_NOCACHE flag Here for sure - panic git bisect bad a4c2fd7f78915a0d7c5275e7612e7793157a01f2 # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112 # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f # bad: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree im not 100% sure tor last two Will test them again starting from [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly git bisect bad 1cfb71eeb12047bcdbd3e6730ffed66e810a0855 # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911 # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() What i can say more I can reproduce this on any server with similar configuration the difference can be teamd instead of bonding ixgbe or i40e and mlx5 Same problems vlans - more or less prefixes learned from bgp -> zebra -> netlink -> kernel But normally in lab when using only plain routing no bgpd and about 128 vlans - with 128 routes - cant reproduce this - this apperas only with bgp - minimum where i can reproduce this was about 130k prefixes with about 286 nexthops