Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2019-12-18 Thread Hans Petter Selasky

On 2019-12-18 07:26, Hans Petter Selasky wrote:

On 2019-12-17 18:14, Andrey V. Elsukov wrote:

On 13.12.2019 17:27, Hans Petter Selasky wrote:

On 2019-12-13 14:40, Andrey V. Elsukov wrote:

On 05.12.2018 17:20, Slava Shwartsman wrote:

Author: slavash
Date: Wed Dec  5 14:20:57 2018
New Revision: 341578
URL: https://svnweb.freebsd.org/changeset/base/341578

Log:
    mlx5en: Remove the DRBR and associated logic in the transmit path.

    The hardware queues are deep enough currently and using the DRBR and
    associated callbacks only leads to more task switching in the TX path.
    There is also a race setting the queue_state which can lead to hung TX
    rings.


JFYI. We have compared the same router+firewall workloads on the host
with this change and before, and I can say that without DRBR on TX we
now constantly see several percent of packets dropped due to ENOBUFS
errors from mlx5e_xmit().



Have you tried to tune the TX/RX parameters?

Especially the tx_queue_size .


We use the following settings:
% sysctl dev.mce.4.conf. | grep que
dev.mce.4.conf.rx_queue_size: 16384
dev.mce.4.conf.tx_queue_size: 16384
dev.mce.4.conf.rx_queue_size_max: 16384
dev.mce.4.conf.tx_queue_size_max: 16384

Also, previously I have patched MLX5E_SQ_TX_QUEUE_SIZE value up to 16384.


Hi,

What about the other parameters? Did you tune any of those?

At what rate does this happen?

Can you send me the full dev.mce.4 sysctl tree off-list?



Are you using any performance options like RSS in the kernel?

How many NUMA domains does this machine have? Have you tuned the driver 
threads, like binding interrupt threads to CPU's?


--HPS

___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2019-12-17 Thread Hans Petter Selasky

On 2019-12-17 18:14, Andrey V. Elsukov wrote:

On 13.12.2019 17:27, Hans Petter Selasky wrote:

On 2019-12-13 14:40, Andrey V. Elsukov wrote:

On 05.12.2018 17:20, Slava Shwartsman wrote:

Author: slavash
Date: Wed Dec  5 14:20:57 2018
New Revision: 341578
URL: https://svnweb.freebsd.org/changeset/base/341578

Log:
    mlx5en: Remove the DRBR and associated logic in the transmit path.

    The hardware queues are deep enough currently and using the DRBR and
    associated callbacks only leads to more task switching in the TX path.
    There is also a race setting the queue_state which can lead to hung TX
    rings.


JFYI. We have compared the same router+firewall workloads on the host
with this change and before, and I can say that without DRBR on TX we
now constantly see several percent of packets dropped due to ENOBUFS
errors from mlx5e_xmit().



Have you tried to tune the TX/RX parameters?

Especially the tx_queue_size .


We use the following settings:
% sysctl dev.mce.4.conf. | grep que
dev.mce.4.conf.rx_queue_size: 16384
dev.mce.4.conf.tx_queue_size: 16384
dev.mce.4.conf.rx_queue_size_max: 16384
dev.mce.4.conf.tx_queue_size_max: 16384

Also, previously I have patched MLX5E_SQ_TX_QUEUE_SIZE value up to 16384.


Hi,

What about the other parameters? Did you tune any of those?

At what rate does this happen?

Can you send me the full dev.mce.4 sysctl tree off-list?

--HPS


Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2019-12-17 Thread Andrey V. Elsukov
On 13.12.2019 17:27, Hans Petter Selasky wrote:
> On 2019-12-13 14:40, Andrey V. Elsukov wrote:
>> On 05.12.2018 17:20, Slava Shwartsman wrote:
>>> Author: slavash
>>> Date: Wed Dec  5 14:20:57 2018
>>> New Revision: 341578
>>> URL: https://svnweb.freebsd.org/changeset/base/341578
>>>
>>> Log:
>>>    mlx5en: Remove the DRBR and associated logic in the transmit path.
>>>
>>>    The hardware queues are deep enough currently and using the DRBR and
>>>    associated callbacks only leads to more task switching in the TX path.
>>>    There is also a race setting the queue_state which can lead to hung TX
>>>    rings.
>>
>> JFYI. We have compared the same router+firewall workloads on the host
>> with this change and before, and I can say that without DRBR on TX we
>> now constantly see several percent of packets dropped due to ENOBUFS
>> errors from mlx5e_xmit().
>>
> 
> Have you tried to tune the TX/RX parameters?
> 
> Especially the tx_queue_size .

We use the following settings:
% sysctl dev.mce.4.conf. | grep que
dev.mce.4.conf.rx_queue_size: 16384
dev.mce.4.conf.tx_queue_size: 16384
dev.mce.4.conf.rx_queue_size_max: 16384
dev.mce.4.conf.tx_queue_size_max: 16384

Also, previously I have patched MLX5E_SQ_TX_QUEUE_SIZE value up to 16384.

-- 
WBR, Andrey V. Elsukov





Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2019-12-13 Thread Hans Petter Selasky

On 2019-12-13 14:40, Andrey V. Elsukov wrote:

On 05.12.2018 17:20, Slava Shwartsman wrote:

Author: slavash
Date: Wed Dec  5 14:20:57 2018
New Revision: 341578
URL: https://svnweb.freebsd.org/changeset/base/341578

Log:
   mlx5en: Remove the DRBR and associated logic in the transmit path.

   The hardware queues are deep enough currently and using the DRBR and
   associated callbacks only leads to more task switching in the TX path.
   There is also a race setting the queue_state which can lead to hung TX
   rings.


JFYI. We have compared the same router+firewall workloads on the host
with this change and before, and I can say that without DRBR on TX we
now constantly see several percent of packets dropped due to ENOBUFS
errors from mlx5e_xmit().



Have you tried to tune the TX/RX parameters?

Especially the tx_queue_size .

--HPS


Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2019-12-13 Thread Andrey V. Elsukov
On 05.12.2018 17:20, Slava Shwartsman wrote:
> Author: slavash
> Date: Wed Dec  5 14:20:57 2018
> New Revision: 341578
> URL: https://svnweb.freebsd.org/changeset/base/341578
> 
> Log:
>   mlx5en: Remove the DRBR and associated logic in the transmit path.
>
>   The hardware queues are deep enough currently and using the DRBR and
>   associated callbacks only leads to more task switching in the TX path.
>   There is also a race setting the queue_state which can lead to hung TX
>   rings.

JFYI. We have compared the same router+firewall workloads on the host
with this change and before, and I can say that without DRBR on TX we
now constantly see several percent of packets dropped due to ENOBUFS
errors from mlx5e_xmit().

-- 
WBR, Andrey V. Elsukov





Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2018-12-19 Thread Bruce Evans

On Wed, 19 Dec 2018, Bruce Evans wrote:


On Wed, 19 Dec 2018, Bruce Evans wrote:


On Mon, 17 Dec 2018, Andrew Gallatin wrote:


On 12/17/18 2:08 PM, Bruce Evans wrote:

* ...

iflib uses queuing techniques to significantly pessimize em NICs with 1
hardware queue.  On fast machines, it attempts to do 1 context switch per

...

This can happen even w/o contention when "abdicate" is enabled in mp
ring. I complained about this as well, and the default was changed in
mp ring to not always "abdicate" (eg, switch to the tq to handle the
packet). Abdication substantially pessimizes Netflix style web uncontended 
workloads, but it generally helps small packet forwarding.


It is interesting that you see the opposite.  I should try benchmarking
with just a single ring.


Hmm, I didn't remember "abdicated" and never knew about the sysctl for it
(the sysctl is newer), but I noticed the slowdown from near the first
commit for it (r323954) and already used the following workaround for it:
...
This essentially just adds back the previous code with a flag to check
both versions.  Hopefully the sysctl can do the same thing.


It doesn't.  Setting tx_abdicate to 1 gives even more context switches (almost
twice as many, 800k/sec instead of 400k/sec, on i386 pessimized by
INVARIANTS, WITNESS, !WITNESS_SKIPSPIN, 4G KVA and more.  Without
...


I now understand most of the slownesses and variations in benchmarks.

Short summary:

After arcane tuning including a sysctl only available in my version
of SCHED_4BSD, on amd64 iflib in -current runs as fast as old em
with EM_MULTIQUEUE and no other tuning in FreeBSD-11; i386 also needs
a CPU almost 3 times faster to compensate for the overhead of having
4G KVA (but no other security pessimizations in either).

Long summary:

iflib with tx_abdicate=0 runs a bit like old em without EM_MULTIQUEUE,
provided the NIC is I218V and not PRO1000 and/or the CPU is too slow
to saturate the NIC and/or the network.  iflib is just 10% slower.
Nor does it do excessive context switches to tgq with I218V (context
switches seem to be limited to not much more than 2 per h/w interrupt,
and h/w interrupts are normally moderated to 8kHz).  However, iflib
does excessive context switches for PRO1000.  I don't know if this is
for hardware reasons or just for dropping packets.

iflib with tx_abdicate=1 runs a bit like old em with EM_MULTIQUEUE.  Due
to general slowness, even a 4GHz i7 has difficulty saturating 1Gbps ethernet
with small packets.  tx_abdicate=1 allows it to saturate by using tgq more.
This causes lots of context switches and otherwise uses lots of CPU (60%
of a 4GHz i7 for iflib).  Old em with EM_MULTIQUEUE gives identical kpps
and saturation and dropped packets for spare cycles on the CPU producing
the packets, but I think it does less context switches and uses less CPU
for tgq.  This is mostly for the I218V.

I got apparently-qualitatively-different results on i386 because I mostly
tested i386 with the PRO1000 where there are excessive context switches
on both amd64 and i386 with tx_abdicate=0.  tx_abdicate=1 gives even more
excessive context switches (about twice as many) for the PRO1000.

I got apparently-qualitatively-different results for some old benchmarks
because I used an old version of FreeBSD (r332488) for many of them, and
also had version problems within this version.  iflib in this version
forces tx_abdicate=1.  I noticed the extra context switches from this
long ago, and had an option which defaulted to using older iflib code
which seemed to work better.  But I misedited the non-default case of
this and had the double drainage check bug that was added in -current
in r366560 and fixed in -current in r341824.  This gave excessive extra
context switches, so the commit that added abdication (r323954) seemed
to be even slower than it was.

The fastest case by a significant amount (saturation on I218V using
1.6 times less CPU) is with netblast bound to the same CPU as tgq,
PREEMPTION* not configured, my scheduler modification that reduces
preemption even further (this modification selected using a sysctl),
and tx_abdicate=1.  Then the scheduler modification delays most switches
to tgq, and tx_abdicate=1 apparently allows such context switches when
they are useful (I think netblast fills a queue and then tx_abdicate=1
gives a context switch immediately, but tx_abdicate=0 doesn't give a
context switch soon enough).  But without the scheduler modification,
this is the slowest case (tx_abdicate=1 forces context switches to tgq
after every packet, and since netblast is bound to the same CPU, it
can't run).  In both cases, only 1 CPU is used, but the context switches
reduce throughput by about a factor of 2.

It is less clear why throughput counting dropped packets is lower for
netblast not bound and tx_abdicate=0.  Then tgq apparently doesn't run
promptly enough to saturate the net, but netblast has its own CPU so
it doesn't stop when tgq runs so it should be able to produce even

Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2018-12-18 Thread Bruce Evans

On Wed, 19 Dec 2018, Bruce Evans wrote:


On Mon, 17 Dec 2018, Andrew Gallatin wrote:


On 12/17/18 2:08 PM, Bruce Evans wrote:

* ...

iflib uses queuing techniques to significantly pessimize em NICs with 1
hardware queue.?? On fast machines, it attempts to do 1 context switch per

...

This can happen even w/o contention when "abdicate" is enabled in mp
ring. I complained about this as well, and the default was changed in
mp ring to not always "abdicate" (eg, switch to the tq to handle the
packet). Abdication substantially pessimizes Netflix style web uncontended 
workloads, but it generally helps small packet forwarding.


It is interesting that you see the opposite.  I should try benchmarking
with just a single ring.


Hmm, I didn't remember "abdicated" and never knew about the sysctl for it
(the sysctl is newer), but I noticed the slowdown from near the first
commit for it (r323954) and already used the following workaround for it:
...
This essentially just adds back the previous code with a flag to check
both versions.  Hopefully the sysctl can do the same thing.


It doesn't.  Setting tx_abdicate to 1 gives even more context switches (almost
twice as many, 800k/sec instead of 400k/sec, on i386 pessimized by
INVARIANTS, WITNESS, !WITNESS_SKIPSPIN, 4G KVA and more.  Without
pessimizations it does 1M/sec instead of 400k/sec).  The behaviour is easy
to understand by watching top -SH -m io with netblast bound to the same
CPU as the main tgq.  Then netblast does involuntary context switches at
the same rate that the tgq does voluntary context switches, and tx_abdicate=1
doubles this rate.  netblast only switches at the quantum rate (11 per second)
when not bound (I think it does null switches and it is a bug to count these
as switches, but even null switches do too much).

This is also without my usual default of !PREEMPTION && !IPI_PREEMPTION.
Binding the netblast to the same CPU as the tgq only stops the excessive
context switches when !PREEMPTION.  My hack might depend on this too.
Unfortunately, the hack is not in the same kernels as the sysctl, and I
already have too many combinations to test.

Another test with only 4G KVA (no INVARIANTS, etc., no PREEMPTION):
tx_abdicate=0: tgq switch rate  997-1017k/sec (16k/sec if netblast bound)
tx_abdicate=1: tgq switch rate 1300-1350k/sec (16k/sec if netblast bound)

Another test on amd64 to escape i386 4G KVA pessimizations:
tx_abdicate=0: tgq switch rate 1110-1220k/sec (16k/sec if netblast bound)
tx_abdicate=1: tgq switch rate 1360-1430k/sec (16k/sec if netblast bound)

When netblast is bound to the tgq's CPU, the tgq actually runs on another
CPU.  Apparently, the binding is weak or this is a bugfeature in my scheduler.

When tx_abdicate=1, the switch rate is close to the packet rate.  Since the
NIC can't keep up, most packets are dropped.  On amd64 with tx_abdicate=1,
the packet rates are:

netblast bound:   313kpps sent, 1604kpps dropped
netblast unbound: 253kpps sent, 1153kpps dropped

253kpps sent is bad.  This indicates large latencies (not due to !PREEMPTION
or scheduler bugs AFAIK).  Most tests with netblast unbound seemed to saturate
the NIC at 280kpps (but the tests with netblast bound shows that the NIC can
go a little faster).  Even an old 2GHz CPU can reach 280kpps.

This shows another problem with taskqueues.  It takes context switches just
to decide to drop packets.  Previous versions of iflib were much slower at
dropping packets.  Some had rates closer to the low send rate than the 1604kpps
achieved above.  FreeBSD-5 running on a single 3 times slower CPU can drop
packets at 2124kpps, mainly by dropping them in ip_output() after peeking at
the software ifqs to see that there is no space.  IFF_MONITOR gives better
tests of the syscall overhead.

Another test with amd64 and I218V instead of PRO1000:

netblast bound, !abdicate:   1243kpps sent,   0kpps dropped   (16k/sec csw)
netblast unbound, !abdicate: 1236kpps sent,   0kpps dropped   (16k/sec csw)
netblast bound, abdicate:    1485kpps sent, 243kpps dropped   (16k/sec csw)
netblast unbound, abdicate:  1407kpps sent, 1.7kpps dropped  (850k/sec csw)

There is an i386 dependency after all!  !abdicate works on amd64 but not
on i386 to prevent the excessive context switches.  Unfortunately, it also
reduces kpps by almost 20% and leaves no spare CPU for dropping packets.

The best case of netblast bound, abdicate is competitive with FreeBSD-11
on i386 with EM_MULTIQUEUE: above result repeated:

netblast bound, abdicate:    1485kpps sent, 243kpps dropped  (16k/sec csw)

previous best result:

FBSD-11 SMP-8 1486+241 # no iflib, use EM_MULTIQUEUE (now saturate 1Gbps)

(this is without PREEMPTION* and without binding netblast).

The above for -current also has the lowest possible CPU use (100% of 1 CPU
for all threads, while netblast unbound takes 100% of 1 CPU for netblast and
60% of another CPU for tgq), and I think the FBSD=11 case takes 100% of 1
CPU for netblast unbound and a tiny% of another CPU fo

Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2018-12-18 Thread Bruce Evans

On Mon, 17 Dec 2018, Andrew Gallatin wrote:


On 12/17/18 2:08 PM, Bruce Evans wrote:

On Mon, 17 Dec 2018, Andrew Gallatin wrote:


On 12/5/18 9:20 AM, Slava Shwartsman wrote:

Author: slavash
Date: Wed Dec  5 14:20:57 2018
New Revision: 341578
URL: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__svnweb.freebsd.org_changeset_base_341578&d=DwIDaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1i6YHLR0Sj_gZ4adc&r=Ed-falealxPeqc22ehgAUCLh8zlZbibZLSMWJeZro4A&m=BFp2c_-S0jnzRZJF2APwvTwmnmVFcyjcnBvHRZ3Locc&s=b7fvhOzf_b5bMVGquu4SaBhMNql5N8dVPAvpfKtz53Q&e= 


Log:
 mlx5en: Remove the DRBR and associated logic in the transmit path.

 The hardware queues are deep enough currently and using the DRBR and
 associated callbacks only leads to more task switching in the TX path.
 There is also a race setting the queue_state which can lead to hung TX
 rings.


The point of DRBR in the tx path is not simply to provide a software ring
for queuing excess packets.  Rather it provides a mechanism to avoid lock
contention by shoving a packet into the software ring, where it will later
be found & processed, rather than blocking the caller on a mtx lock.  I'm
concerned you may have introduced a performance regression for use cases
where you have N:1 or N:M lock contention where many threads on different
cores are contending for the same tx queue.  The state of the art for this
is no longer DRBR, but mp_ring, as used by both cxgbe and iflib.


iflib uses queuing techniques to significantly pessimize em NICs with 1
hardware queue.  On fast machines, it attempts to do 1 context switch per


Bruce Evans didn't write that.  Some mail program converted 2-space sentence
breaks to \xc2\xa0.


This can happen even w/o contention when "abdicate" is enabled in mp
ring. I complained about this as well, and the default was changed in
mp ring to not always "abdicate" (eg, switch to the tq to handle the
packet). Abdication substantially pessimizes Netflix style web uncontended 
workloads, but it generally helps small packet forwarding.


It is interesting that you see the opposite.  I should try benchmarking
with just a single ring.


Hmm, I didn't remember "abdicated" and never knew about the sysctl for it
(the sysctl is newer), but I noticed the slowdown from near the first
commit for it (r323954) and already used the following workaround for it:

XX Index: iflib.c
XX ===
XX --- iflib.c  (revision 332488)
XX +++ iflib.c  (working copy)
XX @@ -1,3 +1,5 @@
XX +int bde_oldnet = 1;
XX +
XX  /*-
XX   * Copyright (c) 2014-2018, Matthew Macy 
XX   * All rights reserved.
XX @@ -3650,9 +3652,17 @@
XX  IFDI_TX_QUEUE_INTR_ENABLE(ctx, txq->ift_id);
XX  return;
XX  }
XX +if (bde_oldnet) {
XX  if (txq->ift_db_pending)
XX  ifmp_ring_enqueue(txq->ift_br, (void **)&txq, 1, TX_BATCH_SIZE);
XX +else
XX +ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE);
XX +} else {
XX +if (txq->ift_db_pending)
XX +ifmp_ring_enqueue(txq->ift_br, (void **)&txq, 1, TX_BATCH_SIZE);
XX  ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE);
XX +}
XX +ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE);
XX  if (ctx->ifc_flags & IFC_LEGACY)
XX  IFDI_INTR_ENABLE(ctx);
XX  else {
XX @@ -3862,8 +3872,11 @@
XX  DBG_COUNTER_INC(tx_seen);
XX  err = ifmp_ring_enqueue(txq->ift_br, (void **)&m, 1, TX_BATCH_SIZE);
XX 
XX +if (!bde_oldnet)

XX  GROUPTASK_ENQUEUE(&txq->ift_task);
XX  if (err) {
XX +if (bde_oldnet)
XX +GROUPTASK_ENQUEUE(&txq->ift_task);
XX  /* support forthcoming later */
XX  #ifdef DRIVER_BACKPRESSURE
XX  txq->ift_closed = TRUE;
XX @@ -3870,6 +3883,9 @@
XX  #endif
XX  ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE);
XX  m_freem(m);
XX +} else if (TXQ_AVAIL(txq) < (txq->ift_size >> 1)) {
XX +if (bde_oldnet)
XX +GROUPTASK_ENQUEUE(&txq->ift_task);
XX  }
XX 
XX  	return (err);

XX Index: mp_ring.c
XX ===
XX --- mp_ring.c(revision 332488)
XX +++ mp_ring.c(working copy)
XX @@ -1,3 +1,11 @@
XX +#include "opt_pci.h"
XX +
XX +#ifdef DEV_PCI
XX +extern int bde_oldnet;
XX +#else
XX +#define bde_oldnet  0
XX +#endif
XX +
XX  /*-
XX   * Copyright (c) 2014 Chelsio Communications, Inc.
XX   * All rights reserved.
XX @@ -454,12 +462,25 @@
XX  do {
XX  os.state = ns.state = r->state;
XX  ns.pidx_tail = pidx_stop;
XX +if (bde_oldnet)
XX +ns.flags = BUSY;
XX +else {
XX  if (os.flags == IDLE)
XX  ns.flags = ABDICATED;
XX +}
XX  } while (atomic_cmpset_rel_64(&r->state, os.state, ns.state) == 0);
XX  critical_exit();
XX  counter_u64_add(r->enqueues, n);
XX 
XX +if (bde_oldnet) {

XX +/*
XX + * Turn into a consumer if some other thread i

Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2018-12-18 Thread Gary Jennejohn
On Mon, 17 Dec 2018 14:50:04 -0500
Andrew Gallatin  wrote:

> On 12/17/18 2:08 PM, Bruce Evans wrote:

[snip]
> > iflib uses queuing techniques to significantly pessimize em NICs with 1
> > hardware queue. On fast machines, it attempts to do 1 context switch per  
> 
> This can happen even w/o contention when "abdicate" is enabled in mp
> ring. I complained about this as well, and the default was changed in
> mp ring to not always "abdicate" (eg, switch to the tq to handle the
> packet). Abdication substantially pessimizes Netflix style web 
> uncontended workloads, but it generally helps small packet forwarding.
> 
> It is interesting that you see the opposite.  I should try benchmarking
> with just a single ring.
> 

Why are iflib and ifdi compiled into EVERY kernel with device
ether and/or device pci when only a few NICs actually use iflib? 
This is really unnecessary bloat in an already bloated kernel.

I use if_re which does not use iflib.

I removed iflib and ifdi from /sys/conf/files and my network
still works just fine.  It seems to me like these iflib entries
need finer-grained options, e.g. one of the NICs which use
iflib is enabled, before pulling them into the kernel build.

-- 
Gary Jennejohn


Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2018-12-17 Thread Andrew Gallatin

On 12/17/18 2:08 PM, Bruce Evans wrote:

On Mon, 17 Dec 2018, Andrew Gallatin wrote:


On 12/5/18 9:20 AM, Slava Shwartsman wrote:

Author: slavash
Date: Wed Dec  5 14:20:57 2018
New Revision: 341578
URL: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__svnweb.freebsd.org_changeset_base_341578&d=DwIDaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1i6YHLR0Sj_gZ4adc&r=Ed-falealxPeqc22ehgAUCLh8zlZbibZLSMWJeZro4A&m=BFp2c_-S0jnzRZJF2APwvTwmnmVFcyjcnBvHRZ3Locc&s=b7fvhOzf_b5bMVGquu4SaBhMNql5N8dVPAvpfKtz53Q&e= 



Log:
   mlx5en: Remove the DRBR and associated logic in the transmit path.

   The hardware queues are deep enough currently and using the DRBR and
   associated callbacks only leads to more task switching in the TX path.
   There is also a race setting the queue_state which can lead to hung TX
   rings.


The point of DRBR in the tx path is not simply to provide a software ring
for queuing excess packets.  Rather it provides a mechanism to avoid lock
contention by shoving a packet into the software ring, where it will later
be found & processed, rather than blocking the caller on a mtx lock.  I'm
concerned you may have introduced a performance regression for use cases
where you have N:1 or N:M lock contention where many threads on different
cores are contending for the same tx queue.  The state of the art for this
is no longer DRBR, but mp_ring, as used by both cxgbe and iflib.


iflib uses queuing techniques to significantly pessimize em NICs with 1
hardware queue.  On fast machines, it attempts to do 1 context switch per


This can happen even w/o contention when "abdicate" is enabled in mp
ring. I complained about this as well, and the default was changed in
mp ring to not always "abdicate" (eg, switch to the tq to handle the
packet). Abdication substantially pessimizes Netflix style web 
uncontended workloads, but it generally helps small packet forwarding.


It is interesting that you see the opposite.  I should try benchmarking
with just a single ring.




(small) tx packet and can't keep up.  On slow machines it has a chance of
handling multiple packets per context switch, but since the machine is too
slow it can't keep up and saturates at a slightly different point.  Results
for netblast $lanhost 5001 5 10 (5-byte payload for 10 seconds) on an I218V
on Haswell 4 cores x 2 threads @4.08GHz running i386:

Old results with no iflib and no EM_MULTIQUEUE except as indicated:

FBSD-10     UP     1377+0
FBSD-11     UP     1326+0
FBSD-11     SMP-1  1484+0
FBSD-11     SMP-8  1395+0
FBSD-12mod  SMP-1  1386+0
FBSD-12mod  SMP-8  1422+0
FBSD-12mod  SMP-1  1270+0   # use iflib (lose 8% performance)
FBSD-12mod  SMP-8  1279+0   # use iflib (lose 10% performance using more CPU)


1377+0 means 1377 kpps sent and 0 kpps errors, etc.  SMP-8 means use all 8
CPUs.  SMP-1 means restrict netblast to 1 CPU different from the taskqueue
CPUs using cpuset.

New results:

FBSD-11     SMP-8  1440+0   # no iflib, no EM_MULTIQUEUE
FBSD-11     SMP-8  1486+241 # no iflib, use EM_MULTIQUEUE (now saturate 1Gbps)
FBSD-cur    SMP-8   533+0   # use iflib, use i386 with 4G KVA

iflib only decimates performance relative to the FreeBSD-11 version
with no EM_MULTIQUEUE, but EM_MULTIQUEUE gives better queueing using
more CPUs.  This gives the extra 10-20% of performance needed to
saturate the NIC and 1Gbps ethernet.  The FreeBSD-current version is
not directly comparable since using 4G KVA on i386 reduces performance
by about a factor of 2.5 for all loads with mostly small i/o's (for
128K disk i/o's the reduction is only 10-20%).  i386 ran at about the
same speed as amd64 when it had 1GB KVA (but I don't have any saved
results for amd64 to compare with precisely).  This is all with
security-related things like ibrs unavailable or turned off.

All versions use normal Intel interrupt moderation which gives an interrupt
rate of 8k/sec.

Old versions of em use a "fast" interrupt handler and a slow switch
to a taskqueue.  This gives a context switch rate of about 16k/sec.
In the SMP case, netblast normally runs on another CPU and I think it
fills h/w tx queue(s) synchronously, and the taskqueue only does minor
cleanups.  Old em also has a ping latency of about 10% smaller than
with iflib (73 usec instead of 80 usec after setting em.x.itr to 0 and
other tuning to kill interrupt moderation, and similar for a bge NIC
on the other end).  The synchronous queue filling probably improves
latency, but it is hard to see how it makes a difference of more than
1 usec.  73 is already too high.  An old PRO1000 Intel NIC has a latency
of only 50 usec on the same network.  The switch costs about 20 usec
of this.

iflib uses taskqueue more.  netblast normally runs on another CPU and
I think it only fills s/w tx queue(s) synchronously, and wakes up the
taskqueues for every packet.  The CPUs are almost fast enough to keep
up, and the system does about 1M context switches for this (in versions
other than i386 with 4G KVA).  That is slightly more than 2 packets per
switch to 

Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2018-12-17 Thread Bruce Evans

On Mon, 17 Dec 2018, Andrew Gallatin wrote:


On 12/5/18 9:20 AM, Slava Shwartsman wrote:

Author: slavash
Date: Wed Dec  5 14:20:57 2018
New Revision: 341578
URL: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__svnweb.freebsd.org_changeset_base_341578&d=DwIDaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1i6YHLR0Sj_gZ4adc&r=Ed-falealxPeqc22ehgAUCLh8zlZbibZLSMWJeZro4A&m=BFp2c_-S0jnzRZJF2APwvTwmnmVFcyjcnBvHRZ3Locc&s=b7fvhOzf_b5bMVGquu4SaBhMNql5N8dVPAvpfKtz53Q&e=


Log:
   mlx5en: Remove the DRBR and associated logic in the transmit path.

   The hardware queues are deep enough currently and using the DRBR and
   associated callbacks only leads to more task switching in the TX path.
   There is also a race setting the queue_state which can lead to hung TX
   rings.


The point of DRBR in the tx path is not simply to provide a software ring
for queuing excess packets.  Rather it provides a mechanism to avoid lock
contention by shoving a packet into the software ring, where it will later
be found & processed, rather than blocking the caller on a mtx lock.  I'm
concerned you may have introduced a performance regression for use cases
where you have N:1 or N:M lock contention where many threads on different
cores are contending for the same tx queue.  The state of the art for this
is no longer DRBR, but mp_ring, as used by both cxgbe and iflib.


iflib uses queuing techniques to significantly pessimize em NICs with 1
hardware queue.  On fast machines, it attempts to do 1 context switch per
(small) tx packet and can't keep up.  On slow machines it has a chance of
handling multiple packets per context switch, but since the machine is too
slow it can't keep up and saturates at a slightly different point.  Results
for netblast $lanhost 5001 5 10 (5-byte payload for 10 seconds) on an I218V
on Haswell 4 cores x 2 threads @4.08GHz running i386:

Old results with no iflib and no EM_MULTIQUEUE except as indicated:

FBSD-10     UP     1377+0
FBSD-11     UP     1326+0
FBSD-11     SMP-1  1484+0
FBSD-11     SMP-8  1395+0
FBSD-12mod  SMP-1  1386+0
FBSD-12mod  SMP-8  1422+0
FBSD-12mod  SMP-1  1270+0   # use iflib (lose 8% performance)
FBSD-12mod  SMP-8  1279+0   # use iflib (lose 10% performance using more CPU)

1377+0 means 1377 kpps sent and 0 kpps errors, etc.  SMP-8 means use all 8
CPUs.  SMP-1 means restrict netblast to 1 CPU different from the taskqueue
CPUs using cpuset.

New results:

FBSD-11     SMP-8  1440+0   # no iflib, no EM_MULTIQUEUE
FBSD-11     SMP-8  1486+241 # no iflib, use EM_MULTIQUEUE (now saturate 1Gbps)
FBSD-cur    SMP-8   533+0   # use iflib, use i386 with 4G KVA

iflib only decimates performance relative to the FreeBSD-11 version
with no EM_MULTIQUEUE, but EM_MULTIQUEUE gives better queueing using
more CPUs.  This gives the extra 10-20% of performance needed to
saturate the NIC and 1Gbps ethernet.  The FreeBSD-current version is
not directly comparable since using 4G KVA on i386 reduces performance
by about a factor of 2.5 for all loads with mostly small i/o's (for
128K disk i/o's the reduction is only 10-20%).  i386 ran at about the
same speed as amd64 when it had 1GB KVA (but I don't have any saved
results for amd64 to compare with precisely).  This is all with
security-related things like ibrs unavailable or turned off.

All versions use normal Intel interrupt moderation which gives an interrupt
rate of 8k/sec.

Old versions of em use a "fast" interrupt handler and a slow switch
to a taskqueue.  This gives a context switch rate of about 16k/sec.
In the SMP case, netblast normally runs on another CPU and I think it
fills h/w tx queue(s) synchronously, and the taskqueue only does minor
cleanups.  Old em also has a ping latency about 10% lower than
with iflib (73 usec instead of 80 usec after setting em.x.itr to 0 and
other tuning to kill interrupt moderation, and similar for a bge NIC
on the other end).  The synchronous queue filling probably improves
latency, but it is hard to see how it makes a difference of more than
1 usec.  73 is already too high.  An old PRO1000 Intel NIC has a latency
of only 50 usec on the same network.  The switch costs about 20 usec
of this.

iflib uses taskqueue more.  netblast normally runs on another CPU and
I think it only fills s/w tx queue(s) synchronously, and wakes up the
taskqueues for every packet.  The CPUs are almost fast enough to keep
up, and the system does about 1M context switches for this (in versions
other than i386 with 4G KVA).  That is slightly more than 2 packets per
switch to get the speed of 1279 kpps.  netblast uses 100% of 1 CPU but
the taskqueues don't saturate their CPUs although they should so as to
do even more context switches.  They still use a lot of CPU (about 50%
of 1 CPU more than in old em).  These context switches lose by doing
the opposite of interrupt moderation.

I can "fix" the extra context switches and restore some of the lost
performance and most of the lost CPU by running netblast on the same
CPU as the main taskqueue (and using my normal 

Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2018-12-17 Thread Andrew Gallatin

On 12/5/18 9:20 AM, Slava Shwartsman wrote:

Author: slavash
Date: Wed Dec  5 14:20:57 2018
New Revision: 341578
URL: https://svnweb.freebsd.org/changeset/base/341578

Log:
   mlx5en: Remove the DRBR and associated logic in the transmit path.
   
   The hardware queues are deep enough currently and using the DRBR and associated
   callbacks only leads to more task switching in the TX path. There is also a race
   setting the queue_state which can lead to hung TX rings.


The point of DRBR in the tx path is not simply to provide a software
ring for queuing excess packets.  Rather, it provides a mechanism to
avoid lock contention by shoving a packet into the software ring, where
it will later be found & processed, rather than blocking the caller on
a mtx lock.  I'm concerned you may have introduced a performance
regression for use cases where you have N:1 or N:M lock contention,
where many threads on different cores are contending for the same tx
queue.  The state of the art for this is no longer DRBR, but mp_ring,
as used by both cxgbe and iflib.

For well behaved workloads (like Netflix's), I don't anticipate
this being a performance issue.  However, I worry that this will impact
other workloads; you should consider running some testing of
N:1 contention, e.g. 128 netperfs running in parallel with only
a few nic tx rings.

Sorry for the late reply.. I'm behind on my -committers email.  If you
have not already MFC'ed this, you may want to reconsider.

Drew
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"


svn commit: r341578 - head/sys/dev/mlx5/mlx5_en

2018-12-05 Thread Slava Shwartsman
Author: slavash
Date: Wed Dec  5 14:20:57 2018
New Revision: 341578
URL: https://svnweb.freebsd.org/changeset/base/341578

Log:
  mlx5en: Remove the DRBR and associated logic in the transmit path.
  
  The hardware queues are deep enough currently and using the DRBR and 
associated
  callbacks only leads to more task switching in the TX path. There is also a race
  setting the queue_state which can lead to hung TX rings.
  
  Submitted by:   hselasky@
  Approved by:hselasky (mentor)
  MFC after:  1 week
  Sponsored by:   Mellanox Technologies

Modified:
  head/sys/dev/mlx5/mlx5_en/en.h
  head/sys/dev/mlx5/mlx5_en/mlx5_en_ethtool.c
  head/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
  head/sys/dev/mlx5/mlx5_en/mlx5_en_tx.c

Modified: head/sys/dev/mlx5/mlx5_en/en.h
==
--- head/sys/dev/mlx5/mlx5_en/en.h  Wed Dec  5 14:20:26 2018
(r341577)
+++ head/sys/dev/mlx5/mlx5_en/en.h  Wed Dec  5 14:20:57 2018
(r341578)
@@ -473,7 +473,6 @@ struct mlx5e_params {
   m(+1, u64 tx_coalesce_usecs, "tx_coalesce_usecs", "Limit in usec for joining tx packets") \
   m(+1, u64 tx_coalesce_pkts, "tx_coalesce_pkts", "Maximum number of tx packets to join") \
   m(+1, u64 tx_coalesce_mode, "tx_coalesce_mode", "0: EQE mode 1: CQE mode") \
-  m(+1, u64 tx_bufring_disable, "tx_bufring_disable", "0: Enable bufring 1: Disable bufring") \
   m(+1, u64 tx_completion_fact, "tx_completion_fact", "1..MAX: Completion event ratio") \
   m(+1, u64 tx_completion_fact_max, "tx_completion_fact_max", "Maximum completion event ratio") \
   m(+1, u64 hw_lro, "hw_lro", "set to enable hw_lro") \
@@ -606,8 +605,6 @@ struct mlx5e_sq {
struct  mlx5e_sq_stats stats;
 
struct  mlx5e_cq cq;
-   struct  task sq_task;
-   struct  taskqueue *sq_tq;
 
/* pointers to per packet info: write@xmit, read@completion */
struct  mlx5e_sq_mbuf *mbuf;
@@ -628,7 +625,6 @@ struct mlx5e_sq {
struct  mlx5_wq_ctrl wq_ctrl;
struct  mlx5e_priv *priv;
int tc;
-   unsigned int queue_state;
 } __aligned(MLX5E_CACHELINE_SIZE);
 
 static inline bool
@@ -857,7 +853,6 @@ void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, in
 void   mlx5e_rx_cq_comp(struct mlx5_core_cq *);
 void   mlx5e_tx_cq_comp(struct mlx5_core_cq *);
 struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
-void   mlx5e_tx_que(void *context, int pending);
 
 intmlx5e_open_flow_table(struct mlx5e_priv *priv);
 void   mlx5e_close_flow_table(struct mlx5e_priv *priv);

Modified: head/sys/dev/mlx5/mlx5_en/mlx5_en_ethtool.c
==
--- head/sys/dev/mlx5/mlx5_en/mlx5_en_ethtool.c Wed Dec  5 14:20:26 2018
(r341577)
+++ head/sys/dev/mlx5/mlx5_en/mlx5_en_ethtool.c Wed Dec  5 14:20:57 2018
(r341578)
@@ -703,18 +703,6 @@ mlx5e_ethtool_handler(SYSCTL_HANDLER_ARGS)
mlx5e_open_locked(priv->ifp);
break;
 
-   case MLX5_PARAM_OFFSET(tx_bufring_disable):
-   /* rangecheck input value */
-   priv->params_ethtool.tx_bufring_disable =
-   priv->params_ethtool.tx_bufring_disable ? 1 : 0;
-
-   /* reconfigure the sendqueues, if any */
-   if (was_opened) {
-   mlx5e_close_locked(priv->ifp);
-   mlx5e_open_locked(priv->ifp);
-   }
-   break;
-
case MLX5_PARAM_OFFSET(tx_completion_fact):
/* network interface must be down */
if (was_opened)

Modified: head/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
==
--- head/sys/dev/mlx5/mlx5_en/mlx5_en_main.cWed Dec  5 14:20:26 2018
(r341577)
+++ head/sys/dev/mlx5/mlx5_en/mlx5_en_main.cWed Dec  5 14:20:57 2018
(r341578)
@@ -1,5 +1,5 @@
 /*-
- * Copyright (c) 2015 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2015-2018 Mellanox Technologies. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -1184,37 +1184,6 @@ mlx5e_create_sq(struct mlx5e_channel *c,
sq->min_inline_mode = priv->params.tx_min_inline_mode;
sq->vlan_inline_cap = MLX5_CAP_ETH(mdev, wqe_vlan_insert);
 
-   /* check if we should allocate a second packet buffer */
-   if (priv->params_ethtool.tx_bufring_disable == 0) {
-   sq->br = buf_ring_alloc(MLX5E_SQ_TX_QUEUE_SIZE, M_MLX5EN,
-   M_WAITOK, &sq->lock);
-   if (sq->br == NULL) {
-			if_printf(c->ifp, "%s: Failed allocating sq drbr buffer\n",
-   __func__);
-   err = -ENOMEM;
-   goto err_free_sq_db;
-   }
-
-   sq->sq_tq = task