Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
On 2019-12-18 07:26, Hans Petter Selasky wrote:
> On 2019-12-17 18:14, Andrey V. Elsukov wrote:
>> On 13.12.2019 17:27, Hans Petter Selasky wrote:
>>> On 2019-12-13 14:40, Andrey V. Elsukov wrote:
>>>> On 05.12.2018 17:20, Slava Shwartsman wrote:
>>>>> Author: slavash
>>>>> Date: Wed Dec 5 14:20:57 2018
>>>>> New Revision: 341578
>>>>> URL: https://svnweb.freebsd.org/changeset/base/341578
>>>>>
>>>>> Log:
>>>>>   mlx5en: Remove the DRBR and associated logic in the transmit path.
>>>>>
>>>>>   The hardware queues are deep enough currently and using the DRBR and
>>>>>   associated callbacks only leads to more task switching in the TX
>>>>>   path. There is also a race setting the queue_state which can lead
>>>>>   to hung TX rings.
>>>>
>>>> JFYI. We have compared the same router+firewall workloads on the host
>>>> with this change and before, and I can say that without DRBR on TX we
>>>> now constantly see several percent of packet drops due to ENOBUFS
>>>> errors from mlx5e_xmit().
>>>
>>> Have you tried to tune the TX/RX parameters? Especially the
>>> tx_queue_size.
>>
>> We use the following settings:
>>
>> % sysctl dev.mce.4.conf. | grep que
>> dev.mce.4.conf.rx_queue_size: 16384
>> dev.mce.4.conf.tx_queue_size: 16384
>> dev.mce.4.conf.rx_queue_size_max: 16384
>> dev.mce.4.conf.tx_queue_size_max: 16384
>>
>> Also, I have previously patched the MLX5E_SQ_TX_QUEUE_SIZE value up to
>> 16384.
>
> Hi,
>
> What about the other parameters, did you tune any of those? At what rate
> does this happen? Can you send me the full dev.mce.4 sysctl tree
> off-list?

Are you using any performance options like RSS in the kernel?

How many NUMA domains does this machine have?

Have you tuned the driver threads, like binding interrupt threads to CPUs?

--HPS
___
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
On 2019-12-17 18:14, Andrey V. Elsukov wrote:
> On 13.12.2019 17:27, Hans Petter Selasky wrote:
>> On 2019-12-13 14:40, Andrey V. Elsukov wrote:
>>> On 05.12.2018 17:20, Slava Shwartsman wrote:
>>>> Author: slavash
>>>> Date: Wed Dec 5 14:20:57 2018
>>>> New Revision: 341578
>>>> URL: https://svnweb.freebsd.org/changeset/base/341578
>>>>
>>>> Log:
>>>>   mlx5en: Remove the DRBR and associated logic in the transmit path.
>>>>
>>>>   The hardware queues are deep enough currently and using the DRBR and
>>>>   associated callbacks only leads to more task switching in the TX
>>>>   path. There is also a race setting the queue_state which can lead
>>>>   to hung TX rings.
>>>
>>> JFYI. We have compared the same router+firewall workloads on the host
>>> with this change and before, and I can say that without DRBR on TX we
>>> now constantly see several percent of packet drops due to ENOBUFS
>>> errors from mlx5e_xmit().
>>
>> Have you tried to tune the TX/RX parameters? Especially the
>> tx_queue_size.
>
> We use the following settings:
>
> % sysctl dev.mce.4.conf. | grep que
> dev.mce.4.conf.rx_queue_size: 16384
> dev.mce.4.conf.tx_queue_size: 16384
> dev.mce.4.conf.rx_queue_size_max: 16384
> dev.mce.4.conf.tx_queue_size_max: 16384
>
> Also, I have previously patched the MLX5E_SQ_TX_QUEUE_SIZE value up to
> 16384.

Hi,

What about the other parameters, did you tune any of those? At what rate does this happen? Can you send me the full dev.mce.4 sysctl tree off-list?

--HPS
Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
On 13.12.2019 17:27, Hans Petter Selasky wrote:
> On 2019-12-13 14:40, Andrey V. Elsukov wrote:
>> On 05.12.2018 17:20, Slava Shwartsman wrote:
>>> Author: slavash
>>> Date: Wed Dec 5 14:20:57 2018
>>> New Revision: 341578
>>> URL: https://svnweb.freebsd.org/changeset/base/341578
>>>
>>> Log:
>>>   mlx5en: Remove the DRBR and associated logic in the transmit path.
>>>
>>>   The hardware queues are deep enough currently and using the DRBR and
>>>   associated callbacks only leads to more task switching in the TX
>>>   path. There is also a race setting the queue_state which can lead
>>>   to hung TX rings.
>>
>> JFYI. We have compared the same router+firewall workloads on the host
>> with this change and before, and I can say that without DRBR on TX we
>> now constantly see several percent of packet drops due to ENOBUFS
>> errors from mlx5e_xmit().
>
> Have you tried to tune the TX/RX parameters?
>
> Especially the tx_queue_size.

We use the following settings:

% sysctl dev.mce.4.conf. | grep que
dev.mce.4.conf.rx_queue_size: 16384
dev.mce.4.conf.tx_queue_size: 16384
dev.mce.4.conf.rx_queue_size_max: 16384
dev.mce.4.conf.tx_queue_size_max: 16384

Also, I have previously patched the MLX5E_SQ_TX_QUEUE_SIZE value up to 16384.

-- 
WBR, Andrey V. Elsukov
Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
On 2019-12-13 14:40, Andrey V. Elsukov wrote:
> On 05.12.2018 17:20, Slava Shwartsman wrote:
>> Author: slavash
>> Date: Wed Dec 5 14:20:57 2018
>> New Revision: 341578
>> URL: https://svnweb.freebsd.org/changeset/base/341578
>>
>> Log:
>>   mlx5en: Remove the DRBR and associated logic in the transmit path.
>>
>>   The hardware queues are deep enough currently and using the DRBR and
>>   associated callbacks only leads to more task switching in the TX
>>   path. There is also a race setting the queue_state which can lead
>>   to hung TX rings.
>
> JFYI. We have compared the same router+firewall workloads on the host
> with this change and before, and I can say that without DRBR on TX we
> now constantly see several percent of packet drops due to ENOBUFS
> errors from mlx5e_xmit().

Have you tried to tune the TX/RX parameters?

Especially the tx_queue_size.

--HPS
Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
On 05.12.2018 17:20, Slava Shwartsman wrote:
> Author: slavash
> Date: Wed Dec 5 14:20:57 2018
> New Revision: 341578
> URL: https://svnweb.freebsd.org/changeset/base/341578
>
> Log:
>   mlx5en: Remove the DRBR and associated logic in the transmit path.
>
>   The hardware queues are deep enough currently and using the DRBR and
>   associated callbacks only leads to more task switching in the TX path.
>   There is also a race setting the queue_state which can lead to hung
>   TX rings.

JFYI. We have compared the same router+firewall workloads on the host with this change and before, and I can say that without DRBR on TX we now constantly see several percent of packet drops due to ENOBUFS errors from mlx5e_xmit().

-- 
WBR, Andrey V. Elsukov
Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
On Wed, 19 Dec 2018, Bruce Evans wrote: On Wed, 19 Dec 2018, Bruce Evans wrote: On Mon, 17 Dec 2018, Andrew Gallatin wrote: On 12/17/18 2:08 PM, Bruce Evans wrote: * ... iflib uses queuing techniques to significantly pessimize em NICs with 1 hardware queue.?? On fast machines, it attempts to do 1 context switch per ... This can happen even w/o contention when "abdicate" is enabled in mp ring. I complained about this as well, and the default was changed in mp ring to not always "abdicate" (eg, switch to the tq to handle the packet). Abdication substantially pessimizes Netflix style web uncontended workloads, but it generally helps small packet forwarding. It is interesting that you see the opposite. I should try benchmarking with just a single ring. Hmm, I didn't remember "abdicated" and never knew about the sysctl for it (the sysctl is newer), but I notices the slowdown from near the first commit for it (r323954) and already used the folowing workaround for it: ... This essentialy just adds back the previous code with a flag to check both versions. Hopefully the sysctl can do the same thing. It doesn't. Setting tx_abdicate to 1 gives even more context switches (almost twice as many, 800k/sec instead of 400k/sec, on i386 pessimized by INVARIANTS, WITNESS, !WITNESS_SKIPSPIN, 4G KVA and more. Without ... I now understand most of the slownesses and variations in benchmarks. Short summary: After arcane tuning including a sysctl only available in my version of SCHED_4BSD, on amd64 iflib in -current runs as fast as old old em with EM_MULTIQUEUE and no other tuning in FreeBSD-11; i386 also needs a CPU almost 3 times faster to compensate for the overhead of having 4G KVA (bit no other security pessimizations in either). Long summary: iflib with tx_abdicate=0 runs a bit like old em without EM_MULTIQUEUE, provided the NIC is I218V and not PRO1000 and/or the CPU is too slow to saturate the NIC and/or the network. iflib is just 10% slower. 
Neither does excessive context switches to tgq with I218V (context switches seem to be limited to not much more than 2 per h/w interrupt, and h/w interrupts are normally moderated to 8kHz). However, iflib does excessive context switches for PRO1000. I don't know if this is for hardware reasons or just for dropping packets. iflib with tx_abdicate=1 runs a bit like old em with EM_MULTIQUEUE. Due to general slowness, even a 4GHz i7 has difficulty saturating 1Gbps ethernet with small packets. tx_abdicate=1 allows it to saturate by using tgq more. This causes lots of context switches and otherwise uses lots of CPU (60% of a 4GHz i7 for iflib). Old em with EM_MULTIQUEUE gives identical kpps and saturation and dropped packets for spare cycles on the CPU producing the packets, but I think it does less context switches and uses less CPU for tgq. This is mostly for the I218V. I got apparently-qualitativly-different results on i386 because I mostly tested i386 with the PRO1000 where there are excessive context switches on both amd64 and i386 with tx_abdicate=0. tx_abdicate=1 gives even more excessive context switches (about twice as many) for the PRO1000. I got apparently-qualitativly-different results for some old benchmarks because I used an old version of FreeBSD (r332488) for many of them, and also had version problems within this version. iflib in this version forces tx_abdicate=1. I noticed the extra context switches from this long ago, and had an option which defaulted to using older iflib code which seemed to work better. But I misedited the non-default case of this and had the double drainage check bug that was added in -current in r366560 and fixed in -current in r341824. This gave excessive extra context switches, so the commit that added abdication (r323954) seemed to be even slower than it was. 
The fastest case by a significant amount (saturation on I218V using 1.6 times less CPU) is with netblast bound to the same CPU as tgq, PREEMPTION* not configured, and my scheduler modification that reduces preemption even further, and this modification selected using a sysctl), and tx_abdicate=1. Then the scheduler modification delays most switches to tgq, and tx_abdicate=1 apparently allows such context switches when they are useful (I think netblast fills a queue and then tx_abdiscate=1 gives a context switch immediately, but tx_abdicate=0 doesn't give a context switch soon enough). But without the scheduler modification, this is the slowest case (tx_abdicate=1 forces context switches to tgq after every packet, and since netblast is bound to the same CPU, it can't run. In both cases, only 1 CPU is used, but the context switches reduce throughput by about a factor of 2. It is less clear why througput counting dropped packets is lower for netblast not bound and tx_abdicate=0. Then tgq apparently doesn't run promptly enough to saturate the net, but netblast has its own CPU so it doesn't stop when tgq runs so it should be able to produce even
Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
On Wed, 19 Dec 2018, Bruce Evans wrote: On Mon, 17 Dec 2018, Andrew Gallatin wrote: On 12/17/18 2:08 PM, Bruce Evans wrote: * ... iflib uses queuing techniques to significantly pessimize em NICs with 1 hardware queue.?? On fast machines, it attempts to do 1 context switch per ... This can happen even w/o contention when "abdicate" is enabled in mp ring. I complained about this as well, and the default was changed in mp ring to not always "abdicate" (eg, switch to the tq to handle the packet). Abdication substantially pessimizes Netflix style web uncontended workloads, but it generally helps small packet forwarding. It is interesting that you see the opposite. I should try benchmarking with just a single ring. Hmm, I didn't remember "abdicated" and never knew about the sysctl for it (the sysctl is newer), but I notices the slowdown from near the first commit for it (r323954) and already used the folowing workaround for it: ... This essentialy just adds back the previous code with a flag to check both versions. Hopefully the sysctl can do the same thing. It doesn't. Setting tx_abdicate to 1 gives even more context switches (almost twice as many, 800k/sec instead of 400k/sec, on i386 pessimized by INVARIANTS, WITNESS, !WITNESS_SKIPSPIN, 4G KVA and more. Without pessimizations it does 1M/sec instea of 400k/sec). The behaviour is easy to understand by watchomg top -SH -m io with netblast bound to the same CPU as the main tgq. Then netblast does involuntary context switches at the same rate that the tgq does voluntary context switches, and tx_abdicate=1 doubles this rare. netblast only switches at the quantum rate (11 per second) when not bound (I think it does null switches and it is a bug to count these as switches, but even null switches do too much). This is also without my usual default of !PREEMPTION && !IPI_PREEMPTION. Binding the netblast to the same CPU as the tgq only stops the excessive context switches wihen !PREEMPTION. My hack might depend on this too. 
Unfortunately, the hack is not in the same kernels as the sysctl, and I already have too many combinations to test. Another test with only 4G KVA (no INVARIANTS, etc., no PREEMPTION): tx_abdicate=0: tgq switch rate 997-1017k/sec (16k/sec if netblast bound) tx_abdicate=1: tgq switch rate 1300-1350k/sec (16k/sec if netblast bound) Another test on amd64 to escape i386 4G KVA pessimizations: tx_abdicate=0: tgq switch rate 1110-1220k/sec (16k/sec if netblast bound) tx_abdicate=1: tgq switch rate 1360-1430k/sec (16k/sec if netblast bound) When netblast is bound to the tgq's CPU, the tgq actually runs on another CPU. Apparently, the binding is weak ot this is a bugfeature in my scheduler. When tx_abdicate=1, the switch rate is close to the packet rate. Since the NIC can't keep up, most packets are dropped. On amd64 with tx_abdicate=1, the packet rates are: netblast bound: 313kpps sent, 1604kpps dropped netblast unbound: 253kpps sent, 1153kpps dropped 253kpps sent is bad. This indicates large latencies (not due to !PREEMPTION or secheduler bugs AFAIK). Most tests with netblast unbound seemed to saturate the NIC at 280kpps (but the tests with netblast bound shows that the NIC can go a little faster). Even an old 2GHz CPU can reach 280kpps. This shows another problem with taskqueues. It takes context switches just to decide to drop packets. Previous versions of iflib were much slower at dropping packets. Some had rates closer to the low send rate than the 1604kpps achieved above. FreeBSD-5 running on a single 3 times slower CPU can drop packets at 2124kpps, mainly by dropping them in ip_output() after peeking at the software ifqs to see that there is no space. IFF_MONITOR gives better tests of the syscall overhead. 
Another test with amd64 and I218V instead of PRO1000: netblast bound, !abdicate: 1243kpps sent, 0kpps dropped (16k/sec csw) netblast unbound, !abdicate: 1236kpps sent, 0kpps dropped (16k/sec csw) netblast bound, abdicate:1485kpps sent, 243kpps dropped (16k/sec csw) netblast unbound, abdicate: 1407kpps sent, 1.7kpps dropped (850k/sec csw) There is an i386 dependency after all! !abdicate works on amd64 but not on i386 to prevent the excessive context switches. Unfortunately, it also reduces kpps by almost 20% and leaves no spare CPU for dropping packets. The best case of netblast bound, abdicate is competitive with FreeBSD-11 on i386 with EM_MULTIQUEUE: above result repeated: netblast bound, abdicate:1485kpps sent, 243kpps dropped (16k/sec csw) previous best result: FBSD-11 SMP-8 1486+241 # no iflib, use EM_MULTIQUEUE (now saturate 1Gbps) (this is without PREEMPTION* and without binding netblast). The above for -current also has the lowest possible CPU use (100% of 1 CPU for all threads, while netblast unbound takes 100% of 1 CPU for netblast and 60% of another CPU for tgq), and I think the FBSD=11 case takes 100% of 1 CPU for netblast unbound and a tiny% of another CPU fo
Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
On Mon, 17 Dec 2018, Andrew Gallatin wrote: On 12/17/18 2:08 PM, Bruce Evans wrote: On Mon, 17 Dec 2018, Andrew Gallatin wrote: On 12/5/18 9:20 AM, Slava Shwartsman wrote: Author: slavash Date: Wed Dec?? 5 14:20:57 2018 New Revision: 341578 URL: https://urldefense.proofpoint.com/v2/url?u=https-3A__svnweb.freebsd.org_changeset_base_341578&d=DwIDaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1i6YHLR0Sj_gZ4adc&r=Ed-falealxPeqc22ehgAUCLh8zlZbibZLSMWJeZro4A&m=BFp2c_-S0jnzRZJF2APwvTwmnmVFcyjcnBvHRZ3Locc&s=b7fvhOzf_b5bMVGquu4SaBhMNql5N8dVPAvpfKtz53Q&e= Log: mlx5en: Remove the DRBR and associated logic in the transmit path. ?? The hardware queues are deep enough currently and using the DRBR and associated callbacks only leads to more task switching in the TX path. The is also a race setting the queue_state which can lead to hung TX rings. The point of DRBR in the tx path is not simply to provide a software ring for queuing excess packets.?? Rather it provides a mechanism to avoid lock contention by shoving a packet into the software ring, where it will later be found & processed, rather than blocking the caller on a mtx lock. I'm concerned you may have introduced a performance regression for use cases where you have N:1?? or N:M lock contention where many threads on different cores are contending for the same tx queue.?? The state of the art for this is no longer DRBR, but mp_ring, as used by both cxgbe and iflib. iflib uses queuing techniques to significantly pessimize em NICs with 1 hardware queue.?? On fast machines, it attempts to do 1 context switch per Bruce Evans didn't write that. Some mail program converted 2-space sentence breaks to \xc2\xa0. This can happen even w/o contention when "abdicate" is enabled in mp ring. I complained about this as well, and the default was changed in mp ring to not always "abdicate" (eg, switch to the tq to handle the packet). Abdication substantially pessimizes Netflix style web uncontended workloads, but it generally helps small packet forwarding. 
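[Editor's note: the abdicate-vs-inline-drain trade-off described above can be sketched as a toy model. This is illustrative C only, not the actual iflib/mp_ring code; all names (toy_ring, toy_xmit_abdicate, etc.) are invented, and the taskqueue run is modeled as a synchronous call. The point it shows: with tx_abdicate=1 every transmit marks the ring handed-off and wakes the taskqueue, so the wakeup (context switch) rate tracks the packet rate, while with tx_abdicate=0 an uncontended producer drains the ring itself and needs no wakeup at all.]

```c
/* Toy model of the two mp_ring transmit policies (NOT the real code). */

enum ring_state { IDLE, BUSY, ABDICATED };

struct toy_ring {
	enum ring_state state;
	int pending;	/* packets waiting in the software ring */
	int sent;	/* packets handed to "hardware" */
	int wakeups;	/* taskqueue wakeups (proxy for context switches) */
};

static void
toy_drain(struct toy_ring *r)
{
	r->sent += r->pending;
	r->pending = 0;
	r->state = IDLE;
}

/*
 * tx_abdicate=1: always mark the ring ABDICATED and wake the taskqueue,
 * i.e. up to one wakeup per packet (modeled here as a synchronous drain).
 */
static void
toy_xmit_abdicate(struct toy_ring *r)
{
	r->pending++;
	r->state = ABDICATED;
	r->wakeups++;		/* would be GROUPTASK_ENQUEUE() */
	toy_drain(r);		/* taskqueue runs "later" */
}

/*
 * tx_abdicate=0: if the ring was idle, the producer becomes the consumer
 * and drains in place, avoiding the switch to the taskqueue entirely.
 */
static void
toy_xmit_inline(struct toy_ring *r)
{
	r->pending++;
	if (r->state == IDLE) {
		r->state = BUSY;
		toy_drain(r);	/* drained by the producer, no wakeup */
	} else {
		r->wakeups++;	/* someone else owns it; let the tq finish */
	}
}
```

Under no contention the inline policy sends the same packets with zero wakeups, which matches the uncontended (Netflix-style) win; the abdicate policy pays one wakeup per packet, which matches the measured ~packet-rate context switch numbers.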
It is interesting that you see the opposite. I should try benchmarking with just a single ring. Hmm, I didn't remember "abdicated" and never knew about the sysctl for it (the sysctl is newer), but I notices the slowdown from near the first commit for it (r323954) and already used the folowing workaround for it: XX Index: iflib.c XX === XX --- iflib.c (revision 332488) XX +++ iflib.c (working copy) XX @@ -1,3 +1,5 @@ XX +int bde_oldnet = 1; XX + XX /*- XX * Copyright (c) 2014-2018, Matthew Macy XX * All rights reserved. XX @@ -3650,9 +3652,17 @@ XX IFDI_TX_QUEUE_INTR_ENABLE(ctx, txq->ift_id); XX return; XX } XX +if (bde_oldnet) { XX if (txq->ift_db_pending) XX ifmp_ring_enqueue(txq->ift_br, (void **)&txq, 1, TX_BATCH_SIZE); XX +else XX +ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE); XX +} else { XX +if (txq->ift_db_pending) XX +ifmp_ring_enqueue(txq->ift_br, (void **)&txq, 1, TX_BATCH_SIZE); XX ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE); XX +} XX +ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE); XX if (ctx->ifc_flags & IFC_LEGACY) XX IFDI_INTR_ENABLE(ctx); XX else { XX @@ -3862,8 +3872,11 @@ XX DBG_COUNTER_INC(tx_seen); XX err = ifmp_ring_enqueue(txq->ift_br, (void **)&m, 1, TX_BATCH_SIZE); XX XX +if (!bde_oldnet) XX GROUPTASK_ENQUEUE(&txq->ift_task); XX if (err) { XX +if (bde_oldnet) XX +GROUPTASK_ENQUEUE(&txq->ift_task); XX /* support forthcoming later */ XX #ifdef DRIVER_BACKPRESSURE XX txq->ift_closed = TRUE; XX @@ -3870,6 +3883,9 @@ XX #endif XX ifmp_ring_check_drainage(txq->ift_br, TX_BATCH_SIZE); XX m_freem(m); XX +} else if (TXQ_AVAIL(txq) < (txq->ift_size >> 1)) { XX +if (bde_oldnet) XX +GROUPTASK_ENQUEUE(&txq->ift_task); XX } XX XX return (err); XX Index: mp_ring.c XX === XX --- mp_ring.c(revision 332488) XX +++ mp_ring.c(working copy) XX @@ -1,3 +1,11 @@ XX +#include "opt_pci.h" XX + XX +#ifdef DEV_PCI XX +extern int bde_oldnet; XX +#else XX +#define bde_oldnet 0 XX +#endif XX + XX /*- XX * Copyright (c) 2014 Chelsio Communications, 
Inc. XX * All rights reserved. XX @@ -454,12 +462,25 @@ XX do { XX os.state = ns.state = r->state; XX ns.pidx_tail = pidx_stop; XX +if (bde_oldnet) XX +ns.flags = BUSY; XX +else { XX if (os.flags == IDLE) XX ns.flags = ABDICATED; XX +} XX } while (atomic_cmpset_rel_64(&r->state, os.state, ns.state) == 0); XX critical_exit(); XX counter_u64_add(r->enqueues, n); XX XX +if (bde_oldnet) { XX +/* XX + * Turn into a consumer if some other thread i
Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
On Mon, 17 Dec 2018 14:50:04 -0500 Andrew Gallatin wrote:
> On 12/17/18 2:08 PM, Bruce Evans wrote:
[snip]
>> iflib uses queuing techniques to significantly pessimize em NICs with
>> 1 hardware queue. On fast machines, it attempts to do 1 context switch
>> per
>
> This can happen even w/o contention when "abdicate" is enabled in mp
> ring. I complained about this as well, and the default was changed in
> mp ring to not always "abdicate" (eg, switch to the tq to handle the
> packet). Abdication substantially pessimizes Netflix style web
> uncontended workloads, but it generally helps small packet forwarding.
>
> It is interesting that you see the opposite. I should try benchmarking
> with just a single ring.

Why are iflib and ifdi compiled into EVERY kernel with device ether and/or device pci when only a few NICs actually use iflib? This is really unnecessary bloat in an already bloated kernel.

I use if_re, which does not use iflib. I removed iflib and ifdi from /sys/conf/files and my network still works just fine.

It seems to me that these iflib entries need finer-grained options, e.g. pulling them into the kernel build only when one of the NICs which use iflib is enabled.

-- 
Gary Jennejohn
Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
On 12/17/18 2:08 PM, Bruce Evans wrote: On Mon, 17 Dec 2018, Andrew Gallatin wrote: On 12/5/18 9:20 AM, Slava Shwartsman wrote: Author: slavash Date: Wed Dec 5 14:20:57 2018 New Revision: 341578 URL: https://urldefense.proofpoint.com/v2/url?u=https-3A__svnweb.freebsd.org_changeset_base_341578&d=DwIDaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1i6YHLR0Sj_gZ4adc&r=Ed-falealxPeqc22ehgAUCLh8zlZbibZLSMWJeZro4A&m=BFp2c_-S0jnzRZJF2APwvTwmnmVFcyjcnBvHRZ3Locc&s=b7fvhOzf_b5bMVGquu4SaBhMNql5N8dVPAvpfKtz53Q&e= Log: mlx5en: Remove the DRBR and associated logic in the transmit path. The hardware queues are deep enough currently and using the DRBR and associated callbacks only leads to more task switching in the TX path. The is also a race setting the queue_state which can lead to hung TX rings. The point of DRBR in the tx path is not simply to provide a software ring for queuing excess packets. Rather it provides a mechanism to avoid lock contention by shoving a packet into the software ring, where it will later be found & processed, rather than blocking the caller on a mtx lock. I'm concerned you may have introduced a performance regression for use cases where you have N:1 or N:M lock contention where many threads on different cores are contending for the same tx queue. The state of the art for this is no longer DRBR, but mp_ring, as used by both cxgbe and iflib. iflib uses queuing techniques to significantly pessimize em NICs with 1 hardware queue. On fast machines, it attempts to do 1 context switch per This can happen even w/o contention when "abdicate" is enabled in mp ring. I complained about this as well, and the default was changed in mp ring to not always "abdicate" (eg, switch to the tq to handle the packet). Abdication substantially pessimizes Netflix style web uncontended workloads, but it generally helps small packet forwarding. It is interesting that you see the opposite. I should try benchmarking with just a single ring. (small) tx packet and can't keep up. 
On slow machines it has a chance of handling multiple packets per context switch, but since the machine is too slow it can't keep up and saturates at a slightly different point. Results for netblast $lanhost 5001 5 10 (5-byte payload for 10 seconds) on an I218V on Haswell 4 cores x 2 threads @4.08GHz running i386: Old results with no iflib and no EM_MULTIQUEUE except as indicated: FBSD-10 UP 1377+0 FBSD-11 UP 1326+0 FBSD-11 SMP-1 1484+0 FBSD-11 SMP-8 1395+0 FBSD-12mod SMP-1 1386+0 FBSD-12mod SMP-8 1422+0 FBSD-12mod SMP-1 1270+0 # use iflib (lose 8% performance) FBSD-12mod SMP-8 1279+0 # use iflib (lose 10% performance using more CPU) 1377+0 means 1377 kpps sent and 0 kpps errors, etc. SMP-8 means use all 8 CPUs. SMP-1 means restrict netblast to 1 CPU different from the taskqueue CPUs using cpuset. New results: FBSD-11 SMP-8 1440+0 # no iflib, no EM_MULTIQUEUE FBSD-11 SMP-8 1486+241 # no iflib, use EM_MULTIQUEUE (now saturate 1Gbps) FBSD-cur SMP-8 533+0 # use iflib, use i386 with 4G KVA iflib only decimates performance relative to the FreeBSD-11 version with no EM_MULTIQUEUE, but EM_MULTIQUEUE gives better queueing using more CPUs. This gives the extra 10-20% of performance needed to saturate the NIC and 1Gbps ethernet. The FreeBSD-current version is not directly comparable since using 4G KVA on i386 reduces performance by about a factor of 2.5 for all loads with mostly small i/o's (for 128K disk i/o's the reduction is only 10-20%). i386 ran at about the same speed as amd64 when it had 1GB KVA, but I don't have any savd results for amd64 to compare with precisely). This is all with security-related things like ibrs unavailable or turned off. All versions use normal Intel interrupt moderation which gives an interrupt rate of 8k/sec. Old versions of em use a "fast" interrupt handler and a slow switch to a taskqueue. This gives a contex switch rate of about 16k/ sec. 
In the SMP case, netblast normally runs on another CPU and I think it fills h/w tx queue(s) synchronously, and the taskqueue only does minor cleanups. Old em also has a ping latency of about 10% smaller than with iflib (73 usec instead of 80 usec after setting em.x.itr to 0 and other tuning to kill interrupt moderation, and similar for a bge NIC on the other end). The synchronous queue filling probably improves latency, but it is hard to see how it makes a difference of more than 1 usec. 73 is already too high. An old PRO1000 Intel NIC has a latency of only 50 usec on the same network. The switch costs about 20 usec of this. iflib uses taskqueue more. netblast normally runs on another CPU and I think it only fills s/w tx queue(s) synchronously, and wakes up the taskqueues for every packet. The CPUs are almost fast enough to keep up, and the system does about 1M context switches for this (in versions other than i386 with 4G KVA). That is slightly more than 2 packets per switch to
Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
On Mon, 17 Dec 2018, Andrew Gallatin wrote: On 12/5/18 9:20 AM, Slava Shwartsman wrote: Author: slavash Date: Wed Dec 5 14:20:57 2018 New Revision: 341578 URL: https://urldefense.proofpoint.com/v2/url?u=https-3A__svnweb.freebsd.org_changeset_base_341578&d=DwIDaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1i6YHLR0Sj_gZ4adc&r=Ed-falealxPeqc22ehgAUCLh8zlZbibZLSMWJeZro4A&m=BFp2c_-S0jnzRZJF2APwvTwmnmVFcyjcnBvHRZ3Locc&s=b7fvhOzf_b5bMVGquu4SaBhMNql5N8dVPAvpfKtz53Q&e= Log: mlx5en: Remove the DRBR and associated logic in the transmit path. The hardware queues are deep enough currently and using the DRBR and associated callbacks only leads to more task switching in the TX path. The is also a race setting the queue_state which can lead to hung TX rings. The point of DRBR in the tx path is not simply to provide a software ring for queuing excess packets. Rather it provides a mechanism to avoid lock contention by shoving a packet into the software ring, where it will later be found & processed, rather than blocking the caller on a mtx lock. I'm concerned you may have introduced a performance regression for use cases where you have N:1 or N:M lock contention where many threads on different cores are contending for the same tx queue. The state of the art for this is no longer DRBR, but mp_ring, as used by both cxgbe and iflib. iflib uses queuing techniques to significantly pessimize em NICs with 1 hardware queue. On fast machines, it attempts to do 1 context switch per (small) tx packet and can't keep up. On slow machines it has a chance of handling multiple packets per context switch, but since the machine is too slow it can't keep up and saturates at a slightly different point. 
Results for netblast $lanhost 5001 5 10 (5-byte payload for 10 seconds) on an I218V on Haswell 4 cores x 2 threads @4.08GHz running i386:

Old results with no iflib and no EM_MULTIQUEUE except as indicated:

FBSD-10    UP     1377+0
FBSD-11    UP     1326+0
FBSD-11    SMP-1  1484+0
FBSD-11    SMP-8  1395+0
FBSD-12mod SMP-1  1386+0
FBSD-12mod SMP-8  1422+0
FBSD-12mod SMP-1  1270+0  # use iflib (lose 8% performance)
FBSD-12mod SMP-8  1279+0  # use iflib (lose 10% performance using more CPU)

1377+0 means 1377 kpps sent and 0 kpps errors, etc. SMP-8 means use all 8 CPUs. SMP-1 means restrict netblast to 1 CPU different from the taskqueue CPUs using cpuset.

New results:

FBSD-11  SMP-8  1440+0    # no iflib, no EM_MULTIQUEUE
FBSD-11  SMP-8  1486+241  # no iflib, use EM_MULTIQUEUE (now saturate 1Gbps)
FBSD-cur SMP-8   533+0    # use iflib, use i386 with 4G KVA

iflib only decimates performance relative to the FreeBSD-11 version with no EM_MULTIQUEUE, but EM_MULTIQUEUE gives better queueing using more CPUs. This gives the extra 10-20% of performance needed to saturate the NIC and 1Gbps ethernet.

The FreeBSD-current version is not directly comparable, since using 4G KVA on i386 reduces performance by about a factor of 2.5 for all loads with mostly small i/o's (for 128K disk i/o's the reduction is only 10-20%). i386 ran at about the same speed as amd64 when it had 1GB KVA, but I don't have any saved results for amd64 to compare with precisely. This is all with security-related things like ibrs unavailable or turned off.

All versions use normal Intel interrupt moderation, which gives an interrupt rate of 8k/sec. Old versions of em use a "fast" interrupt handler and a slow switch to a taskqueue. This gives a context switch rate of about 16k/sec. In the SMP case, netblast normally runs on another CPU and I think it fills h/w tx queue(s) synchronously, and the taskqueue only does minor cleanups.
Old em also has a ping latency of about 10% smaller than with iflib (73 usec instead of 80 usec after setting em.x.itr to 0 and other tuning to kill interrupt moderation, and similar for a bge NIC on the other end). The synchronous queue filling probably improves latency, but it is hard to see how it makes a difference of more than 1 usec. 73 is already too high. An old PRO1000 Intel NIC has a latency of only 50 usec on the same network. The switch costs about 20 usec of this. iflib uses taskqueue more. netblast normally runs on another CPU and I think it only fills s/w tx queue(s) synchronously, and wakes up the taskqueues for every packet. The CPUs are almost fast enough to keep up, and the system does about 1M context switches for this (in versions other than i386 with 4G KVA). That is slightly more than 2 packets per switch to get the speed of 1279 kpps. netblast uses 100% of 1 CPU but the taskqueues don't saturate their CPUs although they should so as to do even more context switches. They still use a lot of CPU (about 50% of 1 CPU more than in old em). These context switches lose by doing the opposite of interrupt moderation. I can "fix" the extra context switches and restore some of the lost performance and most of the lost CPU by running netblast on the same CPU as the main taskqueue (and using my normal
Re: svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
On 12/5/18 9:20 AM, Slava Shwartsman wrote:

Author: slavash
Date: Wed Dec  5 14:20:57 2018
New Revision: 341578
URL: https://svnweb.freebsd.org/changeset/base/341578

Log:
  mlx5en: Remove the DRBR and associated logic in the transmit path.

  The hardware queues are deep enough currently and using the DRBR and
  associated callbacks only leads to more task switching in the TX path.
  There is also a race setting the queue_state which can lead to hung
  TX rings.

The point of DRBR in the tx path is not simply to provide a software ring for queuing excess packets.  Rather, it provides a mechanism to avoid lock contention by shoving a packet into the software ring, where it will later be found & processed, rather than blocking the caller on a mtx lock.  I'm concerned you may have introduced a performance regression for use cases where you have N:1 or N:M lock contention, where many threads on different cores are contending for the same tx queue.

The state of the art for this is no longer DRBR, but mp_ring, as used by both cxgbe and iflib.

For well behaved workloads (like Netflix's), I don't anticipate this being a performance issue.  However, I worry that this will impact other workloads, and you should consider running some testing of N:1 contention.  Eg, 128 netperfs running in parallel with only a few nic tx rings.

Sorry for the late reply.. I'm behind on my -committers email.  If you have not already MFC'ed this, you may want to reconsider.

Drew
svn commit: r341578 - head/sys/dev/mlx5/mlx5_en
Author: slavash
Date: Wed Dec  5 14:20:57 2018
New Revision: 341578
URL: https://svnweb.freebsd.org/changeset/base/341578

Log:
  mlx5en: Remove the DRBR and associated logic in the transmit path.

  The hardware queues are deep enough currently and using the DRBR and
  associated callbacks only leads to more task switching in the TX path.
  There is also a race setting the queue_state which can lead to hung
  TX rings.

  Submitted by:   hselasky@
  Approved by:    hselasky (mentor)
  MFC after:      1 week
  Sponsored by:   Mellanox Technologies

Modified:
  head/sys/dev/mlx5/mlx5_en/en.h
  head/sys/dev/mlx5/mlx5_en/mlx5_en_ethtool.c
  head/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
  head/sys/dev/mlx5/mlx5_en/mlx5_en_tx.c

Modified: head/sys/dev/mlx5/mlx5_en/en.h
==============================================================================
--- head/sys/dev/mlx5/mlx5_en/en.h	Wed Dec  5 14:20:26 2018	(r341577)
+++ head/sys/dev/mlx5/mlx5_en/en.h	Wed Dec  5 14:20:57 2018	(r341578)
@@ -473,7 +473,6 @@ struct mlx5e_params {
   m(+1, u64 tx_coalesce_usecs, "tx_coalesce_usecs", "Limit in usec for joining tx packets") \
   m(+1, u64 tx_coalesce_pkts, "tx_coalesce_pkts", "Maximum number of tx packets to join") \
   m(+1, u64 tx_coalesce_mode, "tx_coalesce_mode", "0: EQE mode 1: CQE mode") \
-  m(+1, u64 tx_bufring_disable, "tx_bufring_disable", "0: Enable bufring 1: Disable bufring") \
   m(+1, u64 tx_completion_fact, "tx_completion_fact", "1..MAX: Completion event ratio") \
   m(+1, u64 tx_completion_fact_max, "tx_completion_fact_max", "Maximum completion event ratio") \
   m(+1, u64 hw_lro, "hw_lro", "set to enable hw_lro") \
@@ -606,8 +605,6 @@ struct mlx5e_sq {
 	struct mlx5e_sq_stats stats;
 	struct mlx5e_cq cq;

-	struct task sq_task;
-	struct taskqueue *sq_tq;

 	/* pointers to per packet info: write@xmit, read@completion */
 	struct mlx5e_sq_mbuf *mbuf;
@@ -628,7 +625,6 @@ struct mlx5e_sq {
 	struct mlx5_wq_ctrl wq_ctrl;
 	struct mlx5e_priv *priv;
 	int	tc;
-	unsigned int queue_state;
 } __aligned(MLX5E_CACHELINE_SIZE);

 static inline bool
@@ -857,7 +853,6 @@ void	mlx5e_cq_error_event(struct mlx5_core_cq *mcq, in
 void	mlx5e_rx_cq_comp(struct mlx5_core_cq *);
 void	mlx5e_tx_cq_comp(struct mlx5_core_cq *);
 struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
-void	mlx5e_tx_que(void *context, int pending);

 int	mlx5e_open_flow_table(struct mlx5e_priv *priv);
 void	mlx5e_close_flow_table(struct mlx5e_priv *priv);

Modified: head/sys/dev/mlx5/mlx5_en/mlx5_en_ethtool.c
==============================================================================
--- head/sys/dev/mlx5/mlx5_en/mlx5_en_ethtool.c	Wed Dec  5 14:20:26 2018	(r341577)
+++ head/sys/dev/mlx5/mlx5_en/mlx5_en_ethtool.c	Wed Dec  5 14:20:57 2018	(r341578)
@@ -703,18 +703,6 @@ mlx5e_ethtool_handler(SYSCTL_HANDLER_ARGS)
 			mlx5e_open_locked(priv->ifp);
 		break;

-	case MLX5_PARAM_OFFSET(tx_bufring_disable):
-		/* rangecheck input value */
-		priv->params_ethtool.tx_bufring_disable =
-		    priv->params_ethtool.tx_bufring_disable ? 1 : 0;
-
-		/* reconfigure the sendqueues, if any */
-		if (was_opened) {
-			mlx5e_close_locked(priv->ifp);
-			mlx5e_open_locked(priv->ifp);
-		}
-		break;
-
 	case MLX5_PARAM_OFFSET(tx_completion_fact):
 		/* network interface must be down */
 		if (was_opened)

Modified: head/sys/dev/mlx5/mlx5_en/mlx5_en_main.c
==============================================================================
--- head/sys/dev/mlx5/mlx5_en/mlx5_en_main.c	Wed Dec  5 14:20:26 2018	(r341577)
+++ head/sys/dev/mlx5/mlx5_en/mlx5_en_main.c	Wed Dec  5 14:20:57 2018	(r341578)
@@ -1,5 +1,5 @@
 /*-
- * Copyright (c) 2015 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2015-2018 Mellanox Technologies. All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
@@ -1184,37 +1184,6 @@ mlx5e_create_sq(struct mlx5e_channel *c,
 	sq->min_inline_mode = priv->params.tx_min_inline_mode;
 	sq->vlan_inline_cap = MLX5_CAP_ETH(mdev, wqe_vlan_insert);

-	/* check if we should allocate a second packet buffer */
-	if (priv->params_ethtool.tx_bufring_disable == 0) {
-		sq->br = buf_ring_alloc(MLX5E_SQ_TX_QUEUE_SIZE, M_MLX5EN,
-		    M_WAITOK, &sq->lock);
-		if (sq->br == NULL) {
-			if_printf(c->ifp, "%s: Failed allocating sq drbr buffer\n",
-			    __func__);
-			err = -ENOMEM;
-			goto err_free_sq_db;
-		}
-
-		sq->sq_tq = task