One more interesting point.

It seems the code at the start of the udp_master() function in
knot-dns/src/knot/server/udp-handler.c

        unsigned cpu = dt_online_cpus();
        if (cpu > 1) {
                unsigned cpu_mask = (dt_get_id(thread) % cpu);
                dt_setaffinity(thread, &cpu_mask, 1);
        }

has no effect at all on the Elbrus architecture.

When I instead bind threads to CPUs with the shell "taskset" command, I
see no empty cores.
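
As far as I can tell, dt_setaffinity() builds a cpu_set_t from the given
CPU id and calls pthread_setaffinity_np(), so despite its name the
cpu_mask variable above is a CPU index, not a bitmask. A quick way to see
what affinity the kernel actually reports per thread (a sketch, assuming
a single running knotd process and a standard procfs):

        # Print the CPU list each knotd thread may run on; if the
        # affinity call worked, each thread should show a single CPU.
        for tid in /proc/$(pidof knotd)/task/*; do
                echo -n "$(basename $tid): "
                grep Cpus_allowed_list "$tid/status"
        done

If every thread still shows the full 0-55 range, the
pthread_setaffinity_np() call is probably failing or being silently
ignored on this kernel.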

On Thu, 2019-04-04 at 12:19 +0300, Sergey Petrov wrote:
> Let's start step by step.
> 
> On Wed, 2019-04-03 at 16:56 +0200, Petr Špaček wrote:
> > Hello,
> > 
> > as you already found out it is complicated ;-)
> > 
> > The Linux kernel has its own magic algorithms to schedule work on
> > multi-core/multi-socket/NUMA machines, and DNS benchmarking also very
> > much depends on the network card, its drivers, etc.
> > 
> > If we were going to fine-tune your setup we would have to go into details:
> > 
> > What is your CPU architecture? Number of sockets, CPUs in them, etc.?
> 
> It's the Russian Elbrus CPU; I have only a little info about its
> architecture. A 4-socket motherboard with an 8-core CPU in each socket.
> 
> elbrus01 ~/src/dnsperf # uname -a
> Linux elbrus01 4.9.0-2.2-e8c #1 SMP Mon Nov 12 10:52:48 GMT 2018 e2k E8C
> E8C-SWTX GNU/Linux
> elbrus01 ~/src/dnsperf # cat /etc/mcst_version 
> 4.0-rc2
> elbrus01 ~/src/dnsperf # 
> 
> > How is the operating memory connected to the CPUs?
> > Is it NUMA?
> 
> I think it is NUMA. I can see some memory skew across the NUMA nodes:
> 
> elbrus01 ~/src/dnsperf # numactl --show
> policy: default
> preferred node: current
> physcpubind: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 32 33 34 35 36 37
> 38 39 48 49 50 51 52 53 54 55 
> cpubind: 0 1 2 3 
> nodebind: 0 1 2 3 
> membind: 0 1 2 3 
> elbrus01 ~/src/dnsperf # numactl --hardware
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 64433 MB
> node 0 free: 63261 MB
> node 1 cpus: 16 17 18 19 20 21 22 23
> node 1 size: 64467 MB
> node 1 free: 62363 MB
> node 2 cpus: 32 33 34 35 36 37 38 39
> node 2 size: 64467 MB
> node 2 free: 63768 MB
> node 3 cpus: 48 49 50 51 52 53 54 55
> node 3 size: 64467 MB
> node 3 free: 63811 MB
> node distances:
> node   0   1   2   3 
>   0:  10  20  20  20 
>   1:  20  10  20  20 
>   2:  20  20  10  20 
>   3:  20  20  20  10 
> elbrus01 ~/src/dnsperf #
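> 
> Given the distance table, one experiment might be to force knotd onto a
> single NUMA node and compare throughput; a sketch (the config path is
> the usual default, adjust as needed):
> 
>         # Restrict knotd's CPUs and memory allocations to NUMA node 0.
>         numactl --cpunodebind=0 --membind=0 knotd -c /etc/knot/knot.conf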
> 
> 
> 
> > Do you have irqbalance enabled?
> 
> I tried to use irqbalance on the previous operating system, based on the
> 3.11 Linux kernel. The new OS, based on the 4.9 Linux kernel, has no
> irqbalance binary at all.
> 
> > Have you somehow configured IRQ affinity?
> > What is your network card (how many IO queues does it have)?
> 
> The network card is an Intel X540 10GbE PCI card. Its PCIe lanes are
> connected to the CPU0 socket. NSD performs best with all IRQs bound to
> CPU0, memory allocated on NUMA node 0, and workers bound to CPU0/CPU1.
> Knot performs best with IRQs spread across all 4 CPUs and workers using
> all CPUs.
> 
> > Did you configure network card queues and other driver settings explicitly?
> > etc.
> 
> I spread incoming UDP across all 32 RX queues:
> 
> elbrus01 ~/src/dnsperf # ethtool -N eth4 rx-flow-hash udp4 sdfn
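> 
> (For reference, "sdfn" hashes on the source/destination IP addresses and
> the source/destination UDP ports. If this ethtool build supports the
> query form, the current setting can be checked back with:
> 
>         ethtool -n eth4 rx-flow-hash udp4
> 
> though I have not verified this on the Elbrus box.)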
> 
> And pinned each eth4 IRQ to its own CPU core:
> 
> elbrus01 ~/src/dnsperf #  echo 00800000,00000000 > /proc/irq/120/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00400000,00000000 > /proc/irq/119/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00200000,00000000 > /proc/irq/118/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00100000,00000000 > /proc/irq/117/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00080000,00000000 > /proc/irq/116/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00040000,00000000 > /proc/irq/115/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00020000,00000000 > /proc/irq/114/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00010000,00000000 > /proc/irq/113/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000080,00000000 > /proc/irq/112/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000040,00000000 > /proc/irq/111/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000020,00000000 > /proc/irq/110/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000010,00000000 > /proc/irq/109/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000008,00000000 > /proc/irq/108/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000004,00000000 > /proc/irq/107/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000002,00000000 > /proc/irq/106/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000001,00000000 > /proc/irq/105/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00800000 > /proc/irq/104/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00400000 > /proc/irq/103/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00200000 > /proc/irq/102/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00100000 > /proc/irq/101/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00080000 > /proc/irq/100/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00040000 > /proc/irq/99/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00020000 > /proc/irq/98/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00010000 > /proc/irq/97/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000080 > /proc/irq/96/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000040 > /proc/irq/95/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000020 > /proc/irq/94/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000010 > /proc/irq/93/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000008 > /proc/irq/92/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000004 > /proc/irq/91/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000002 > /proc/irq/90/smp_affinity
> elbrus01 ~/src/dnsperf #  echo 00000001 > /proc/irq/89/smp_affinity
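> 
> The same binding can be written as a loop; a sketch, assuming the
> queue-to-IRQ mapping above (IRQs 89-120 for queues 0-31) and that this
> kernel exposes the smp_affinity_list interface:
> 
>         # Pin IRQ 89+q to the q-th queue's core: queues 0-7 -> node 0
>         # (CPUs 0-7), 8-15 -> node 1 (16-23), 16-23 -> node 2 (32-39),
>         # 24-31 -> node 3 (48-55).
>         for q in $(seq 0 31); do
>                 irq=$((89 + q))
>                 cpu=$(( (q / 8) * 16 + q % 8 ))
>                 echo $cpu > /proc/irq/$irq/smp_affinity_list
>         done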
> 
> The /proc/interrupts output is attached.
> 
> The load is started with:
> 
> [nikor@kaa5 dnsperf]$ ./dnsperf -s 10.0.0.4 -d out -n 20 -c 103 -T72  -t
> 500 -S 1 -q 1000 -D
> 
> Different runs show a random number of unused cores, from 4 down to 1.
> The requests-per-second figure changes accordingly: fewer unused cores
> means more performance.
> 
> > 
> > 
> > Fine-tuning always has to take into account your specific environment,
> > and it is hard to provide general advice.
> > 
> > If you find a specific reproducible problem, please report it to our Gitlab:
> > https://gitlab.labs.nic.cz/knot/knot-dns/issues/
> > 
> > Please understand that the amount of time and hardware we can allocate
> > for free support is limited. In case you require fine-tuning for your
> > specific deployment, please consider buying professional support:
> > https://www.knot-dns.cz/support/
> > 
> 
> It is a reproducible case, but in a very specific environment. The
> overall result may be good enough for production usage. In that case
> professional support would be a good option.
> 
> I am personally interested in this strange behavior, and I hope the case
> may be useful for you too. I will forward this thread to the OS
> developers; maybe they can clear up this issue.
> 
> Thank you for your attention.
> 
> > Thank you for understanding.
> > Petr Špaček  @  CZ.NIC
> > 
> > 
> > On 03. 04. 19 16:37, Sergey Petrov wrote:
> > > I reversed the client and the server, so the server is now a 36-core
> > > Intel box (72 HT cores).
> > > 
> > > Starting with small loads, I see Knot use the lower cores, except
> > > core 0. When adding more load, I see cores 0-17 and 37-54 used, but
> > > not at the 100% level. At maximum load I see all cores at about 100%.
> > > 
> > > It seems to me to be a system scheduler feature: first it starts with
> > > the lower-numbered cores, then adds cores from the second CPU socket,
> > > and finally the HT cores.
> > > 
> > > On Wed, 2019-04-03 at 12:53 +0300, Sergey Petrov wrote:
> > >> On Wed, 2019-04-03 at 10:52 +0200, Petr Špaček wrote:
> > >>> On 03. 04. 19 10:45, Sergey Petrov wrote:
> > >>>> I am running benchmarks with knot-dns as an authoritative server and
> > >>>> dnsperf as a workload client. The Knot server has 32 cores. Interrupts
> > >>>> from the 10Gb network card are spread across all 32 cores. Knot is
> > >>>> configured with 64 udp-workers, and each Knot thread is assigned to
> > >>>> one core, so there are at least two Knot threads per core. Then I
> > >>>> start dnsperf with the command
> > >>>>
> > >>>> ./dnsperf -s 10.0.0.4 -d out -n 20 -c 103 -T 64  -t 500 -S 1 -q 1000 -D
> > >>>>
> > >>>> htop on the Knot server shows 3-4 cores completely unused. When I
> > >>>> restart dnsperf, the unused cores change.
> > >>>>
> > >>>> What is the reason for the unused cores?
> > >>>
> > >>> Well, sometimes dnsperf is too slow :-)
> > >>>
> > >>> I recommend checking this:
> > >>> - Make sure dnsperf (the "source machine") is not 100 % utilized.
> > >>> - Try to increase the number of sockets used by dnsperf, i.e. the -c
> > >>> parameter. I would also try values like 500 and 1000 to see if it
> > >>> makes any difference. It might change the results significantly,
> > >>> because the Linux kernel uses hashes over some packet fields, and a
> > >>> low number of sockets might result in uneven query distribution.
> > >>>
> > >>> Please let us know what your new results are.
> > >>>
> > >>
> > >> The source machine is about 15% utilized.
> > >>
> > >> ./dnsperf -s 10.0.0.4 -d out -n 20 -c 512 -T 512  -t 500 -S 1 -q 1000 -D
> > >>
> > >> gives us a performance penalty (260000 rps vs. 310000 rps) and a more
> > >> even distribution across all cores, with 100% usage of all eight cores
> > >> on the last CPU socket, while the other CPU sockets' cores are
> > >> approximately 60% loaded.
> > >>
> > >> Using "-c 1000 -T 1000" parameters of dnsperf i see practicaly the same
> > >> core load distribution and even more performance penalty.
> > >>
> > >> Using "-c 16 -T 16" parameters i see 14 0% utilized cores, 16 100%
> > >> utilized cores and 2 50% utilized cores with about 300000 rps
> > >>
> > >> The question is: what prevents a Knot thread on a 0%-utilized core
> > >> from serving a packet that arrived via an IRQ bound to another core?
> > >> Maybe you have some developer guide that can answer this question?
> > >>
> > 
> 


-- 
https://lists.nic.cz/cgi-bin/mailman/listinfo/knot-dns-users
