Re: Tuning suggestions for high-core-count Linux servers

2017-06-02 Thread Paul Kosinski
It's been some years now, but I once worked on code for a high-throughput
network server (not BIND). We found that on multi-socket NUMA machines we
could have similar contention problems, and it was quite important to make
sure that threads which needed access to the same memory areas weren't
split across sockets. Luckily, the various services being run were
sufficiently separate that we could assign the service processes to
different sockets and avoid a lot of contention.

With BIND, it's basically all one service, so this is not directly
possible. 

It might be possible, however, to run two (or more) *separate*
instances of BIND and do some strictly internal routing of the IP
traffic to those separate instances, or even to have separate NICs
feeding the separate processes. In other words, have several BIND
servers in one chassis, each with its own NUMA memory area.
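
To make that concrete: a minimal sketch (config file names, paths and
addresses below are hypothetical) would be two named instances, each with
its own config and listen address, pinned to a node with numactl:

# instance A on NUMA node 0, listening on its own address
numactl --cpunodebind=0 --membind=0 \
    /usr/sbin/named -c /etc/named-node0.conf    # listen-on { 192.0.2.10; };

# instance B on NUMA node 1
numactl --cpunodebind=1 --membind=1 \
    /usr/sbin/named -c /etc/named-node1.conf    # listen-on { 192.0.2.11; };

The internal routing (or per-service addressing) then only has to steer
queries at the two addresses.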



On Fri, 2 Jun 2017 07:12:09 +
"Browne, Stuart"  wrote:

> Just some interesting investigation results. One of the URLs Mathew
> Ian Eis linked to mentioned a tool called 'perf'. For the hell of it,
> I gave it a shot.
> 
> Sure enough it tells some very interesting things.
> 
> When BIND was restricted to using a single NUMA node, the biggest
> call (to _raw_spin_lock) showed 7.05% overhead.
> 
> When BIND was allowed to use both NUMA nodes, the same call showed
> 49.74% overhead; an astonishing difference.
> 
> As it was running unrestricted, memory from both nodes was in use:
> 
> [root@kr20s2601 ~]# numastat -p 22441
> 
> Per-node process memory usage (in MBs) for PID 22441 (named)
>                  Node 0      Node 1       Total
>             ----------- ----------- -----------
> Huge               0.00        0.00        0.00
> Heap               0.45        0.12        0.57
> Stack              0.71        0.64        1.35
> Private            5.28     9415.30     9420.57
>             ----------- ----------- -----------
> Total              6.43     9416.07     9422.50
> 
> Given the numbers here, you wouldn't think it should make much of a
> difference.
> 
> Sadly, I didn't capture which CPU the UDP listener was attached to.
> 
> Anyway, what I've changed so far:
> 
> vm.swappiness = 0
> vm.dirty_ratio = 1
> vm.dirty_background_ratio = 1
> kernel.sched_min_granularity_ns = 1000
> kernel.sched_migration_cost_ns = 500
> 
> Query rate thus far reached (on 24 cores, numa node restricted): 426k qps
> Query rate thus far reached (on 48 cores, numa nodes unrestricted): 321k qps
> 
> Stuart
> 
> 'perf' data collected during a 3 minute test run:
> 
> [root@kr20s2601 ~]# ls -al perf.data*
> -rw---. 1 root root  717350012 Jun  2 08:36 perf.data.24
> -rw---. 1 root root 1366620296 Jun  2 08:53 perf.data.48
> 
> 'perf' top 5 (24 cores, numa restricted):
> 
> Overhead  Command  Shared Object       Symbol
>    7.05%  named    [kernel.kallsyms]   [k] _raw_spin_lock
>    6.96%  named    libpthread-2.17.so  [.] pthread_mutex_lock
>    3.84%  named    libc-2.17.so        [.] vfprintf
>    2.36%  named    libdns.so.165.0.7   [.] dns_name_fullcompare
>    2.02%  named    libisc.so.160.1.2   [.] isc_log_wouldlog
> 
> 'perf' top 5 (48 cores):
> 
> Overhead  Command  Shared Object       Symbol
>   49.74%  named    [kernel.kallsyms]   [k] _raw_spin_lock
>    4.52%  named    libpthread-2.17.so  [.] pthread_mutex_lock
>    3.09%  named    libisc.so.160.1.2   [.] isc_log_wouldlog
>    1.84%  named    [kernel.kallsyms]   [k] _raw_spin_lock_bh
>    1.56%  named    libc-2.17.so        [.] vfprintf


Re: Tuning suggestions for high-core-count Linux servers

2017-06-02 Thread Ray Bellis
On 01/06/2017 23:26, Mathew Ian Eis wrote:

> … and for one last really crazy idea, you could try running a pair of
> named instances on the machine and fronting them with nginx’s
> supposedly scalable UDP load balancer. (As long as you don’t get a
> performance hit, it also opens up other interesting possibilities
> like being able to shift production load for maintenance on the named
> backends).

It's relatively trivial to patch the BIND source to enable SO_REUSEPORT
on the more recent Linux kernels that support it (3.8+, ISTR?) so that
you can just start two BIND instances listening on the exact same ports
and the kernel will do the load balancing for you.
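
Once two instances are bound that way, a quick sanity check from the shell
(on a kernel new enough for SO_REUSEPORT) is to confirm that both named
processes show up against the same UDP port:

# both named PIDs should appear for the same *:53 socket
ss -ulpn 'sport = :53'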

For a NUMA system, make sure each instance is locked to one die, but
beware of NUMA bus transfers caused by incoming packet buffers being
handled by a kernel task running on one die but then delivered to a BIND
instance running on another.
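
As a rough illustration (interface name is hypothetical), sysfs reports
which node a NIC hangs off, so the matching instance can be kept on the
same node:

# -1 means the device reports no NUMA affinity
cat /sys/class/net/eth0/device/numa_node

# then pin the instance serving that NIC accordingly, e.g. node 0
numactl --cpunodebind=0 --membind=0 /usr/sbin/named -c /etc/named.conf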

In the meantime we're also looking at SO_REUSEPORT even for single
instance installations because it appears to offer an advantage over
letting multiple threads all fight over one shared file descriptor.

Ray

Re: Tuning suggestions for high-core-count Linux servers

2017-06-02 Thread Ray Bellis
On 02/06/2017 08:12, Browne, Stuart wrote:

> Query rate thus far reached (on 24 cores, numa node restricted): 426k qps
> Query rate thus far reached (on 48 cores, numa nodes unrestricted): 321k qps

In our internal Performance Lab I've achieved nearly 900 kqps on small
authoritative zones when we had hyperthreading enabled, and 700 kqps
without.

The lab uses Dell R430s running Fedora 23 with Intel X710 10GbE NICs,
each populated with a single Xeon E5-2680 v3 2.5 GHz 12-core CPU.

These systems have had *negligible* tuning applied - the vast majority
of the system settings changes I've made have been to improve the
repeatability of results, not the absolute performance.

The only major setting I've found which both helps performance and
improves consistency is to ensure that each NIC rx/tx queue IRQ is
assigned to a specific CPU core, with irqbalance disabled.
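
For example (interface name, IRQ number and CPU mask below are
placeholders; the real values come from /proc/interrupts on your box):

# stop irqbalance so it doesn't undo the manual assignment
systemctl stop irqbalance

# list the NIC's rx/tx queue IRQs
grep eth0 /proc/interrupts

# pin one queue's IRQ to one core, e.g. IRQ 61 -> CPU 2 (mask 0x4)
echo 4 > /proc/irq/61/smp_affinity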

This is with a _single_ dnsperf client, too.  The settings I use are
-c24 -q82 -T6 -x2048.   However I do use a tweaked version of dnsperf
which assigns each thread pair (it uses separate threads for rx and tx)
to its own core.
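
Written out as a full command line (server address, query file and run
length here are placeholders; the numeric options are the ones above):

dnsperf -s 192.0.2.53 -d queries.txt -l 180 -c24 -q82 -T6 -x2048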

You may find the presentation I made at the recent DNS-OARC workshop of
interest:



You didn't mention precisely which 9.10 series version you're running.
Note that versions prior to 9.10.4 defaulted to a -U value of ncores/2,
but investigation showed that on modern systems this was sub-optimal so
it was changed to ncores-1.  This makes a *very* big difference.
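
If upgrading isn't immediately possible, the counts can also be set
explicitly on the command line (the figures below are just an example for
a 24-core node; check the named man page for your build):

# 24 worker threads, 23 UDP listeners per interface
named -n 24 -U 23 -c /etc/named.conf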

kind regards,

Ray Bellis
ISC Research Fellow


Re: Tuning suggestions for high-core-count Linux servers

2017-06-02 Thread Phil Mayers

On 02/06/17 08:12, Browne, Stuart wrote:

> Just some interesting investigation results. One of the URLs Mathew
> Ian Eis linked to mentioned a tool called 'perf'. For the hell of it,
> I gave it a shot.


perf is super-powerful.

On a sufficiently recent kernel you can also do interesting things with 
the enhanced eBPF-based tracing - see:


http://www.brendangregg.com/ebpf.html

...but those are not going to be usable on an RHEL 7 kernel, I believe :o(
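
For anyone on a newer kernel, a bcc-based look at the same spin-lock
contention might be as simple as the following (tool paths vary by
distribution; this assumes the bcc tools are installed):

# count kernel spin-lock entries attributable to named
/usr/share/bcc/tools/funccount -p $(pidof named) '_raw_spin_lock*'

# or summarise where named sits blocked off-CPU for 30 seconds
/usr/share/bcc/tools/offcputime -p $(pidof named) 30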


Re: Stop Reverse resolution query Logging

2017-06-02 Thread /dev/rob0
On Thu, Jun 01, 2017 at 04:28:23PM +0200, Job wrote:
> is there a way in BIND 9 to stop logging (to the standard bind.log
> file) all the in-addr.arpa queries?

What "standard" is this?  The default logging for named goes to 
syslog, and from there it's up to your syslogd to decide if/where it 
should be written.

Perhaps what you want is a separate log channel for queries?  This is 
what I use:

logging {
    channel "default_log" {
        file "logs/named.log" versions unlimited size 4194304;
        severity dynamic;
        print-time yes;
        print-severity yes;
        print-category yes;
    };
    channel "query_log" {
        file "logs/query.log" versions 10 size 2097152;
        severity dynamic;
        print-time yes;
    };
    category "default" {
        "default_log";
    };
    category "queries" {
        "query_log";
    };
};

Those paths are relative to the "directory" which is set in your 
options{}.  Adjust to suit.

> We would like to log everything else but not the reverse
> resolution queries.

Why (and why not)?  What's the actual problem?  And what do you plan 
to do with all those query logs?  Query logging has a substantial 
impact on server performance.
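
If the concern is mostly that performance impact rather than the
in-addr.arpa entries specifically, query logging can also be toggled at
runtime (recent BIND versions accept an explicit on/off argument):

# turn query logging off without touching named.conf
rndc querylog off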
-- 
  http://rob0.nodns4.us/
  Offlist GMX mail is seen only if "/dev/rob0" is in the Subject:


Re: Tuning suggestions for high-core-count Linux servers

2017-06-02 Thread Browne, Stuart
Just some interesting investigation results. One of the URLs Mathew Ian Eis
linked to mentioned a tool called 'perf'. For the hell of it, I gave it a
shot.
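
For reference, a profile like this can be captured and browsed roughly as
follows (the options shown are illustrative, not necessarily what was used
for the runs below):

# sample the running named for the 3 minute test window
perf record -g -p $(pidof named) -o perf.data.24 -- sleep 180

# then inspect the hottest symbols
perf report -i perf.data.24 --sort symbol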

Sure enough, it tells some very interesting things.

When BIND was restricted to using a single NUMA node, the biggest call (to 
_raw_spin_lock) showed 7.05% overhead.

When BIND was allowed to use both NUMA nodes, the same call showed 49.74% 
overhead; an astonishing difference.

As it was running unrestricted, memory from both nodes was in use:

[root@kr20s2601 ~]# numastat -p 22441

Per-node process memory usage (in MBs) for PID 22441 (named)
                 Node 0      Node 1       Total
            ----------- ----------- -----------
Huge               0.00        0.00        0.00
Heap               0.45        0.12        0.57
Stack              0.71        0.64        1.35
Private            5.28     9415.30     9420.57
            ----------- ----------- -----------
Total              6.43     9416.07     9422.50

Given the numbers here, you wouldn't think it should make much of a difference.

Sadly, I didn't capture which CPU the UDP listener was attached to.

Anyway, what I've changed so far:

vm.swappiness = 0
vm.dirty_ratio = 1
vm.dirty_background_ratio = 1
kernel.sched_min_granularity_ns = 1000
kernel.sched_migration_cost_ns = 500
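
For reference, these are applied at runtime with sysctl -w and persisted
by putting the same key = value lines in a file under /etc/sysctl.d/ (the
file name is arbitrary):

sysctl -w vm.swappiness=0 vm.dirty_ratio=1 vm.dirty_background_ratio=1
sysctl -w kernel.sched_min_granularity_ns=1000
sysctl -w kernel.sched_migration_cost_ns=500

# reload persistent settings after editing /etc/sysctl.d/90-bind.conf
sysctl --system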

Query rate thus far reached (on 24 cores, numa node restricted): 426k qps
Query rate thus far reached (on 48 cores, numa nodes unrestricted): 321k qps

Stuart

'perf' data collected during a 3 minute test run:

[root@kr20s2601 ~]# ls -al perf.data*
-rw---. 1 root root  717350012 Jun  2 08:36 perf.data.24
-rw---. 1 root root 1366620296 Jun  2 08:53 perf.data.48

'perf' top 5 (24 cores, numa restricted):

Overhead  Command  Shared Object       Symbol
   7.05%  named    [kernel.kallsyms]   [k] _raw_spin_lock
   6.96%  named    libpthread-2.17.so  [.] pthread_mutex_lock
   3.84%  named    libc-2.17.so        [.] vfprintf
   2.36%  named    libdns.so.165.0.7   [.] dns_name_fullcompare
   2.02%  named    libisc.so.160.1.2   [.] isc_log_wouldlog

'perf' top 5 (48 cores):

Overhead  Command  Shared Object       Symbol
  49.74%  named    [kernel.kallsyms]   [k] _raw_spin_lock
   4.52%  named    libpthread-2.17.so  [.] pthread_mutex_lock
   3.09%  named    libisc.so.160.1.2   [.] isc_log_wouldlog
   1.84%  named    [kernel.kallsyms]   [k] _raw_spin_lock_bh
   1.56%  named    libc-2.17.so        [.] vfprintf