RE: Tuning suggestions for high-core-count Linux servers

2017-06-01 Thread Browne, Stuart
Cheers Matthew.

1)  Not seeing that error, seeing this one instead:

01-Jun-2017 01:46:27.952 client: warning: client 192.168.0.23#38125 
(x41fe848-f3d1-4eec-967e-039d075ee864.perf1000): error sending response: would 
block

Only seeing a few of them per run (out of ~70 million requests).

Whilst I can see where this is raised in the BIND code (lib/isc/unix/socket.c 
in doio_send), I don't understand the underlying reason for it being set (errno 
== EWOULDBLOCK || errno == EAGAIN).

I've not bumped wmem/rmem up as much as the link (only to 16MB, not 40MB), but 
no real difference after the tweaks. I did another run with stupidly-large 
core.{rmem,wmem}_{max,default} (64MB); this actually degraded performance a bit, 
so over-tuning isn't good either. Need to figure out a good balance here.

I'd love to figure out what the math here should be.  'X number of simultaneous 
connections multiplied by Y socket memory size = rmem' or some such.
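
For what it's worth, a purely illustrative back-of-envelope sketch (the ~2KB
per-packet figure is an assumption about kernel skb overhead, not something
measured or documented):

--
# rough sizing sketch: packets you expect to queue during a worst-case burst,
# multiplied by an assumed per-packet kernel buffer cost (skb truesize)
BURST_PKTS=8192        # assumed worst-case burst of queued DNS queries
BYTES_PER_PKT=2048     # assumed kernel-side cost per small UDP datagram
echo $(( BURST_PKTS * BYTES_PER_PKT ))   # 16777216, i.e. the 16MB tried above
--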

2) I am still seeing some udp receive errors and receive buffer errors; about 
1.3% of received packets.

From a 'netstat' point of view, I see:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address   Foreign Address State
udp   382976  17664 192.168.1.21:53 0.0.0.0:*

The numbers in the receive queue stay in the 200-300k range whilst the 
send-queue floats around the 20-40k range. wmem already bumped.
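
One way to watch that queue against the configured limits (a sketch; it assumes
a reasonably recent iproute2, and the 'sport = :53' filter plus the -m skmem
output can differ by version):

--
# show the port-53 UDP socket with its socket-memory counters; in the skmem
# field, r/rb are the receive queue and its limit, t/tb the send side
watch -n 1 "ss -u -a -n -m 'sport = :53'"
--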

3) Huh, didn't know about this one. Bumped up the backlog; small increase in 
throughput for my tests. Still need to figure out how to read softnet_stat. More 
google-fu in my future.

After a reboot and the wmem/rmem/backlog increases, no longer any non-zero in 
the 2nd column.
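
For reference, a quick way to pull that 2nd column out per CPU (a sketch; it
assumes gawk for strtonum, which is what EL7 ships as awk):

--
# /proc/net/softnet_stat: one row of hex fields per CPU; field 2 is drops
awk '{ printf "cpu%-2d dropped=%d\n", NR-1, strtonum("0x" $2) }' /proc/net/softnet_stat
--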

4) Yes, max_dgram_qlen is already set to 512.

5) Oo! new tool! :)

--
...
11 drops at location 0x815df171
854 drops at location 0x815e1c64
12 drops at location 0x815df171
822 drops at location 0x815e1c64
...
--

I'm pretty sure it's just showing more detail than 'netstat -u -s' does. More 
google-fu to figure out how to use that information for good rather than, well, 
frustration? :)

Will keep spinning up tests, but using smaller increments to the wmem/rmem 
values, to see if I can eke anything more than 360k out of it.

Thanks for your suggestions Matthew!

Stuart


-Original Message-
From: Mathew Ian Eis [mailto:mathew@nau.edu] 
Sent: Thursday, 1 June 2017 10:30 AM
To: bind-users@lists.isc.org
Cc: Browne, Stuart
Subject: [EXTERNAL] Re: Tuning suggestions for high-core-count Linux servers

360k qps is actually quite good… the best I have heard of until now on EL was 
180k [1]. There, it was recommended to manually tune the number of subthreads 
with the -U parameter.



Since you’ve mentioned rmem/wmem changes, specifically you want to:



1. check for send buffer overflow; as indicated in named logs:

31-Mar-2017 12:30:55.521 client: warning: client 10.0.0.5#51342 (test.com): 
error sending response: unset



fix: increase rmem via sysctl:

net.core.rmem_max

net.core.rmem_default



2. check for receive buffer overflow; as indicated by netstat:

# netstat -u -s

Udp:

34772479 packet receive errors



fix: increase wmem and backlog via sysctl:

net.core.wmem_max

net.core.wmem_default
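
By way of illustration only (the 16MB value here is just an example ceiling, not
a recommendation; persist whatever you settle on under /etc/sysctl.d/):

--
# example only - pick values based on your own burst testing
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.wmem_default=16777216
--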



… and other ideas:



3. check 2nd column in /proc/net/softnet_stat for any non-zero numbers 
(indicating dropped packets).

If any are non-zero, increase net.core.netdev_max_backlog



4. You may also want to increase net.unix.max_dgram_qlen (although since EL7 
defaults this to 512, it is not much of an issue - just double-check that it is 
512).
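
A quick way to confirm the current value (sketch):

--
sysctl net.unix.max_dgram_qlen             # check; expect 512 on stock EL7
# sysctl -w net.unix.max_dgram_qlen=512    # set it, if it isn't already
--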



5. Try running dropwatch to see where packets are being lost. If it shows 
nothing then you need to look outside the system. If it shows something you may 
have a hint where to tune next.
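
For example, a sketch of the interactive flow (dropwatch relies on the kernel's
drop-monitor support being available):

--
# -l kas resolves drop locations to kernel symbol names
dropwatch -l kas
#   dropwatch> start     <- begin monitoring
#   ... run the load ...
#   dropwatch> stop      <- stop and read the "N drops at <symbol>+<offset>" lines
--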



Please post your outcomes in any case, since you are already having some 
excellent results.



[1] https://lists.dns-oarc.net/pipermail/dns-operations/2014-April/011543.html


Regards,



Mathew Eis

Northern Arizona University

Information Technology Services



-Original Message-

From: bind-users  on behalf of "Browne, 
Stuart" 

Date: Wednesday, May 31, 2017 at 12:25 AM

To: "bind-users@lists.isc.org" 

Subject: Tuning suggestions for high-core-count Linux servers



Hi,



I've been able to get my hands on some rather nice servers with 2 x 12 core 
Intel CPU's and was wondering if anybody had any decent tuning tips to get BIND 
to respond at a faster rate.



I'm seeing that adding CPU cores beyond a single die gets pretty much no 
real improvement. I understand the NUMA boundaries etc., but this hasn't been 
my experience on previous iterations of the Intel CPUs, at least not this 
dramatically. When I use more than a single die, CPU utilization continues to 
match the core count; however, throughput doesn't increase to match.



All the testing I've been doing so far (dnsperf from multiple sources) seems to 
be plateauing

Re: Tuning suggestions for high-core-count Linux servers

2017-06-01 Thread Plhu

  Hello Stuart,
a few simple ideas to your tests:
 - have you inspected the per-thread CPU? Aren't some of the threads overloaded?
 - have you tried to get the statistics from the Bind server using the
 XML or JSON interface? It may give you another insight into the errors.
 - I may have missed the connection count you use for testing - can you
 post it? Also, how many entries do you have in your database? Can you
 share your named.conf (without any compromising entries)?
 - what is your network environment? How many switches/routers are there
 between your simulator and the Bind server host?
 - is Bind the only running process on the tested server?
 - what CPUs is the Bind server being run on?
 - is numad running, and when trying taskset, have you
 selected the CPUs on the same processor? What does numastat show during
 the test?
 - how many UDP sockets are in use during your test?

Curious for the responses.

  Lukas

Browne, Stuart  writes:

> Cheers Matthew.
>
> 1)  Not seeing that error, seeing this one instead:
>
> 01-Jun-2017 01:46:27.952 client: warning: client 192.168.0.23#38125 
> (x41fe848-f3d1-4eec-967e-039d075ee864.perf1000): error sending response: 
> would block
>
> Only seeing a few of them per run (out of ~70 million requests).
>
> Whilst I can see where this is raised in the BIND code (lib/isc/unix/socket.c 
> in doio_send), I don't understand the underlying reason for it being set 
> (errno == EWOULDBLOCK || errno == EAGAIN).
>
> I've not bumped wmem/rmem up as much as the link (only to 16MB, not 40MB), 
> but no real difference after tweaks. I did another run with stupidly-large 
> core.{rmem,wmem}_{max,default} (64MB), this actually degraded performance a 
> bit so over tuning isn't good either. Need to figure out a good balance here.
>
> I'd love to figure out what the math here should be.  'X number of 
> simultaneous connections multiplied by Y socket memory size = rmem' or some 
> such.
>
> 2) I am still seeing some udp receive errors and receive buffer errors; about 
> 1.3% of received packets.
>
> From a 'netstat' point of view, I see:
>
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address   Foreign Address State
> udp   382976  17664 192.168.1.21:53 0.0.0.0:*
>
> The numbers in the receive queue stay in the 200-300k range whilst the 
> send-queue floats around the 20-40k range. wmem already bumped.
>
> 3) Huh, didn't know about this one. Bumped up the backlog, small increase in 
> throughput for my tests. Still need to figure out how to read softnet_stat. 
> More google-fu in my future.
>
> After a reboot and the wmem/rmem/backlog increases, no longer any non-zero in 
> the 2nd column.
>
> 4) Yes, max_dgram_qlen is already set to 512.
>
> 5) Oo! new tool! :)
>
> --
> ...
> 11 drops at location 0x815df171
> 854 drops at location 0x815e1c64
> 12 drops at location 0x815df171
> 822 drops at location 0x815e1c64
> ...


Stop Reverse resolution query Logging

2017-06-01 Thread Job
Dear guys,

is there a way in Bind 9 to stop logging (to the standard bind.log file) all the 
in-addr.arpa queries?
We would like to log everything else, but not the reverse resolution queries.

Thank you!
F


Re: Stop Reverse resolution query Logging

2017-06-01 Thread Rick Dicaire
Tried empty-zones-enable yes; in named.conf?

On Thu, Jun 1, 2017 at 10:28 AM, Job  wrote:
> Dear guys,
>
> is there a way in Bind 9 to stop logging (to bind.log standard file) all the 
> in-addr.arpa queries?
> We would like to log everything else but not the reverse resolution queries.
>
> Thank you!
> F


Re: Tuning suggestions for high-core-count Linux servers

2017-06-01 Thread Mathew Ian Eis
Howdy Stuart,

>  Re: net.core.rmem - I'd love to figure out what the math here should be. 'X 
> number of simultaneous connections multiplied by Y socket memory size = rmem' 
> or some such.

Basically the math here is “large enough that you can queue up the 9X.XXXth 
percentile of traffic bursts without dropping them, but not so large that you 
waste processing time fiddling with the queue”. Since that percentile varies 
widely across environments it’s not easy to provide a specific formula. And on 
that note:

> Will keep spinning test but using smaller increments to the wmem/rmem values

Tightening is nice for finding some theoretical limits but in practice not so 
much. Be careful about making them too tight, lest under your “bursty” 
production loads you drop all sorts of queries without intending to.

> Re: dropwatch - Oo! new tool! More google-fu to figure out how to use that 
> information for good

dropwatch is an easy indicator of whether the throughput issue is on or off the 
system. Seeing packets being dropped in the system combined with apparently low 
CPU usage suggests you might be able to increase throughput. `dropwatch -l kas` 
should tell you the methods that are dropping the packets, which can help you 
understand where in the kernel they are being dropped and why. For anything 
beyond that, I expect your Google-fu is as good as mine ;-)

If your CPU utilization is still apparently low, you might be onto something 
with taskset/numa… Related things I have toyed with but don’t currently have in 
production:

increasing kernel.sched_migration_cost a couple of orders of magnitude
setting kernel.sched_autogroup_enabled=0
systemctl stop irqbalance
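
Concretely, something along these lines (illustrative values only; on EL7-era
kernels the first sysctl appears as kernel.sched_migration_cost_ns, defaulting
to 500000, i.e. 0.5ms):

--
# "a couple of orders of magnitude" above the 500000ns default
sysctl -w kernel.sched_migration_cost_ns=50000000
sysctl -w kernel.sched_autogroup_enabled=0
systemctl stop irqbalance
--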

Lastly (mostly for posterity for the list, please don’t take this as “rtfm” if 
you’ve seen them already) here are some very useful in-depth (but generalized) 
performance tuning guides:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Performance_Tuning_Guide/
https://access.redhat.com/sites/default/files/attachments/201501-perf-brief-low-latency-tuning-rhel7-v1.1.pdf

… and for one last really crazy idea, you could try running a pair of named 
instances on the machine and fronting them with nginx’s supposedly scalable UDP 
load balancer. (As long as you don’t get a performance hit, it also opens up 
other interesting possibilities like being able to shift production load for 
maintenance on the named backends).

Best of luck! Let us know where you cap out!

Regards,

Mathew Eis
Northern Arizona University
Information Technology Services

-Original Message-
From: "Browne, Stuart" 
Date: Thursday, June 1, 2017 at 12:27 AM
To: Mathew Ian Eis , "bind-users@lists.isc.org" 

Subject: RE: Tuning suggestions for high-core-count Linux servers

Cheers Matthew.

1)  Not seeing that error, seeing this one instead:

01-Jun-2017 01:46:27.952 client: warning: client 192.168.0.23#38125 
(x41fe848-f3d1-4eec-967e-039d075ee864.perf1000): error sending response: would 
block

Only seeing a few of them per run (out of ~70 million requests).

Whilst I can see where this is raised in the BIND code 
(lib/isc/unix/socket.c in doio_send), I don't understand the underlying reason 
for it being set (errno == EWOULDBLOCK || errno == EAGAIN).

I've not bumped wmem/rmem up as much as the link (only to 16MB, not 40MB), 
but no real difference after tweaks. I did another run with stupidly-large 
core.{rmem,wmem}_{max,default} (64MB), this actually degraded performance a bit 
so over tuning isn't good either. Need to figure out a good balance here.

I'd love to figure out what the math here should be.  'X number of 
simultaneous connections multiplied by Y socket memory size = rmem' or some 
such.

2) I am still seeing some udp receive errors and receive buffer errors; 
about 1.3% of received packets.

From a 'netstat' point of view, I see:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address   Foreign Address State
udp   382976  17664 192.168.1.21:53 0.0.0.0:*

The numbers in the receive queue stay in the 200-300k range whilst the 
send-queue floats around the 20-40k range. wmem already bumped.

3) Huh, didn't know about this one. Bumped up the backlog, small increase 
in throughput for my tests. Still need to figure out how to read softnet_stat. 
More google-fu in my future.

After a reboot and the wmem/rmem/backlog increases, no longer any non-zero 
in the 2nd column.

4) Yes, max_dgram_qlen is already set to 512.

5) Oo! new tool! :)

--
...
11 drops at location 0x815df171
854 drops at location 0x815e1c64
12 drops at location 0x815df171
822 drops at location 0x815e1c64
...
--

I'm pretty sure it's just showing more details of the 'netstat -u -s'. More 
google-fu to figure out

Re: Stop Reverse resolution query Logging

2017-06-01 Thread Mark Andrews

In message 
<88ef58f000ec4b4684700c2aa3a73d7a08180abd2...@w2008dc01.colliniconsulting.lan>, 
Job writes:
> Dear guys,
> 
> is there a way in Bind 9 to stop logging (to bind.log standard file) all the 
> in-addr.arpa queries?
> We would like to log everything else but not the reverse resolution queries.

No.
 
> Thank you!
> F
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org


RE: Stop Reverse resolution query Logging

2017-06-01 Thread Darcy Kevin (FCA)
BIND has no way of differentiating these queries, since reverse-lookup queries 
aren't "special".

But certainly, if you syslog rather than writing directly to a file, there are 
syslog daemons that can filter based on regexes and the like.


- Kevin

-Original Message-
From: bind-users [mailto:bind-users-boun...@lists.isc.org] On Behalf Of Job
Sent: Thursday, June 01, 2017 10:28 AM
To: bind-users@lists.isc.org
Subject: Stop Reverse resolution query Logging

Dear guys,

is there a way in Bind 9 to stop logging (to bind.log standard file) all the 
in-addr.arpa queries?
We would like to log everything else but not the reverse resolution queries.

Thank you!
F


RE: Tuning suggestions for high-core-count Linux servers

2017-06-01 Thread Browne, Stuart


> -Original Message-
> From: Mathew Ian Eis [mailto:mathew@nau.edu]
>

> 
> Basically the math here is “large enough that you can queue up the
> 9X.XXXth percentile of traffic bursts without dropping them, but not so
> large that you waste processing time fiddling with the queue”. Since that
> percentile varies widely across environments it’s not easy to provide a
> specific formula. And on that note:

Yup. Experimentation seems to be the name of the day.

> > Will keep spinning test but using smaller increments to the wmem/rmem
> > values
>
> Tightening is nice for finding some theoretical limits but in practice
> not so much. Be careful about making them too tight, lest under your
> “bursty” production loads you drop all sorts of queries without intending
> to.

Yup.

> dropwatch is an easy indicator of whether the throughput issue is on or
> off the system. Seeing packets being dropped in the system combined with
> apparently low CPU usage suggests you might be able to increase
> throughput. `dropwatch -l kas` should tell you the methods that are
> dropping the packets, which can help you understand where in the kernel
> they are being dropped and why. For anything beyond that, I expect your
> Google-fu is as good as mine ;-)

I like the '-l kas' output:

830 drops at udp_queue_rcv_skb+374 (0x815e1c64)
 15 drops at __udp_queue_rcv_skb+91 (0x815df171)

Well and truly buried in the code.

https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/#udpqueuercvskb

This seems like a nice explanation as to what's going on. Still reading through 
it all.
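
One way to watch the related counters while a test runs (a sketch; the counter
names are as iproute2's nstat reports them from /proc/net/snmp):

--
# prints deltas since the previous nstat run; -z keeps zero counters visible
nstat -z UdpInErrors UdpRcvbufErrors UdpSndbufErrors
--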


> If your CPU utilization is still apparently low, you might be onto
> something with taskset/numa… Related things I have toyed with but don’t
> currently have in production:
> 
> increasing kernel.sched_migration_cost a couple of orders of magnitude
> 
> setting kernel.sched_autogroup_enabled=0
> 
> systemctl stop irqbalance

I've had irqbalance stopped previously, and sched_autogroup_enabled is already 
set to 0. Initial mucking about a bit with sched_migration_cost gets a few more 
QPS through, so will run more tests.

Thanks for this one, hadn't used it before.

> Lastly (mostly for posterity for the list, please don’t take this as
> “rtfm” if you’ve seen them already) here are some very useful in-depth
> (but generalized) performance tuning guides:

Will give them a read. I do like manuals :P



> … and for one last really crazy idea, you could try running a pair of
> named instances on the machine and fronting them with nginx’s supposedly
> scalable UDP load balancer. (As long as you don’t get a performance hit,
> it also opens up other interesting possibilities like being able to shift
> production load for maintenance on the named backends).

Yeah, I've had this thought.

I'm fairly sure I've reached the limit of what BIND can do on a single NUMA 
node for the moment.

I will report back if any great inspiration or successful increases in 
throughput occur.

Stuart

RE: [EXTERNAL] Re: Tuning suggestions for high-core-count Linux servers

2017-06-01 Thread Browne, Stuart


> -Original Message-
> From: Plhu [mailto:p...@seznam.cz]


> a few simple ideas to your tests:
>  - have you inspected the per-thread CPU? Aren't some of the threads
> overloaded?

I've tested both the auto-calculated value (one thread per available core) and 
explicit overrides. NUMA boundaries seem to be where things get wonky.

>  - have you tried to get the statistics from the Bind server using the
>  XML or JSON interface? It may bring you another insight to the errors.


>  - I may have missed the connection count you use for testing - can you
>  post it? More, how may entries do you have in your database? Can you
>  share your named.conf (without any compromising entries)?

I'm testing to flood, so approximately 5 x 400 client count (dnsperf) with a 
500 query backlog per test instance.

Theoretically this should mean up to 4,500 active or backlogged connections (or 
just 2,500 if I've read that documentation wrong).
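
For the record, each instance's invocation looks roughly like this (a
reconstruction; 'queries.txt' and the 60-second run length are placeholders,
and the target address is the one from my earlier netstat output):

--
# one of the five dnsperf instances: 400 clients, 500 outstanding queries
dnsperf -s 192.168.1.21 -d queries.txt -c 400 -q 500 -l 60
--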

>  - what is your network environment? How many switches/routers are there
>  between your simulator and the Bind server host?

This is a very closed environment: server-switch-server, all 10Gbit or 25Gbit. 
I verified the switch stats today; it's capable of 10x what I'm pushing through 
it currently.

>  - is Bind the only running process on the tested server?

As always, there's the rest of the OS helper stuff, but BIND is the only thing 
actively doing anything (beyond the monitoring I'm doing). So no, nothing else 
is drawing massive amounts of either CPU or network resources.

>  - what CPUs is the Bind server being run on?

From procinfo:
Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

2 of them.


>  - is there numad running and while trying the taskset, have you
>  selected the CPUs on the same processor? What does numastat show during
>  the test?

I was manually issuing taskset after confirming the CPU allocations:

taskset -c 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,46,47 
/usr/sbin/named -u named -n 24 -f

This is all of the cores (including HT) on the 2nd socket. There was almost no 
performance difference between 12 (just the actual cores, no HTs) and 24 (with 
the HTs).
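
An equivalent approach I could try with numactl (a sketch; it assumes those
CPUs all live on NUMA node 1, and it additionally pins memory allocation to
that node):

--
# bind both CPU scheduling and memory allocation to NUMA node 1
numactl --cpunodebind=1 --membind=1 /usr/sbin/named -u named -n 24 -f
--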

>  - how many UDP sockets are in use during your test?

See above.

> 
> Curious for the responses.
> 
>   Lukas

Stuart