RE: Tuning suggestions for high-core-count Linux servers
Cheers Matthew.

1) Not seeing that error, seeing this one instead:

01-Jun-2017 01:46:27.952 client: warning: client 192.168.0.23#38125 (x41fe848-f3d1-4eec-967e-039d075ee864.perf1000): error sending response: would block

Only seeing a few of them per run (out of ~70 million requests).

Whilst I can see where this is raised in the BIND code (lib/isc/unix/socket.c in doio_send), I don't understand the underlying reason for it being set (errno == EWOULDBLOCK || errno == EAGAIN).

I've not bumped wmem/rmem up as much as the link suggests (only to 16MB, not 40MB), but saw no real difference after the tweaks. I did another run with stupidly large core.{rmem,wmem}_{max,default} (64MB); this actually degraded performance a bit, so over-tuning isn't good either. Need to find a good balance here.

I'd love to figure out what the maths here should be: 'X number of simultaneous connections multiplied by Y socket memory size = rmem', or some such.

2) I am still seeing some UDP receive errors and receive buffer errors; about 1.3% of received packets.

From a 'netstat' point of view, I see:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address       Foreign Address     State
udp   382976  17664 192.168.1.21:53     0.0.0.0:*

The numbers in the receive queue stay in the 200-300k range whilst the send queue floats around the 20-40k range. wmem already bumped.

3) Huh, didn't know about this one. Bumped up the backlog; small increase in throughput for my tests. Still need to figure out how to read softnet_stat. More google-fu in my future.

After a reboot and the wmem/rmem/backlog increases, no longer any non-zero values in the 2nd column.

4) Yes, max_dgram_qlen is already set to 512.

5) Oo! New tool! :)

--
...
11 drops at location 0x815df171
854 drops at location 0x815e1c64
12 drops at location 0x815df171
822 drops at location 0x815e1c64
...
--

I'm pretty sure it's just showing more details of the 'netstat -u -s'.
More google-fu to figure out how to use that information for good rather than, well, .. frustration? .. :)

Will keep spinning tests, but using smaller increments to the wmem/rmem values, to see if I can eke anything more than 360k out of it.

Thanks for your suggestions Matthew!

Stuart

-Original Message-
From: Mathew Ian Eis [mailto:mathew@nau.edu]
Sent: Thursday, 1 June 2017 10:30 AM
To: bind-users@lists.isc.org
Cc: Browne, Stuart
Subject: [EXTERNAL] Re: Tuning suggestions for high-core-count Linux servers

360k qps is actually quite good… the best I have heard of until now on EL was 180k [1]. There, it was recommended to manually tune the number of subthreads with the -U parameter.

Since you’ve mentioned rmem/wmem changes, specifically you want to:

1. check for send buffer overflow, as indicated in named logs:

31-Mar-2017 12:30:55.521 client: warning: client 10.0.0.5#51342 (test.com): error sending response: unset

fix: increase wmem via sysctl: net.core.wmem_max, net.core.wmem_default

2. check for receive buffer overflow, as indicated by netstat:

# netstat -u -s
Udp:
    34772479 packet receive errors

fix: increase rmem and the backlog via sysctl: net.core.rmem_max, net.core.rmem_default

… and other ideas:

3. check the 2nd column in /proc/net/softnet_stat for any non-zero numbers (indicating dropped packets). If any are non-zero, increase net.core.netdev_max_backlog.

4. You may also want to increase net.unix.max_dgram_qlen (although since EL7 defaults this to 512, this is not much of an issue - double-check that it is 512).

5. Try running dropwatch to see where packets are being lost. If it shows nothing, then you need to look outside the system. If it shows something, you may have a hint where to tune next.

Please post your outcomes in any case, since you are already having some excellent results.
[1] https://lists.dns-oarc.net/pipermail/dns-operations/2014-April/011543.html

Regards,

Mathew Eis
Northern Arizona University
Information Technology Services

-Original Message-
From: bind-users on behalf of "Browne, Stuart"
Date: Wednesday, May 31, 2017 at 12:25 AM
To: "bind-users@lists.isc.org"
Subject: Tuning suggestions for high-core-count Linux servers

Hi,

I've been able to get my hands on some rather nice servers with 2 x 12-core Intel CPUs and was wondering if anybody had any decent tuning tips to get BIND to respond at a faster rate.

I'm seeing that pretty much any CPU count beyond a single die doesn't get any real improvement. I understand the NUMA boundaries etc., but this hasn't been my experience on previous iterations of the Intel CPUs, at least not this dramatically. When I use more than a single die, CPU utilization continues to match the core count, however throughput doesn't increase to match.

All the testing I've been doing for now (dnsperf from multiple sources for now) seems to be plateau
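Pulling points 1-4 of the advice above together, the tunables involved might land in a sysctl fragment like the sketch below. The file name and the values are illustrative only (the 16MB figure echoes the experiments discussed in the thread, not a recommendation); as the thread itself shows, over-sizing these can hurt.

```
# /etc/sysctl.d/99-dns-tuning.conf - illustrative values, tune and measure
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_max = 16777216
net.core.wmem_default = 16777216
net.core.netdev_max_backlog = 10000
net.unix.max_dgram_qlen = 512
```

Apply with `sysctl --system` (or `sysctl -p <file>`) and re-run the benchmark between changes so each knob's effect can be seen in isolation.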
Re: Tuning suggestions for high-core-count Linux servers
Hello Stuart,

a few simple ideas for your tests:

- have you inspected the per-thread CPU? Aren't some of the threads overloaded?
- have you tried to get the statistics from the BIND server using the XML or JSON interface? It may give you another insight into the errors.
- I may have missed the connection count you use for testing - can you post it? Also, how many entries do you have in your database? Can you share your named.conf (without any compromising entries)?
- what is your network environment? How many switches/routers are there between your simulator and the BIND server host?
- is BIND the only running process on the tested server?
- what CPUs is the BIND server being run on?
- is numad running, and when trying taskset, have you selected CPUs on the same processor? What does numastat show during the test?
- how many UDP sockets are in use during your test?

Curious for the responses.

Lukas

Browne, Stuart writes:
> Cheers Matthew.
>
> 1) Not seeing that error, seeing this one instead:
>
> 01-Jun-2017 01:46:27.952 client: warning: client 192.168.0.23#38125
> (x41fe848-f3d1-4eec-967e-039d075ee864.perf1000): error sending response:
> would block
>
> Only seeing a few of them per run (out of ~70 million requests).
>
> Whilst I can see where this is raised in the BIND code (lib/isc/unix/socket.c
> in doio_send), I don't understand the underlying reason for it being set
> (errno == EWOULDBLOCK || errno == EAGAIN).
>
> I've not bumped wmem/rmem up as much as the link (only to 16MB, not 40MB),
> but no real difference after tweaks. I did another run with stupidly-large
> core.{rmem,wmem}_{max,default} (64MB), this actually degraded performance a
> bit so over tuning isn't good either. Need to figure out a good balance here.
>
> I'd love to figure out what the math here should be. 'X number of
> simultaneous connections multiplied by Y socket memory size = rmem' or some
> such.
>
> 2) I am still seeing some udp receive errors and receive buffer errors; about
> 1.3% of received packets.
>
> From a 'netstat' point of view, I see:
>
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address       Foreign Address     State
> udp   382976  17664 192.168.1.21:53     0.0.0.0:*
>
> The numbers in the receive queue stay in the 200-300k range whilst the
> send-queue floats around the 20-40k range. wmem already bumped.
>
> 3) Huh, didn't know about this one. Bumped up the backlog, small increase in
> throughput for my tests. Still need to figure out how to read softnet_stat.
> More google-fu in my future.
>
> After a reboot and the wmem/rmem/backlog increases, no longer any non-zero in
> the 2nd column.
>
> 4) Yes, max_dgram_qlen is already set to 512.
>
> 5) Oo! new tool! :)
>
> --
> ...
> 11 drops at location 0x815df171
> 854 drops at location 0x815e1c64
> 12 drops at location 0x815df171
> 822 drops at location 0x815e1c64
> ...

___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
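For the XML/JSON statistics interface Lukas mentions, BIND exposes counters via the statistics-channels statement. A minimal hedged named.conf fragment (the address, port, and ACL here are illustrative, not anything from the thread):

```
statistics-channels {
    // Listen locally only; widen the ACL deliberately if needed.
    inet 127.0.0.1 port 8080 allow { 127.0.0.1; };
};
```

Depending on the BIND 9 version, the statistics can then be fetched over HTTP from that port in XML or JSON form, which gives per-counter visibility (query types, socket errors, etc.) that the aggregate netstat numbers hide.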
Stop Reverse resolution query Logging
Dear guys,

is there a way in BIND 9 to stop logging (to the standard bind.log file) all the in-addr.arpa queries? We would like to log everything else, but not the reverse resolution queries.

Thank you!
F
Re: Stop Reverse resolution query Logging
Tried empty-zones-enable yes; in named.conf?

On Thu, Jun 1, 2017 at 10:28 AM, Job wrote:
> Dear guys,
>
> is there a way in Bind 9 to stop logging (to bind.log standard file) all the
> in-addr.arpa queries?
> We would like to log everything else but not the reverse resolution queries.
>
> Thank you!
> F
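For reference, the suggestion would look like this in the options block. Note the hedge: empty-zones-enable makes named answer RFC 1918 and similar reverse zones locally (and it is already the default in recent BIND 9 releases); as far as I know it does not by itself suppress query logging, so the queries may still appear in the log.

```
options {
    // Answer RFC 1918 and similar reverse zones locally
    // instead of sending them upstream.
    empty-zones-enable yes;
};
```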
Re: Tuning suggestions for high-core-count Linux servers
Howdy Stuart,

> Re: net.core.rmem - I'd love to figure out what the math here should be. 'X
> number of simultaneous connections multiplied by Y socket memory size = rmem'
> or some such.

Basically the math here is “large enough that you can queue up the 9X.XXXth percentile of traffic bursts without dropping them, but not so large that you waste processing time fiddling with the queue”. Since that percentile varies widely across environments, it’s not easy to provide a specific formula. And on that note:

> Will keep spinning test but using smaller increments to the wmem/rmem values

Tightening is nice for finding some theoretical limits, but in practice not so much. Be careful about making them too tight, lest under your “bursty” production loads you drop all sorts of queries without intending to.

> Re: dropwatch - Oo! new tool! More google-fu to figure out how to use that
> information for good

dropwatch is an easy indicator of whether the throughput issue is on or off the system. Seeing packets being dropped in the system combined with apparently low CPU usage suggests you might be able to increase throughput. `dropwatch -l kas` should tell you the methods that are dropping the packets, which can help you understand where in the kernel they are being dropped and why.
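No exact formula exists in the thread, but the 'X connections times Y bytes' intuition can be made concrete as a back-of-envelope heuristic. The sketch below is an assumption of mine, not something BIND or the kernel prescribes: the function name, the drain-time model, and the 2x factor for per-packet kernel (skb) bookkeeping are all illustrative.

```python
def suggest_rmem(burst_pps: int, avg_packet_bytes: int,
                 drain_seconds: float, overhead: float = 2.0) -> int:
    """Rough net.core.rmem estimate: enough bytes to absorb a traffic
    burst for `drain_seconds` while named catches up.

    `overhead` is a guessed multiplier for per-packet kernel (skb)
    bookkeeping beyond the raw payload bytes.
    """
    return int(burst_pps * avg_packet_bytes * overhead * drain_seconds)

# A 400k qps burst of ~100-byte queries that named takes 0.1 s to drain:
print(suggest_rmem(400_000, 100, 0.1))  # 8000000, i.e. roughly 8 MB
```

The point above still applies: measure the real burst percentile in your environment, then prefer the smallest buffer that stops the drops rather than the largest one that fits in RAM.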
For anything beyond that, I expect your Google-fu is as good as mine ;-)

If your CPU utilization is still apparently low, you might be onto something with taskset/numa… Related things I have toyed with but don’t currently have in production:

- increasing kernel.sched_migration_cost a couple of orders of magnitude
- setting kernel.sched_autogroup_enabled=0
- systemctl stop irqbalance

Lastly (mostly for posterity for the list; please don’t take this as “rtfm” if you’ve seen them already), here are some very useful, in-depth (but generalized) performance tuning guides:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Performance_Tuning_Guide/
https://access.redhat.com/sites/default/files/attachments/201501-perf-brief-low-latency-tuning-rhel7-v1.1.pdf

… and for one last really crazy idea, you could try running a pair of named instances on the machine and fronting them with nginx’s supposedly scalable UDP load balancer. (As long as you don’t get a performance hit, it also opens up other interesting possibilities, like being able to shift production load for maintenance on the named backends.)

Best of luck! Let us know where you cap out!

Regards,

Mathew Eis
Northern Arizona University
Information Technology Services

-Original Message-
From: "Browne, Stuart"
Date: Thursday, June 1, 2017 at 12:27 AM
To: Mathew Ian Eis , "bind-users@lists.isc.org"
Subject: RE: Tuning suggestions for high-core-count Linux servers

Cheers Matthew.

1) Not seeing that error, seeing this one instead:

01-Jun-2017 01:46:27.952 client: warning: client 192.168.0.23#38125 (x41fe848-f3d1-4eec-967e-039d075ee864.perf1000): error sending response: would block

Only seeing a few of them per run (out of ~70 million requests).

Whilst I can see where this is raised in the BIND code (lib/isc/unix/socket.c in doio_send), I don't understand the underlying reason for it being set (errno == EWOULDBLOCK || errno == EAGAIN).
I've not bumped wmem/rmem up as much as the link (only to 16MB, not 40MB), but no real difference after tweaks. I did another run with stupidly-large core.{rmem,wmem}_{max,default} (64MB), this actually degraded performance a bit so over tuning isn't good either. Need to figure out a good balance here.

I'd love to figure out what the math here should be. 'X number of simultaneous connections multiplied by Y socket memory size = rmem' or some such.

2) I am still seeing some udp receive errors and receive buffer errors; about 1.3% of received packets.

From a 'netstat' point of view, I see:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address       Foreign Address     State
udp   382976  17664 192.168.1.21:53     0.0.0.0:*

The numbers in the receive queue stay in the 200-300k range whilst the send-queue floats around the 20-40k range. wmem already bumped.

3) Huh, didn't know about this one. Bumped up the backlog, small increase in throughput for my tests. Still need to figure out how to read softnet_stat. More google-fu in my future.

After a reboot and the wmem/rmem/backlog increases, no longer any non-zero in the 2nd column.

4) Yes, max_dgram_qlen is already set to 512.

5) Oo! new tool! :)

--
...
11 drops at location 0x815df171
854 drops at location 0x815e1c64
12 drops at location 0x815df171
822 drops at location 0x815e1c64
...
--

I'm pretty sure it's just showing more details of the 'netstat -u -s'. More google-fu to figure out
Re: Stop Reverse resolution query Logging
In message <88ef58f000ec4b4684700c2aa3a73d7a08180abd2...@w2008dc01.colliniconsulting.lan>, Job writes:
> Dear guys,
>
> is there a way in Bind 9 to stop logging (to bind.log standard file) all the
> in-addr.arpa queries?
> We would like to log everything else but not the reverse resolution queries.

No.

> Thank you!
> F

--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742    INTERNET: ma...@isc.org
RE: Stop Reverse resolution query Logging
BIND has no way of differentiating these queries, since reverse-lookup queries aren't "special". But certainly, if you syslog rather than writing directly to a file, there are syslog daemons that can filter based on regexes and the like.

- Kevin

-Original Message-
From: bind-users [mailto:bind-users-boun...@lists.isc.org] On Behalf Of Job
Sent: Thursday, June 01, 2017 10:28 AM
To: bind-users@lists.isc.org
Subject: Stop Reverse resolution query Logging

Dear guys,

is there a way in Bind 9 to stop logging (to bind.log standard file) all the in-addr.arpa queries?
We would like to log everything else but not the reverse resolution queries.

Thank you!
F
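The syslog route could be sketched roughly as below. The channel name and rsyslog file paths are made up for illustration, and the message filter is deliberately coarse: it discards any named message mentioning in-addr.arpa, so it would also drop non-query log lines that happen to contain that string.

```
# named.conf: route query logging to syslog instead of a flat file
logging {
    channel query_syslog {
        syslog daemon;
        severity info;
    };
    category queries { query_syslog; };
};

# /etc/rsyslog.d/30-bind-queries.conf: drop reverse lookups, keep the rest
:msg, contains, "in-addr.arpa" stop
:programname, isequal, "named" /var/log/bind.log
```

With rsyslog, rule order matters: the `stop` line must come before the rule that writes the remaining named messages to bind.log.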
RE: Tuning suggestions for high-core-count Linux servers
> -Original Message-
> From: Mathew Ian Eis [mailto:mathew@nau.edu]
>
> Basically the math here is “large enough that you can queue up the
> 9X.XXXth percentile of traffic bursts without dropping them, but not so
> large that you waste processing time fiddling with the queue”. Since that
> percentile varies widely across environments it’s not easy to provide a
> specific formula. And on that note:

Yup. Experimentation seems to be the name of the day.

> > Will keep spinning test but using smaller increments to the wmem/rmem
> > values
>
> Tightening is nice for finding some theoretical limits but in practice
> not so much. Be careful about making them too tight, lest under your
> “bursty” production loads you drop all sorts of queries without intending
> to.

Yup.

> dropwatch is an easy indicator of whether the throughput issue is on or
> off the system. Seeing packets being dropped in the system combined with
> apparently low CPU usage suggests you might be able to increase
> throughput. `dropwatch -l kas` should tell you the methods that are
> dropping the packets, which can help you understand where in the kernel
> they are being dropped and why. For anything beyond that, I expect your
> Google-fu is as good as mine ;-)

Like the '-l kas':

830 drops at udp_queue_rcv_skb+374 (0x815e1c64)
15 drops at __udp_queue_rcv_skb+91 (0x815df171)

Well and truly buried in the code.

https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/#udpqueuercvskb

This seems like a nice explanation as to what's going on. Still reading through it all.
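Those udp_queue_rcv_skb drops are the same events that show up as receive errors in /proc/net/snmp. A small sketch for tracking the error percentage over a run; the sample counters below are made up to roughly match the ~1.3% rate mentioned earlier in the thread, and on a live box you would read the text from /proc/net/snmp instead.

```python
def udp_error_rate(snmp_text: str) -> float:
    """Percentage of inbound UDP packets counted as receive errors,
    parsed from the two 'Udp:' lines of /proc/net/snmp (header + values)."""
    udp = [line.split()[1:] for line in snmp_text.splitlines()
           if line.startswith("Udp:")]
    stats = dict(zip(udp[0], map(int, udp[1])))
    total = stats["InDatagrams"] + stats["InErrors"]
    return 100.0 * stats["InErrors"] / total if total else 0.0

# Fabricated counters approximating the ~70M-request runs described above.
sample = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 69100000 0 900000 69100000 900000 0\n"
)
print(round(udp_error_rate(sample), 2))  # 1.29
```

Sampling this before and after each sysctl change gives a single comparable number per run, which is easier to trend than eyeballing raw netstat output.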
> If your CPU utilization is still apparently low, you might be onto
> something with taskset/numa… Related things I have toyed with but don’t
> currently have in production:
>
> increasing kernel.sched_migration_cost a couple of orders of magnitude
> setting kernel.sched_autogroup_enabled=0
> systemctl stop irqbalance

I've had irqbalance stopped previously, and sched_autogroup_enabled is already set to 0. Initial mucking about a bit with sched_migration_cost gets a few more QPS through, so will run more tests. Thanks for this one; hadn't used it before.

> Lastly (mostly for posterity for the list, please don’t take this as
> “rtfm” if you’ve seen them already) here are some very useful in-depth
> (but generalized) performance tuning guides:

Will give them a read. I do like manuals :P

> … and for one last really crazy idea, you could try running a pair of
> named instances on the machine and fronting them with nginx’s supposedly
> scalable UDP load balancer. (As long as you don’t get a performance hit,
> it also opens up other interesting possibilities like being able to shift
> production load for maintenance on the named backends).

Yeah, I've had this thought. I'm pretty sure I've reached the limit of what BIND can do in a single NUMA node for the moment.

I will report back if any great inspiration or successful increases in throughput occur.

Stuart
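The nginx idea above would use its stream module, roughly as sketched below. The ports and the two-instance layout are hypothetical (each named instance would listen on its own port or address), and nginx must be built with stream support for this to load at all.

```
# nginx.conf: UDP load balancing across two local named instances
stream {
    upstream named_backends {
        server 127.0.0.1:5301;
        server 127.0.0.1:5302;
    }
    server {
        listen 53 udp;
        proxy_pass named_backends;
        proxy_responses 1;  # expect one reply datagram per DNS query
    }
}
```

One design note: pinning each backend named to a different NUMA node (e.g. via taskset, as discussed elsewhere in the thread) is what would make this worth trying, since a single named appears to stop scaling past one node.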
RE: [EXTERNAL] Re: Tuning suggestions for high-core-count Linux servers
> -Original Message-
> From: Plhu [mailto:p...@seznam.cz]
>
> a few simple ideas to your tests:
> - have you inspected the per-thread CPU? Aren't some of the threads
> overloaded?

I've tested both the auto-calculated values (one thread per available core) and explicitly overridden this. NUMA boundaries seem to be where things get wonky.

> - have you tried to get the statistics from the Bind server using the
> XML or JSON interface? It may bring you another insight to the errors.
> - I may have missed the connection count you use for testing - can you
> post it? More, how many entries do you have in your database? Can you
> share your named.conf (without any compromising entries)?

I'm testing to flood, so approximately 5 x 400 client count (dnsperf) with a 500-query backlog per test instance. Theoretically this should mean up to 4k5 active or back-logged connections (or just 2k5 if I read that documentation wrong).

> - what is your network environment? How many switches/routers are there
> between your simulator and the Bind server host?

This is a very closed environment: server-switch-server, all 10Gbit or 25Gbit. Verified the switch stats today; it's capable of 10x what I'm pushing through it currently.

> - is Bind the only running process on the tested server?

As always, there's the rest of the OS helper stuff, but BIND is the only thing actively doing anything (beyond the monitoring I'm doing). So no, nothing else is drawing massive amounts of either CPU or network resources.

> - what CPUs is the Bind server being run on?

From procinfo:

Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

2 of them.

> - is there numad running and while trying the taskset, have you
> selected the CPUs on the same processor? What does numastat show during
> the test?

I was manually issuing taskset after confirming the CPU allocations:

taskset 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,46,47 /usr/sbin/named -u named -n 24 -f

This is all of the cores (including HT) on the 2nd socket.
There was almost no performance difference between 12 (just the actual cores, no HTs) and 24 (with the HTs).

> - how many UDP sockets are in use during your test?

See above.

> Curious for the responses.
>
> Lukas

Stuart