Re: Sporadic but noticable SERVFAILs in specific nodes of an anycast resolving farm running BIND
On 3/8/2014 1:30 PM, sth...@nethelp.no wrote:
> >> One mitigation approach is to blackhole the domains using local zones.
> >
> > That's not much of a mitigation. Not having open resolvers would be mitigation.
>
> Not having open resolvers is good - but unfortunately doesn't help against misbehaving clients (e.g. small home routers with DNS proxies open to queries from the WAN side).

There is a fairly long list of things that closing open resolvers won't fix, but one wonders how that is relevant?

___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
Re: Sporadic but noticable SERVFAILs in specific nodes of an anycast resolving farm running BIND
LuKreme writes:
> On 08 Mar 2014, at 12:52 , Kostas Zorbadelos wrote:
>
>> One mitigation approach is to blackhole the domains using local zones.
>
> That's not much of a mitigation. Not having open resolvers would be
> mitigation.

It is a "quick and dirty" approach, since closing all open resolvers is much harder and, for now, wishful thinking. But of course I agree that action must be taken on the long-term solution as well.

Regards,
Kostas
Re: Sporadic but noticable SERVFAILs in specific nodes of an anycast resolving farm running BIND
> > One mitigation approach is to blackhole the domains using local zones.
>
> That's not much of a mitigation. Not having open resolvers would be
> mitigation.

Not having open resolvers is good - but unfortunately doesn't help against misbehaving clients (e.g. small home routers with DNS proxies open to queries from the WAN side).

Steinar Haug, Nethelp consulting, sth...@nethelp.no
Re: Sporadic but noticable SERVFAILs in specific nodes of an anycast resolving farm running BIND
On 08 Mar 2014, at 12:52 , Kostas Zorbadelos wrote:
> One mitigation approach is to blackhole the domains using local zones.

That's not much of a mitigation. Not having open resolvers would be mitigation.

-- 
Eyes the shady night has shut/Cannot see the record cut
And silence sounds no worse than cheers/After earth has stopped the ears.
Re: Sporadic but noticable SERVFAILs in specific nodes of an anycast resolving farm running BIND
Hello,

an update with the findings so far:

- IPv6 config on the servers was an issue, so we removed it and will test further later. Various people pointed to a Linux kernel issue and setting (net.ipv6.route.max_size), see https://lists.dns-oarc.net/pipermail/dns-operations/2014-February/011366.html
- our main issue was that we were being attacked. Open resolvers in our network were used to generate large numbers of queries for random subdomains of specific domains. Analyzing a small capture, we noticed the following domains, but I guess the list should not be considered complete:

  www.jxoyjt.com.cn
  liebiao.81ypf.com
  yuerengu.com.cn
  www.lgsf.net
  www.xxcfsb.com
  lie.zz85.com
  www.9009pk.com
  www.bcbang.com

One mitigation approach is to blackhole the domains using local zones.

-- 
Kostas Zorbadelos
twitter:@kzorbadelos http://gr.linkedin.com/in/kzorba

() www.asciiribbon.org - against HTML e-mail & proprietary attachments
/\
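For anyone wanting to try the local-zone approach mentioned above, here is a minimal sketch of how the zone stanzas could be generated. It is only an illustration under assumptions: the function name `emit_blackhole_zones` and the `ZONE_FILE` path are hypothetical, and the zone file itself (an otherwise empty zone with just SOA and NS records) is assumed to exist already.

```shell
#!/bin/sh
# Hypothetical path to an otherwise empty zone file (SOA and NS only).
# This is an assumption for illustration, not a path from the thread.
ZONE_FILE=${ZONE_FILE:-/etc/named/blackhole.zone}

# Read one domain per line on stdin and print a named.conf zone stanza
# for each, so that the resolver answers for it locally instead of recursing.
emit_blackhole_zones() {
    while IFS= read -r domain; do
        # Skip blank lines in the input list.
        [ -n "$domain" ] || continue
        printf 'zone "%s" { type master; file "%s"; };\n' "$domain" "$ZONE_FILE"
    done
}
```

Usage would be along the lines of `emit_blackhole_zones < domains.txt > blackhole.conf`, with the result pulled into named.conf through an `include` statement. Names under a zone set up this way should then be answered from the local empty zone rather than triggering recursion.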
Re: Sporadic but noticable SERVFAILs in specific nodes of an anycast resolving farm running BIND
Answering myself: this bug is probably not your problem, as BIND has received the DNS query; otherwise it would not answer with SERVFAIL.

regards
Klaus

On 05.03.2014 16:15, Klaus Darilion wrote:
Does it only happen for IPv6 DNS requests? Maybe it is related to this: https://open.nlnetlabs.nl/pipermail/nsd-users/2014-January/001783.html

klaus

On 05.03.2014 14:16, Kostas Zorbadelos wrote:
Greetings to all,

we operate an anycast caching resolving farm for our customer base, based on CentOS (6.4 or 6.5), BIND (9.9.2, 9.9.5 or the stock CentOS package BIND 9.8.2rc1-RedHat-9.8.2-0.23.rc1.el6_5.1) and quagga (the stock CentOS package).

The problem is that we have noticed sporadic but noticeable SERVFAILs on 3 out of 10 total machines. Cacti measurements obtained via the BIND XML interface show traffic from 1.5K queries/sec (lowest loaded machines) to 15K queries/sec (highest). On 3 specific machines in one geolocation, some time after a BIND restart (anywhere between half an hour and several hours), we start seeing SERVFAILs in resolutions. The 3 machines do not have the highest load in the farm (6-8K q/sec). The resolution problems are noticeable to the customers ending up on these machines, but do not show up as high numbers in the BIND XML Resolver statistics (ServFail number).

We reproduce the problem by querying for a specific domain name using a loop of the form

    while [ 1 ]; do clear; rndc flushname www.linux-tutorial.info; sleep 1; dig www.linux-tutorial.info @localhost; sleep 2; done | grep SERVFAIL

www.linux-tutorial.info is of course not the only domain experiencing resolution problems. The above loop can run for hours without issues during low-traffic hours (night, after a clean BIND restart), but during the day it shows quite a few SERVFAILs, which affect other domains as well.

During the problem we notice with tcpdump that when a SERVFAIL is produced, no query packet exits the server for resolution. We have noticed nothing in the BIND logs (we even tried raising debugging levels and logging all relevant categories). An example capture while running the above loop:

    # tcpdump -nnn -i any -p dst port 53 or src port 53 | grep 'linux-tutorial'
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
    14:33:03.590908 IP6 ::1.53059 > ::1.53: 15773+ A? www.linux-tutorial.info. (41)
    14:33:03.591292 IP 83.235.72.238.45157 > 213.133.105.6.53: 19156% [1au] A? www.linux-tutorial.info. (52)
    Success
    14:33:06.664411 IP6 ::1.45090 > ::1.53: 48526+ A? www.linux-tutorial.info. (41)
    14:33:06.664719 IP6 2a02:587:50da:b::1.23404 > 2a00:1158:4::add:a3.53: 30244% [1au] A? www.linux-tutorial.info. (52)
    Success
    14:33:31.434209 IP6 ::1.43397 > ::1.53: 26607+ A? www.linux-tutorial.info. (41)
    SERVFAIL
    14:33:43.672405 IP6 ::1.58282 > ::1.53: 27125+ A? www.linux-tutorial.info. (41)
    SERVFAIL
    14:33:49.706645 IP6 ::1.54936 > ::1.53: 40435+ A? www.linux-tutorial.info. (41)
    14:33:49.706976 IP6 2a02:587:50da:b::1.48961 > 2a00:1158:4::add:a3.53: 4287% [1au] A? www.linux-tutorial.info. (52)
    Success

The main actions we have taken on the problem machines are:

- change the BIND version (we initially used a custom compiled 9.9.2, moved to 9.9.5 and finally switched over to the CentOS stock package 9.8.2rc1). We noticed the problem in all versions.
- disable iptables (we use a ruleset with connection tracking on all of our machines, with no problems on the other machines in the farm). Again no solution.
- introduce a query-source-v6 address in named.conf (we already had query-source). Each machine has a single physical interface and 3 loopbacks with the anycast IPs, announced via Quagga ospfd to the rest of the network. No solution.

The main difference between the 3 machines and the rest is IPv6 operation. Those machines are dual stack, having a /30 (v4) and a /127 (v6) on the physical interface. Needless to say, the next trial is to remove the relevant IPv6 configuration.

I understand that there are many parameters to the problem; we have been trying to debug the issue for several days now. Any suggestion, suspicion or hint is highly welcome. I can provide all sorts of traces from the machines (I already have pcap files from the moment of the problem, plus pstack, rndc status, OS process limits, rndc recursing, rndc dumpdb -all, according to https://kb.isc.org/article/AA-00341/0/What-to-do-with-a-misbehaving-BIND-server.html)

Thanks in advance,
Kostas
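The observation that no query leaves the server when a SERVFAIL is produced can be checked mechanically on a saved text capture. The following is a rough sketch, assuming tcpdump text output in the format shown in the capture (one packet per line, without the hand-added Success/SERVFAIL notes) and assuming the cache is flushed before every probe, so that each client query should be followed by an upstream query. The function name `find_unforwarded` and the `no-upstream-query` label are made up for illustration.

```shell
#!/bin/sh
# Read tcpdump text lines on stdin and print the timestamp of every query
# sent to the local resolver (::1 or 127.0.0.1, port 53) that is not
# followed by a query to a non-local server on port 53.
find_unforwarded() {
    awk '
    # Report the pending client query, if any, as never forwarded.
    function flush_pending() {
        if (pending != "") print pending " no-upstream-query"
        pending = ""
    }
    # Only look at packets whose destination port is 53 (field 5 ends in ".53:").
    $4 == ">" && $5 ~ /\.53:$/ {
        if ($5 == "::1.53:" || $5 == "127.0.0.1.53:") {
            # A new client query: the previous one was never forwarded.
            flush_pending()
            pending = $1
        } else {
            # An upstream query: the pending client query was forwarded.
            pending = ""
        }
    }
    END { flush_pending() }
    '
}
```

Fed the eight packet lines of the capture above, this prints the 14:33:31.434209 and 14:33:43.672405 queries, which are the two marked SERVFAIL by hand. It is deliberately naive: it assumes one probe at a time and would need extra care on a busy capture where queries for the same name overlap.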
RE: Sporadic but noticable SERVFAILs in specific nodes of an anycast resolving farm running BIND
Hello

We are facing a similar problem: intermittent SERVFAIL responses on several domains, specifically during high traffic. Please note that IPv6 dual stack is not configured in the operating system and we are not using any IPv6 option in the BIND configuration file.

1- We compiled several BIND versions on different CentOS platforms:
   CentOS release 5.10 with BIND 9.9.5 and BIND 9.7.2-P2: problem persists
   CentOS release 5.6 with BIND 9.9.5 and BIND 9.7.2-P2: problem persists
2- We bypassed all network devices (firewall, shaper, IPS, load balancer): problem persists
3- tcpdump performed on the name servers showed the SERVFAIL in the capture
4- dig debugging output shows intermittent SERVFAIL:
   dig www.mcafee.com
   ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49448
   ot fo other domains
5- During our debugging we noticed a failure when using dig +trace:
   ;; Received 493 bytes from 192.5.5.241#53(f.root-servers.net) in 64 ms
   dig: couldn't get address for 'k.gtld-servers.net': failure

Regards
Daniel Dawalibi
Senior Systems Engineer
e-mail: daniel.dawal...@idm.net.lb
Jisr Al Bacha P.O. Box 11-316 Beirut Lebanon
tel +961 1 512513 ext. 366 | fax +961 1 510474
tech support 1282 | http://www.idm.net.lb
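The per-domain dig checks in step 4 can be batched so that the status of each lookup is logged with a timestamp instead of being read off the screen. This is only a sketch: the function names `dig_status` and `check_names`, the `NO_RESPONSE` label and the 2-second timeout are illustrative choices, and the server address is whatever resolver is under test.

```shell
#!/bin/sh
# Extract the status word (NOERROR, SERVFAIL, ...) from dig output on stdin,
# taken from the ";; ->>HEADER<<- opcode: ..., status: ..., id: ..." line.
dig_status() {
    sed -n 's/^;; ->>HEADER<<-.*status: \([A-Z]*\),.*$/\1/p' | head -n 1
}

# Query each name read from stdin against the given server and print
# "HH:MM:SS name STATUS". A missing header (e.g. a timeout) is shown
# as NO_RESPONSE.
check_names() {
    server=$1
    while IFS= read -r name; do
        # Skip blank lines in the input list.
        [ -n "$name" ] || continue
        # One attempt with a short timeout; stdin is redirected so that
        # the lookup cannot consume the list being read by the loop.
        status=$(dig +tries=1 +time=2 "$name" "@$server" </dev/null | dig_status)
        printf '%s %s %s\n' "$(date +%T)" "$name" "${status:-NO_RESPONSE}"
    done
}
```

Something like `check_names 127.0.0.1 < names.txt >> status.log`, run from cron or in a loop, would give a record that can be lined up against traffic graphs to see whether the failures track the high-traffic periods.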
Re: Sporadic but noticable SERVFAILs in specific nodes of an anycast resolving farm running BIND
On 05/03/14 15:15, Klaus Darilion wrote:
> Does it only happen for IPv6 DNS requests? Maybe it is related to this:
> https://open.nlnetlabs.nl/pipermail/nsd-users/2014-January/001783.html

Or, less likely, this: http://marc.info/?l=linux-netdev&m=139352943109400&w=2

-- 
Marco
Re: Sporadic but noticable SERVFAILs in specific nodes of an anycast resolving farm running BIND
Does it only happen for IPv6 DNS requests? Maybe it is related to this: https://open.nlnetlabs.nl/pipermail/nsd-users/2014-January/001783.html

klaus