RE: Frequent timeout
I will walk back my previous comments and just say that bandwidth may be in play, because any time you soak a circuit it is not good. Take a look at this query sequence:

dns.qry.type == 28 && dns.qry.name == "concurred.co"

Packet 42356 shows a query for concurred.co. Packets 42357/8 show 68.195.193.45 relaying the query to 62.138.132.21. Packets 43015/16 show 62.138.132.21 replying with its query response to 68.195.193.45. And that's it. Nothing is seen being sent back to 127.0.0.1, at least on the wire. By way of comparison, packet 161 shows 127.0.0.1 answering itself, so I would consider the earlier lack of a response a clue.

Moving on: packet 48874 shows 127.0.0.1 asking for the record again. This time we don't see any external communication. Packet 87174 shows 127.0.0.1 replying with server failure. It took nearly 25 seconds to decide upon a SERVFAIL, and that is another clue.

That said, there are heaps of queries where DNS worked as expected. I really had to dig for the above examples, because the vast majority of the server failure messages either get no reply on localhost or we never see the routable adapter on the server attempting to reach out for the answer. concurred.co is unique in that we see both behaviors: one query where it reaches out and one where it does not.

If the traffic that 127.0.0.1 is putting on the wire does not go out, I am thinking firewall, but you may be dealing with bandwidth exhaustion exclusively and it is presenting itself in this manner. Or you may have a server configuration issue or an underpowered server. Sometimes pcaps are black and white and give you a "here is your problem" answer; other times, like this one, they give us nothing conclusive to work with.

Since this server is sputtering, I would set about first stabilizing traffic from 127.0.0.1 going out. You need to see outbound traffic hit 127.0.0.1 and then hit your external adapter without missing. Boom, boom, boom on down the line.
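That "boom, boom, boom" request/response pairing can be checked mechanically rather than by eyeballing packet numbers. Here is a minimal sketch (mine, not from the thread) that flags DNS transactions that never got an answer; the `(dns_id, qname, is_response)` tuple format is an assumption, e.g. columns exported from Wireshark's packet dissection export.

```python
# Sketch: flag DNS queries that never got a matching response.
# Each record is (dns_id, qname, is_response) - an assumed export format,
# e.g. built from a Wireshark "Export Packet Dissections" CSV.

def unanswered_queries(records):
    """Return (dns_id, qname) pairs that were queried but never answered."""
    pending = set()
    for dns_id, qname, is_response in records:
        key = (dns_id, qname.lower())
        if is_response:
            pending.discard(key)   # response seen: transaction completed
        else:
            pending.add(key)       # query seen: awaiting a response
    return pending

# Example mirroring the thread: one answered lookup, one that vanished.
records = [
    (0x1a2b, "storage.pardot.com", False),
    (0x1a2b, "storage.pardot.com", True),
    (0x3c4d, "concurred.co", False),   # no matching response on the wire
]
```

Run against the full capture, anything left in the returned set is a query that went out but never came back, which is exactly the pattern described above.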
Hopefully others may have better more insightful suggestions. Good hunting! John -Original Message- From: Alex [mailto:mysqlstud...@gmail.com] Sent: Tuesday, September 11, 2018 1:57 PM To: John W. Blue; bind-users@lists.isc.org Subject: Re: Frequent timeout Hi, On Tue, Sep 11, 2018 at 2:47 PM John W. Blue wrote: > > If you use wireshark to slice n dice the pcap .. "dns.flags.rcode == 2" shows > all of your SERVFAIL happens on localhost. > > If you switch to "dns.qry.name == storage.pardot.com" every single query is > localhost. > > Unless you have another NIC that you are sending traffic over this does not > look like a bandwidth issue at this particular point in time. Thanks so much. I think I also may have confused things by suggesting it was related to bandwidth or utilization. I see it also happen now more regularly too. Can you ascertain why it is reporting these SERVFAILs? The queries are on localhost because /etc/resolv.conf lists localhost as the nameserver. Is that why we can't diagnose this? This most recent packet trace was started with "-i any". Why would the ones on localhost be the ones which are failing? I'm assuming postfix and/or some other process is querying bind on localhost to cause these errors? ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
Re: Frequent timeout
Hi, On Tue, Sep 11, 2018 at 2:47 PM John W. Blue wrote: > > If you use wireshark to slice n dice the pcap .. "dns.flags.rcode == 2" shows > all of your SERVFAIL happens on localhost. > > If you switch to "dns.qry.name == storage.pardot.com" every single query is > localhost. > > Unless you have another NIC that you are sending traffic over this does not > look like a bandwidth issue at this particular point in time. Thanks so much. I think I also may have confused things by suggesting it was related to bandwidth or utilization. I see it also happen now more regularly too. Can you ascertain why it is reporting these SERVFAILs? The queries are on localhost because /etc/resolv.conf lists localhost as the nameserver. Is that why we can't diagnose this? This most recent packet trace was started with "-i any". Why would the ones on localhost be the ones which are failing? I'm assuming postfix and/or some other process is querying bind on localhost to cause these errors? ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
RE: Frequent timeout
If you use wireshark to slice n dice the pcap .. "dns.flags.rcode == 2" shows all of your SERVFAIL happens on localhost. If you switch to "dns.qry.name == storage.pardot.com" every single query is localhost. Unless you have another NIC that you are sending traffic over this does not look like a bandwidth issue at this particular point in time. John -Original Message- From: Alex [mailto:mysqlstud...@gmail.com] Sent: Tuesday, September 11, 2018 1:19 PM To: bind-users@lists.isc.org; John W. Blue Subject: Re: Frequent timeout Hi, Here is a much more reasonable network capture during the period where there are numerous SERVFAIL errors from bind over a short period of high utilization. https://drive.google.com/file/d/1UrzvB-pumVjPvlmd6ZSnHi-XVynI8y3y/view?usp=sharing This is when our 20mbs cable upstream link was saturated and resulted in DNS query timeout errors. resulting in these SERVFAIL messages. The packet trace shows multiple TCP out-of-order and TCP Dup ACK packets. Would these retransmits cause enough of a delay for the queries to fail? Would someone more knowledgeable look into these packet errors for me? It might seem obvious that we should increase the bandwidth of our link, since it occurs during periods of high utilization, but it doesn't occur on our other 10mbs DIA links in the datacenter when the link is saturated. 
11-Sep-2018 11:53:25.692 query-errors: info: client @0x7fc7ef343740 127.0.0.1#50821 (8cb54bfffc54eee06342d5619246d67166abc6cf.ebl.msbl.org): query failed (SERVFAIL) for 8cb54bfffc54eee06342d5619246d67166abc6cf.ebl.msbl.org/IN/A at ../../../bin/named/query.c:8580 11-Sep-2018 11:53:25.687 query-errors: debug 2: fetch completed at ../../../lib/dns/resolver.c:3927 for ac949d5d947f8f5cad13e98c68bac6f284c367fd.ebl.msbl.org/A in 30.84: timed out/success [domain:ebl.msbl.org,referral:0,restart:6,qrysent:11,timeout:10,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0] Thanks, Alex On Mon, Sep 10, 2018 at 12:11 PM Alex wrote: > > Hi, > > > >> tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap > > >> > > >> You don't need all of the extra stuff because -s0 captures the full > > >> packet. > > > > On 06.09.18 18:42, Alex wrote: > > >This is the command I ran to produce the pcap file I sent: > > > > > ># tcpdump -s0 -vv -i eth0 -nn -w domain-capture-eth0-090518.pcap > > >udp dst port domain > > > > and that is the problem. "dst port domain" captures packets going to > > DNS servers, not responses coming back. > > > > "-vv" and "-nn" are useless when producing packet capture and "-s0" > > is default for some time. I often add "-U" so file is flushed wich each > > packet. > > > > you can strip incoming queries by using filter > > > > "(src host 68.195.XXX.45 and dst port domain) or (src port domain and dst > > host 68.195.XXX.45)" > > I've generated a new tcpdump file using these criteria and uploaded it here: > https://drive.google.com/file/d/1F0VML8yPZJbcDZTys2hXDhjzv1UaBHuV/view > ?usp=sharing > > The SERVFAIL errors didn't really occur over the weekend. I believe it > has something to do with mail volume, link congestion/bandwidth > utilization. 
> > Thanks, > Alex
Re: Frequent timeout
On Tue, 2018-09-11 at 14:19 -0400, Alex wrote:
> This is when our 20mbs cable upstream link was saturated and resulted
> in DNS query timeout errors. resulting in these SERVFAIL messages.

Not specific to DNS, but this looks like a bufferbloat problem, which is common with cable modems. When the upstream link is saturated, the buffers in the interface device (cable modem or possibly a standalone router) become full. If there is a lot of buffer space, the latency becomes very large, and that will cause many problems, including issues with DNS. A partial fix is to prioritize small packets like DNS queries and TCP ACKs, so they don't wait behind a large queue of full-size packets. A more complete fix is switching to the fq_codel queue discipline. Google "bufferbloat" for more details.
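For what it's worth, on a Linux box acting as the router the fq_codel switch is a one-liner. This is an illustrative sketch only, assuming eth0 is the WAN-facing interface (your device names and rates will differ), not a tested recipe:

```
# Replace the default qdisc (often pfifo_fast) with fq_codel on the WAN NIC:
tc qdisc replace dev eth0 root fq_codel

# Or make it the system-wide default via sysctl (kernel 3.12+):
#   net.core.default_qdisc = fq_codel
```

Note that a smarter qdisc only helps if the queue actually forms on this box; if the cable modem's own buffer is the bottleneck, you generally also need to shape outbound traffic slightly below the modem's upstream rate so that fq_codel, not the modem, controls the queue.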
Re: Frequent timeout
Hi,

Here is a much more reasonable network capture during the period where there are numerous SERVFAIL errors from bind over a short period of high utilization.
https://drive.google.com/file/d/1UrzvB-pumVjPvlmd6ZSnHi-XVynI8y3y/view?usp=sharing

This is when our 20mbs cable upstream link was saturated, resulting in DNS query timeout errors and these SERVFAIL messages. The packet trace shows multiple TCP out-of-order and TCP Dup ACK packets. Would these retransmits cause enough of a delay for the queries to fail? Would someone more knowledgeable look into these packet errors for me?

It might seem obvious that we should increase the bandwidth of our link, since this occurs during periods of high utilization, but it doesn't occur on our other 10mbs DIA links in the datacenter when those links are saturated.

11-Sep-2018 11:53:25.692 query-errors: info: client @0x7fc7ef343740 127.0.0.1#50821 (8cb54bfffc54eee06342d5619246d67166abc6cf.ebl.msbl.org): query failed (SERVFAIL) for 8cb54bfffc54eee06342d5619246d67166abc6cf.ebl.msbl.org/IN/A at ../../../bin/named/query.c:8580
11-Sep-2018 11:53:25.687 query-errors: debug 2: fetch completed at ../../../lib/dns/resolver.c:3927 for ac949d5d947f8f5cad13e98c68bac6f284c367fd.ebl.msbl.org/A in 30.84: timed out/success [domain:ebl.msbl.org,referral:0,restart:6,qrysent:11,timeout:10,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

Thanks,
Alex

On Mon, Sep 10, 2018 at 12:11 PM Alex wrote:
>
> Hi,
>
> > >> tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap
> > >>
> > >> You don't need all of the extra stuff because -s0 captures the full
> > >> packet.
> >
> > On 06.09.18 18:42, Alex wrote:
> > >This is the command I ran to produce the pcap file I sent:
> > >
> > ># tcpdump -s0 -vv -i eth0 -nn -w domain-capture-eth0-090518.pcap udp
> > >dst port domain
> >
> > and that is the problem. "dst port domain" captures packets going to DNS
> > servers, not responses coming back.
> >
> > "-vv" and "-nn" are useless when producing a packet capture and "-s0" is
> > the default for some time. I often add "-U" so the file is flushed with
> > each packet.
> >
> > you can strip incoming queries by using filter
> >
> > "(src host 68.195.XXX.45 and dst port domain) or (src port domain and dst
> > host 68.195.XXX.45)"
>
> I've generated a new tcpdump file using these criteria and uploaded it here:
> https://drive.google.com/file/d/1F0VML8yPZJbcDZTys2hXDhjzv1UaBHuV/view?usp=sharing
>
> The SERVFAIL errors didn't really occur over the weekend. I believe it
> has something to do with mail volume, link congestion/bandwidth
> utilization.
>
> Thanks,
> Alex
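The bracketed counters in that "fetch completed" line are worth reading closely: qrysent:11 with timeout:10 means ten of eleven upstream queries timed out before named gave up, which fits the saturation theory. A small sketch (mine, not from the thread) to pull those counters out of query-errors log lines so they can be tallied per domain:

```python
import re

# Sketch: extract the bracketed key:value counters from BIND's
# query-errors "fetch completed" debug lines.
STATS_RE = re.compile(r"\[([^\]]+)\]")

def fetch_stats(logline):
    """Return the key:value counters from a 'fetch completed' line as a dict."""
    m = STATS_RE.search(logline)
    if not m:
        return {}
    out = {}
    for pair in m.group(1).split(","):
        k, _, v = pair.partition(":")
        out[k] = v
    return out

# Abbreviated copy of the line quoted above:
line = ("11-Sep-2018 11:53:25.687 query-errors: debug 2: fetch completed "
        "[domain:ebl.msbl.org,referral:0,restart:6,qrysent:11,timeout:10,"
        "lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]")
```

Feeding every query-errors line through this and summing `timeout` versus `qrysent` per domain would show whether the timeouts cluster on the RBL zones or affect everything equally.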
Re: Frequent timeout
Hi, > >> tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap > >> > >> You don't need all of the extra stuff because -s0 captures the full packet. > > On 06.09.18 18:42, Alex wrote: > >This is the command I ran to produce the pcap file I sent: > > > ># tcpdump -s0 -vv -i eth0 -nn -w domain-capture-eth0-090518.pcap udp > >dst port domain > > and that is the problem. "dst port domain" captures packets going to DNS > servers, not responses coming back. > > "-vv" and "-nn" are useless when producing packet capture and "-s0" is > default for some time. I often add "-U" so file is flushed wich each packet. > > you can strip incoming queries by using filter > > "(src host 68.195.XXX.45 and dst port domain) or (src port domain and dst > host 68.195.XXX.45)" I've generated a new tcpdump file using these criteria and uploaded it here: https://drive.google.com/file/d/1F0VML8yPZJbcDZTys2hXDhjzv1UaBHuV/view?usp=sharing The SERVFAIL errors didn't really occur over the weekend. I believe it has something to do with mail volume, link congestion/bandwidth utilization. Thanks, Alex > > >I should also mention that, while eth0 is the physical device, there > >is a bridge set up to support virtual machines (none of which were > >active). Hopefully that's not the reason! (real IP obscured). > > not the reason, but using "-i br0" could be safer then. > > Note that the IP was seen in packet capture you have published, not needed > to hide it now. > > -- > Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/ > Warning: I wish NOT to receive e-mail advertising to this address. > Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu. > They that can give up essential liberty to obtain a little temporary > safety deserve neither liberty nor safety. 
-- Benjamin Franklin, 1759
Re: Frequent timeout
On Thu, Sep 6, 2018 at 5:56 PM John W. Blue wrote:
> So that file is full of nothing but queries and no responses which, sadly, is useless.
>
> Run:
>
> tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap
>
> You don't need all of the extra stuff because -s0 captures the full packet.

On 06.09.18 18:42, Alex wrote:
> This is the command I ran to produce the pcap file I sent:
>
> # tcpdump -s0 -vv -i eth0 -nn -w domain-capture-eth0-090518.pcap udp
> dst port domain

and that is the problem. "dst port domain" captures packets going to DNS servers, not responses coming back.

"-vv" and "-nn" are useless when producing a packet capture, and "-s0" has been the default for some time. I often add "-U" so the file is flushed with each packet.

you can strip incoming queries by using the filter

"(src host 68.195.XXX.45 and dst port domain) or (src port domain and dst host 68.195.XXX.45)"

> I should also mention that, while eth0 is the physical device, there
> is a bridge set up to support virtual machines (none of which were
> active). Hopefully that's not the reason! (real IP obscured).

not the reason, but using "-i br0" could be safer, then.

Note that the IP was seen in the packet capture you have published, no need to hide it now.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety. -- Benjamin Franklin, 1759
Re: Frequent timeout
On Thu, Sep 6, 2018 at 5:56 PM John W. Blue wrote: > > So that file is full of nothing but queries and no responses which, sadly, is > useless. > > Run: > > tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap > > You don't need all of the extra stuff because -s0 captures the full packet. This is the command I ran to produce the pcap file I sent: # tcpdump -s0 -vv -i eth0 -nn -w domain-capture-eth0-090518.pcap udp dst port domain I have a few other pcap files here. Can you tell me the query you ran in wireshark to search for the SERVFAIL packets? Perhaps I can find them here. I have another that I just realized was running for quite a while and has grown to 1.5GB until I just stopped it. I also have another that was run with "-i any", but it's also quite large. I'd otherwise probably have to wait until tomorrow to run it again, as it appears to happen during periods of high traffic. I should also mention that, while eth0 is the physical device, there is a bridge set up to support virtual machines (none of which were active). Hopefully that's not the reason! (real IP obscured). 
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 68.195.XXX.45  netmask 255.255.255.248  broadcast 68.195.XXX.47
        inet6 fe80::16da:e9ff:fe97:ab71  prefixlen 64  scopeid 0x20<link>
        inet6 ::16da:e9ff:fe97:ab71  prefixlen 64  scopeid 0x0<global>
        ether 14:da:e9:97:ab:71  txqueuelen 1000  (Ethernet)
        RX packets 54953236  bytes 45182800578 (42.0 GiB)
        RX errors 0  dropped 231612  overruns 0  frame 0
        TX packets 68345276  bytes 33687959055 (31.3 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::16da:e9ff:fe97:ab71  prefixlen 64  scopeid 0x20<link>
        ether 14:da:e9:97:ab:71  txqueuelen 1000  (Ethernet)
        RX packets 61078845  bytes 46596159121 (43.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 68733541  bytes 34028363069 (31.6 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
        device interrupt 16  memory 0xdf20-df22

Thanks,
Alex

>
> John
>
> -Original Message-
> From: bind-users [mailto:bind-users-boun...@lists.isc.org] On Behalf Of Alex
> Sent: Thursday, September 06, 2018 2:54 PM
> To: bind-users@lists.isc.org
> Subject: Re: Frequent timeout
>
> On Thu, Sep 6, 2018 at 3:05 PM John W. Blue wrote:
> >
> > Alex,
> >
> > Have you uploaded this pcap with the SERVFAIL's? I didn't have time to
> > look at your first upload but can review this one.
>
> Thanks very much. I've uploaded the pcap file here. It's about ~100MB
> compressed, and represents about 4hrs of data, I believe.
> https://drive.google.com/file/d/1KUpDoQ2zuz5ITeKuO0BhlK7JvWSUAG3B/view?usp=sharing
>
> Thanks,
> Alex
RE: Frequent timeout
So that file is full of nothing but queries and no responses which, sadly, is useless. Run: tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap You don't need all of the extra stuff because -s0 captures the full packet. John -Original Message- From: bind-users [mailto:bind-users-boun...@lists.isc.org] On Behalf Of Alex Sent: Thursday, September 06, 2018 2:54 PM To: bind-users@lists.isc.org Subject: Re: Frequent timeout On Thu, Sep 6, 2018 at 3:05 PM John W. Blue wrote: > > Alex, > > Have you uploaded this pcap with the SERVFAIL's? I didn't have time to look > at your first upload but can review this one. Thanks very much. I've uploaded the pcap file here. It's about ~100MB compressed, and represents about 4hrs of data, I believe. https://drive.google.com/file/d/1KUpDoQ2zuz5ITeKuO0BhlK7JvWSUAG3B/view?usp=sharing Thanks, Alex ___ Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list bind-users mailing list bind-users@lists.isc.org https://lists.isc.org/mailman/listinfo/bind-users
Re: Frequent timeout
On Thu, Sep 6, 2018 at 3:05 PM John W. Blue wrote: > > Alex, > > Have you uploaded this pcap with the SERVFAIL's? I didn't have time to look > at your first upload but can review this one. Thanks very much. I've uploaded the pcap file here. It's about ~100MB compressed, and represents about 4hrs of data, I believe. https://drive.google.com/file/d/1KUpDoQ2zuz5ITeKuO0BhlK7JvWSUAG3B/view?usp=sharing Thanks, Alex > > John > > -Original Message- > From: bind-users [mailto:bind-users-boun...@lists.isc.org] On Behalf Of Alex > Sent: Thursday, September 06, 2018 1:49 PM > To: c...@byington.org; bind-users@lists.isc.org > Subject: Re: Frequent timeout > > Hi, > > On Mon, Sep 3, 2018 at 12:45 PM Carl Byington wrote: > > > > -BEGIN PGP SIGNED MESSAGE- > > Hash: SHA512 > > > > On Sun, 2018-09-02 at 21:54 -0400, Alex wrote: > > > Do you have any other ideas on how I can isolate this problem? > > > > Run tcpdump on the external ethernet connection. > > > > tcpdump -s0 -vv -i %s -nn -w /tmp/outputfile udp dst port domain > > I've captured some packets that I believe include the packets relating to the > SERVFAIL errors I've been receiving. Now I have to figure out how to go > through them. > > In the meantime, I've configured /etc/resolv.conf to send queries to a remote > system of ours, and the errors have (mostly) stopped. > > I also notice some traces take an abnormal amount of time. Ping times to > google.com are less than 20ms, but this trace shows reaching the root servers > takes 104ms: > > # dig +trace +nodnssec google.com > > ; <<>> DiG 9.11.4-P1-RedHat-9.11.4-5.P1.fc28 <<>> +trace +nodnssec google.com > ;; global options: +cmd > . 3451IN NS g.root-servers.net. > . 3451IN NS k.root-servers.net. > . 3451IN NS j.root-servers.net. > . 3451IN NS c.root-servers.net. > . 3451IN NS i.root-servers.net. > . 3451IN NS e.root-servers.net. > . 3451IN NS m.root-servers.net. > . 3451IN NS l.root-servers.net. > . 3451IN NS a.root-servers.net. > . 3451IN NS h.root-servers.net. > . 
3451IN NS b.root-servers.net. > . 3451IN NS d.root-servers.net. > . 3451IN NS f.root-servers.net. > ;; Received 839 bytes from 127.0.0.1#53(127.0.0.1) in 0 ms > > com.172800 IN NS h.gtld-servers.net. > com.172800 IN NS g.gtld-servers.net. > com.172800 IN NS b.gtld-servers.net. > com.172800 IN NS j.gtld-servers.net. > com.172800 IN NS f.gtld-servers.net. > com.172800 IN NS m.gtld-servers.net. > com.172800 IN NS c.gtld-servers.net. > com.172800 IN NS d.gtld-servers.net. > com.172800 IN NS k.gtld-servers.net. > com.172800 IN NS i.gtld-servers.net. > com.172800 IN NS l.gtld-servers.net. > com.172800 IN NS a.gtld-servers.net. > com.172800 IN NS e.gtld-servers.net. > ;; Received 835 bytes from 202.12.27.33#53(m.root-servers.net) in 104 ms > > google.com. 172800 IN NS ns2.google.com. > google.com. 172800 IN NS ns1.google.com. > google.com. 172800 IN NS ns3.google.com. > google.com. 172800 IN NS ns4.google.com. > ;; Received 287 bytes from 192.33.14.30#53(b.gtld-servers.net) in 44 ms > > ;; expected opt record in response > google.com. 300 IN A 172.217.10.14 > ;; Received 44 bytes from 216.239.36.10#53(ns3.google.com) in 29 ms > > Running the same trace again showed 129ms. > > I also located this warning: > 06-Sep-2018 12:03:33.304 client: warning: client @0x7f502c1d3d50 > 127.0.0.1#60968 (cmail20.com.multi.surbl.org): recursive-clients soft limit > exceeded (901/900/1000), aborting oldest query > > I've increased recursive-clients to 2500 but the SERVFAIL errors continue. > > There are also a ton of lame-server entries, many of which are related to one > RBL or another, as part of my postscreen config: > 06-Se
RE: Frequent timeout
Alex, Have you uploaded this pcap with the SERVFAIL's? I didn't have time to look at your first upload but can review this one. John -Original Message- From: bind-users [mailto:bind-users-boun...@lists.isc.org] On Behalf Of Alex Sent: Thursday, September 06, 2018 1:49 PM To: c...@byington.org; bind-users@lists.isc.org Subject: Re: Frequent timeout Hi, On Mon, Sep 3, 2018 at 12:45 PM Carl Byington wrote: > > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA512 > > On Sun, 2018-09-02 at 21:54 -0400, Alex wrote: > > Do you have any other ideas on how I can isolate this problem? > > Run tcpdump on the external ethernet connection. > > tcpdump -s0 -vv -i %s -nn -w /tmp/outputfile udp dst port domain I've captured some packets that I believe include the packets relating to the SERVFAIL errors I've been receiving. Now I have to figure out how to go through them. In the meantime, I've configured /etc/resolv.conf to send queries to a remote system of ours, and the errors have (mostly) stopped. I also notice some traces take an abnormal amount of time. Ping times to google.com are less than 20ms, but this trace shows reaching the root servers takes 104ms: # dig +trace +nodnssec google.com ; <<>> DiG 9.11.4-P1-RedHat-9.11.4-5.P1.fc28 <<>> +trace +nodnssec google.com ;; global options: +cmd . 3451IN NS g.root-servers.net. . 3451IN NS k.root-servers.net. . 3451IN NS j.root-servers.net. . 3451IN NS c.root-servers.net. . 3451IN NS i.root-servers.net. . 3451IN NS e.root-servers.net. . 3451IN NS m.root-servers.net. . 3451IN NS l.root-servers.net. . 3451IN NS a.root-servers.net. . 3451IN NS h.root-servers.net. . 3451IN NS b.root-servers.net. . 3451IN NS d.root-servers.net. . 3451IN NS f.root-servers.net. ;; Received 839 bytes from 127.0.0.1#53(127.0.0.1) in 0 ms com.172800 IN NS h.gtld-servers.net. com.172800 IN NS g.gtld-servers.net. com.172800 IN NS b.gtld-servers.net. com.172800 IN NS j.gtld-servers.net. com.172800 IN NS f.gtld-servers.net. com.172800 IN NS m.gtld-servers.net. 
com.172800 IN NS c.gtld-servers.net. com.172800 IN NS d.gtld-servers.net. com.172800 IN NS k.gtld-servers.net. com.172800 IN NS i.gtld-servers.net. com.172800 IN NS l.gtld-servers.net. com.172800 IN NS a.gtld-servers.net. com.172800 IN NS e.gtld-servers.net. ;; Received 835 bytes from 202.12.27.33#53(m.root-servers.net) in 104 ms google.com. 172800 IN NS ns2.google.com. google.com. 172800 IN NS ns1.google.com. google.com. 172800 IN NS ns3.google.com. google.com. 172800 IN NS ns4.google.com. ;; Received 287 bytes from 192.33.14.30#53(b.gtld-servers.net) in 44 ms ;; expected opt record in response google.com. 300 IN A 172.217.10.14 ;; Received 44 bytes from 216.239.36.10#53(ns3.google.com) in 29 ms Running the same trace again showed 129ms. I also located this warning: 06-Sep-2018 12:03:33.304 client: warning: client @0x7f502c1d3d50 127.0.0.1#60968 (cmail20.com.multi.surbl.org): recursive-clients soft limit exceeded (901/900/1000), aborting oldest query I've increased recursive-clients to 2500 but the SERVFAIL errors continue. There are also a ton of lame-server entries, many of which are related to one RBL or another, as part of my postscreen config: 06-Sep-2018 13:16:50.686 lame-servers: info: connection refused resolving '48.167.85.209.zz.countries.nerd.dk/A/IN': 195.182.36.121#53 06-Sep-2018 13:16:50.706 lame-servers: info: connection refused resolving '48.167.85.209.bb.barracudacentral.org/A/IN': 64.235.154.72#53 06-Sep-2018 13:16:51.308 lame-servers: info: connection refused resolving '48.167.85.209.bl.blocklist.de/A/IN': 185.21.103.31#53 06-Sep-2018 13:16:54.798 lame-servers: info: connection refused resolving 'e51dd24f684d212a7da1119b23603b0f.generic.ixhash.net/A/IN': 178.254.39.16#53 06-Sep-2018 13:16:54.799 lame-servers: info: connection refused resolving 'f4d997d8949e6dbd30f6a418ad364589.generic.ixhash.net/A/IN': 178.254.39.16#53 06-Sep-2018 13:16:55.762 lame-se
Re: Frequent timeout
Hi,

On Mon, Sep 3, 2018 at 12:45 PM Carl Byington wrote:
>
> On Sun, 2018-09-02 at 21:54 -0400, Alex wrote:
> > Do you have any other ideas on how I can isolate this problem?
>
> Run tcpdump on the external ethernet connection.
>
> tcpdump -s0 -vv -i %s -nn -w /tmp/outputfile udp dst port domain

I've captured some packets that I believe include the packets relating to the SERVFAIL errors I've been receiving. Now I have to figure out how to go through them.

In the meantime, I've configured /etc/resolv.conf to send queries to a remote system of ours, and the errors have (mostly) stopped.

I also notice some traces take an abnormal amount of time. Ping times to google.com are less than 20ms, but this trace shows reaching the root servers takes 104ms:

# dig +trace +nodnssec google.com

; <<>> DiG 9.11.4-P1-RedHat-9.11.4-5.P1.fc28 <<>> +trace +nodnssec google.com
;; global options: +cmd
.                   3451    IN  NS  g.root-servers.net.
.                   3451    IN  NS  k.root-servers.net.
.                   3451    IN  NS  j.root-servers.net.
.                   3451    IN  NS  c.root-servers.net.
.                   3451    IN  NS  i.root-servers.net.
.                   3451    IN  NS  e.root-servers.net.
.                   3451    IN  NS  m.root-servers.net.
.                   3451    IN  NS  l.root-servers.net.
.                   3451    IN  NS  a.root-servers.net.
.                   3451    IN  NS  h.root-servers.net.
.                   3451    IN  NS  b.root-servers.net.
.                   3451    IN  NS  d.root-servers.net.
.                   3451    IN  NS  f.root-servers.net.
;; Received 839 bytes from 127.0.0.1#53(127.0.0.1) in 0 ms

com.                172800  IN  NS  h.gtld-servers.net.
com.                172800  IN  NS  g.gtld-servers.net.
com.                172800  IN  NS  b.gtld-servers.net.
com.                172800  IN  NS  j.gtld-servers.net.
com.                172800  IN  NS  f.gtld-servers.net.
com.                172800  IN  NS  m.gtld-servers.net.
com.                172800  IN  NS  c.gtld-servers.net.
com.                172800  IN  NS  d.gtld-servers.net.
com.                172800  IN  NS  k.gtld-servers.net.
com.                172800  IN  NS  i.gtld-servers.net.
com.                172800  IN  NS  l.gtld-servers.net.
com.                172800  IN  NS  a.gtld-servers.net.
com.                172800  IN  NS  e.gtld-servers.net.
;; Received 835 bytes from 202.12.27.33#53(m.root-servers.net) in 104 ms

google.com.         172800  IN  NS  ns2.google.com.
google.com.         172800  IN  NS  ns1.google.com.
google.com.         172800  IN  NS  ns3.google.com.
google.com.         172800  IN  NS  ns4.google.com.
;; Received 287 bytes from 192.33.14.30#53(b.gtld-servers.net) in 44 ms

;; expected opt record in response
google.com.         300     IN  A   172.217.10.14
;; Received 44 bytes from 216.239.36.10#53(ns3.google.com) in 29 ms

Running the same trace again showed 129ms.

I also located this warning:
06-Sep-2018 12:03:33.304 client: warning: client @0x7f502c1d3d50 127.0.0.1#60968 (cmail20.com.multi.surbl.org): recursive-clients soft limit exceeded (901/900/1000), aborting oldest query

I've increased recursive-clients to 2500 but the SERVFAIL errors continue.

There are also a ton of lame-server entries, many of which are related to one RBL or another, as part of my postscreen config:
06-Sep-2018 13:16:50.686 lame-servers: info: connection refused resolving '48.167.85.209.zz.countries.nerd.dk/A/IN': 195.182.36.121#53
06-Sep-2018 13:16:50.706 lame-servers: info: connection refused resolving '48.167.85.209.bb.barracudacentral.org/A/IN': 64.235.154.72#53
06-Sep-2018 13:16:51.308 lame-servers: info: connection refused resolving '48.167.85.209.bl.blocklist.de/A/IN': 185.21.103.31#53
06-Sep-2018 13:16:54.798 lame-servers: info: connection refused resolving 'e51dd24f684d212a7da1119b23603b0f.generic.ixhash.net/A/IN': 178.254.39.16#53
06-Sep-2018 13:16:54.799 lame-servers: info: connection refused resolving 'f4d997d8949e6dbd30f6a418ad364589.generic.ixhash.net/A/IN': 178.254.39.16#53
06-Sep-2018 13:16:55.762 lame-servers: info: connection refused resolving '2.164.177.209.bb.barracudacentral.org/A/IN': 64.235.145.15#53
06-Sep-2018 13:16:55.845 lame-servers: info: connection refused resolving '2.164.177.209.bb.barracudacentral.org/A/IN': 64.235.154.72#53

What would be a cause of such a significant delay in reaching the root servers?

Thanks,
Alex
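For reference, the recursion knobs mentioned above live in the options block of named.conf. The numbers here are illustrative, not recommendations, and note that resolver-query-timeout changed units (seconds to milliseconds) in BIND 9.12, so check the ARM for the exact build in use:

```
options {
    // The warning above shows a hard ceiling of 1000 (soft limit 900);
    // raising this only helps if upstream queries are actually completing.
    recursive-clients 2500;

    // How long named keeps retrying upstreams before answering SERVFAIL.
    // Interpreted as seconds in BIND 9.11 and earlier.
    resolver-query-timeout 10;
};
```

If the underlying problem is a saturated link, these settings only change how long clients pile up before failing; they will not make the SERVFAILs go away on their own.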
Re: Frequent timeout
On Sun, 2018-09-02 at 21:54 -0400, Alex wrote:
> Do you have any other ideas on how I can isolate this problem?

Run tcpdump on the external ethernet connection.

tcpdump -s0 -vv -i %s -nn -w /tmp/outputfile udp dst port domain
Re: Frequent timeout
Hi,

> > When trying to resolve any of these manually, it just returns
> > NXDOMAIN.
>
> What does
>
> dig -4 71.161.85.209.hostkarma.junkemailfilter.com +trace +nodnssec
>
> show, and is it consistently NXDOMAIN? That ends here with:
>
> 71.161.85.209.hostkarma.junkemailfilter.com. 2100 IN A 127.0.0.3
> 71.161.85.209.hostkarma.junkemailfilter.com. 2100 IN A 127.0.1.1
> ;; Received 93 bytes from 184.105.182.249#53(rbl1.junkemailfilter.com) in 20 ms

It shows the same here now, at least for the ones which resolve. Others still return NXDOMAIN. I was previously just using "host", but I suppose it's also possible that's one I didn't check. It's also possible they're no longer blacklisted by these RBLs. My point was that none of them returned SERVFAIL.

I thought using dig or host to resolve these hosts manually would return the same SERVFAIL that the bind resolver reported. What could be different that resulted in what appeared to be the majority of queries returning SERVFAIL in named.debug.log at the time the mail was received? Would high network utilization cause that? I assume that would cause the timeout, but how can I be sure? Isn't ethernet designed to communicate congestion at the lower levels to prevent that kind of thing from occurring? Is there a bind configuration that would make it more resilient?

> > I also isolated a packet with the "server failure" information, but
> > I'm unable to figure out what the data means. Would someone be
> > interested in evaluating it for me? It's a 146-byte pcap file.
> > https://drive.google.com/open?id=1Ui893Lg61psZCR8I_9SJtNqs-Sil_br
>
> That is just the reply from bind to some other process running on the
> same machine, reporting the server failure.

Oh, right, because it's over loopback. This is probably from postfix's postscreen doing the querying. This is not the same as one of the SERVFAIL entries from named.debug.log? Do you have any other ideas on how I can isolate this problem?
Re: Frequent timeout
On Sat, 2018-09-01 at 23:45 -0400, Alex wrote:
> (71.161.85.209.hostkarma.junkemailfilter.com): query failed (SERVFAIL)
> (71.161.85.209.bl.score.senderscore.com): query failed (SERVFAIL)
> When trying to resolve any of these manually, it just returns
> NXDOMAIN.

What does

dig -4 71.161.85.209.hostkarma.junkemailfilter.com +trace +nodnssec

show, and is it consistently NXDOMAIN? That ends here with:

71.161.85.209.hostkarma.junkemailfilter.com. 2100 IN A 127.0.0.3
71.161.85.209.hostkarma.junkemailfilter.com. 2100 IN A 127.0.1.1
;; Received 93 bytes from 184.105.182.249#53(rbl1.junkemailfilter.com) in 20 ms

> I also isolated a packet with the "server failure" information, but
> I'm unable to figure out what the data means. Would someone be
> interested in evaluating it for me? It's a 146-byte pcap file.
> https://drive.google.com/open?id=1Ui893Lg61psZCR8I_9SJtNqs-Sil_br

That is just the reply from bind to some other process running on the same machine, reporting the server failure.
Re: Frequent timeout
Hi,

It was reported there was a permissions problem with my Google Drive link to the pcap file, only allowing access to Google users. This should now be public:

https://drive.google.com/file/d/1Ui893Lg61psZCR8I_9SJtNqs-Sil_br5/view?usp=sharing

Thanks,
Alex

On Sat, Sep 1, 2018 at 11:45 PM Alex wrote:
>
> On Sat, Sep 1, 2018 at 11:25 PM Carl Byington wrote:
> >
> > On Fri, 2018-08-31 at 17:18 -0400, Alex wrote:
> > > ../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in
> >
> > After 4 seconds, I get SERVFAIL on that name.
>
> Thank you for your help. Perhaps I picked a bad example?
>
> I happened to have a grep running against my current named.debug.log,
> and as I received your email, what I believe is a much more
> representative display of the problem occurred. I also have a packet
> capture below.
>
> It's probably mangled posting it here, so I'll also put it on
> pastebin, but it's a rapid-fire display of a series of failed queries
> at once. I've cut out much of the info preceding and following to make
> it clearer here. These all occurred within 20ms of each other.
> (71.161.85.209.ubl.unsubscore.com): query failed (SERVFAIL)
> (71.161.85.209.dnsbl-2.uceprotect.net): query failed (SERVFAIL)
> (71.161.85.209.dnsbl.sorbs.net): query failed (SERVFAIL)
> (71.161.85.209.bad.psky.me): query failed (SERVFAIL)
> (71.161.85.209.score.senderscore.com): query failed (SERVFAIL)
> (71.161.85.209.list.dnswl.org): query failed (SERVFAIL)
> (71.161.85.209.zz.countries.nerd.dk): query failed (SERVFAIL)
> (71.161.85.209.cidr.bl.mcafee.com): query failed (SERVFAIL)
> (71.161.85.209.bl.mailspike.net): query failed (SERVFAIL)
> (71.161.85.209.wl.mailspike.net): query failed (SERVFAIL)
> (71.161.85.209.db.wpbl.info): query failed (SERVFAIL)
> (71.161.85.209.sip.helpfulblacklist.xyz): query failed (SERVFAIL)
> (71.161.85.209.dnsbl-3.uceprotect.net): query failed (SERVFAIL)
> (71.161.85.209.backscatter.spameatingmonkey.net): query failed (SERVFAIL)
> (71.161.85.209.hostkarma.junkemailfilter.com): query failed (SERVFAIL)
> (71.161.85.209.bl.score.senderscore.com): query failed (SERVFAIL)
>
> When trying to resolve any of these manually, it just returns NXDOMAIN.
>
> See the entirety of the log here:
> https://pastebin.com/JpHCDdQs
>
> Each of the lines above also has a corresponding entry like this:
>
> 01-Sep-2018 23:31:06.701 query-errors: debug 2: fetch completed at
> ../../../lib/dns/resolver.c:3927 for 71.161.85.209.bad.psky.me/A in
> 10.78: timed out/success
> [domain:psky.me,referral:0,restart:4,qrysent:8,timeout:7,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
>
> I also isolated a packet with the "server failure" information, but
> I'm unable to figure out what the data means. Would someone be
> interested in evaluating it for me? It's a 146-byte pcap file.
> https://drive.google.com/open?id=1Ui893Lg61psZCR8I_9SJtNqs-Sil_br
>
> Thanks for any ideas.
> Alex
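One thing stands out in the quoted list: it is the same reversed client IP failing against essentially every RBL at once, which points at the resolver or the link rather than any single blacklist. Stripping the leading four reversed-IP labels makes the affected zones easy to eyeball. A quick sketch (the three sample lines and the /tmp path are mine, copied from the log above):

```shell
# List which RBL zones returned SERVFAIL by stripping the reversed-IP
# labels (71.161.85.209.) from each failing query name.
cat <<'EOF' > /tmp/servfail.log
(71.161.85.209.ubl.unsubscore.com): query failed (SERVFAIL)
(71.161.85.209.dnsbl.sorbs.net): query failed (SERVFAIL)
(71.161.85.209.hostkarma.junkemailfilter.com): query failed (SERVFAIL)
EOF
sed -E 's/^\(([0-9]+\.){4}([^)]+)\).*/\2/' /tmp/servfail.log
```

Piping the result through `sort -u` against the full log would give the complete set of RBL zones affected in one pass.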
Re: Frequent timeout
On Sat, Sep 1, 2018 at 11:25 PM Carl Byington wrote:
>
> On Fri, 2018-08-31 at 17:18 -0400, Alex wrote:
> > ../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in
>
> After 4 seconds, I get SERVFAIL on that name.

Thank you for your help. Perhaps I picked a bad example?

I happened to have a grep running against my current named.debug.log, and as I received your email, what I believe is a much more representative display of the problem occurred. I also have a packet capture below.

It's probably mangled posting it here, so I'll also put it on pastebin, but it's a rapid-fire display of a series of failed queries at once. I've cut out much of the info preceding and following to make it clearer here. These all occurred within 20ms of each other.

(71.161.85.209.ubl.unsubscore.com): query failed (SERVFAIL)
(71.161.85.209.dnsbl-2.uceprotect.net): query failed (SERVFAIL)
(71.161.85.209.dnsbl.sorbs.net): query failed (SERVFAIL)
(71.161.85.209.bad.psky.me): query failed (SERVFAIL)
(71.161.85.209.score.senderscore.com): query failed (SERVFAIL)
(71.161.85.209.list.dnswl.org): query failed (SERVFAIL)
(71.161.85.209.zz.countries.nerd.dk): query failed (SERVFAIL)
(71.161.85.209.cidr.bl.mcafee.com): query failed (SERVFAIL)
(71.161.85.209.bl.mailspike.net): query failed (SERVFAIL)
(71.161.85.209.wl.mailspike.net): query failed (SERVFAIL)
(71.161.85.209.db.wpbl.info): query failed (SERVFAIL)
(71.161.85.209.sip.helpfulblacklist.xyz): query failed (SERVFAIL)
(71.161.85.209.dnsbl-3.uceprotect.net): query failed (SERVFAIL)
(71.161.85.209.backscatter.spameatingmonkey.net): query failed (SERVFAIL)
(71.161.85.209.hostkarma.junkemailfilter.com): query failed (SERVFAIL)
(71.161.85.209.bl.score.senderscore.com): query failed (SERVFAIL)

When trying to resolve any of these manually, it just returns NXDOMAIN.
See the entirety of the log here:
https://pastebin.com/JpHCDdQs

Each of the lines above also has a corresponding entry like this:

01-Sep-2018 23:31:06.701 query-errors: debug 2: fetch completed at ../../../lib/dns/resolver.c:3927 for 71.161.85.209.bad.psky.me/A in 10.78: timed out/success [domain:psky.me,referral:0,restart:4,qrysent:8,timeout:7,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

I also isolated a packet with the "server failure" information, but I'm unable to figure out what the data means. Would someone be interested in evaluating it for me? It's a 146-byte pcap file.
https://drive.google.com/open?id=1Ui893Lg61psZCR8I_9SJtNqs-Sil_br

Thanks for any ideas.
Alex
Re: Frequent timeout
On Fri, 2018-08-31 at 17:18 -0400, Alex wrote:
> ../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in

After 4 seconds, I get SERVFAIL on that name.

> ../../../lib/dns/resolver.c:3927 for dell.ns.cloudflare.com/A in

That name resolves here very quickly.
Re: Frequent timeout
Hi, Alex--

On Aug 31, 2018, at 3:49 PM, Alex wrote:
> The interface does show some packet loss:
>
> br0: flags=4163 mtu 1500
> [ ... ]
> RX packets 1610535 bytes 963148307 (918.5 MiB)
> RX errors 0 dropped 5066 overruns 0 frame 0
>
> Is some packet loss such as the above to be expected? I recall doing
> some network tests some time ago and found much of it was IPv6
> traffic, which is not being used.

0.3% dropped packets is a bit unusual for a NIC running against a switch; it would be quite normal for a hub. However, Linux tends to also count various things like unknown VLAN tags and unknown protocols (i.e., IPv6 traffic on an IPv4-only host) as dropped RX packets. Supposedly ethtool -S helps distinguish between actual interface errors and traffic that your machine chooses to drop.

Regards,
--
-Chuck
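For reference, Chuck's 0.3% figure comes straight from the two RX counters in Alex's ifconfig output (5066 dropped out of 1610535 received). A trivial check:

```shell
# Dropped-RX percentage from the ifconfig counters quoted earlier.
awk 'BEGIN { rx = 1610535; dropped = 5066; printf "%.2f%%\n", 100 * dropped / rx }'
```

Anything much above a fraction of a percent on a switched NIC is worth breaking down further, e.g. with `ethtool -S` as suggested above, to see whether the drops are real errors or just traffic the host declines (IPv6, unknown VLANs, and so on).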
Re: Frequent timeout
Hi,

On Fri, Aug 31, 2018 at 5:54 PM Darcy, Kevin wrote:
>
> I'll second the use of tcpdump, and also add that DNS query traffic, using
> UDP by default, tends to be hypersensitive to packet loss. TCP will retry and
> folks may not even notice a slight drop in performance, but DNS queries,
> under the same conditions, can fail completely. Thus, DNS is often the
> "canary in the coal mine" for conditions which lead to packet loss, sometimes
> even an early warning of developing WAN and/or configuration issues.

Thanks so much for your help. I have some familiarity with tcpdump and will investigate. The interface does show some packet loss:

br0: flags=4163 mtu 1500
        inet 68.195.193.45 netmask 255.255.255.248 broadcast 68.195.193.47
        inet6 fe80::16da:e9ff:fe97:ab71 prefixlen 64 scopeid 0x20
        inet6 ::16da:e9ff:fe97:ab71 prefixlen 64 scopeid 0x0
        ether 14:da:e9:97:ab:71 txqueuelen 1000 (Ethernet)
        RX packets 1610535 bytes 963148307 (918.5 MiB)
        RX errors 0 dropped 5066 overruns 0 frame 0
        TX packets 1958053 bytes 1243814299 (1.1 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

# uptime
18:45:08 up 2:49, 1 user, load average: 0.46, 0.53, 0.66

Is some packet loss such as the above to be expected? I recall doing some network tests some time ago and found much of it was IPv6 traffic, which is not being used.

bind is running on localhost, so I will trace packets there, but what am I looking for to suspect it's a network problem? Will the normal tcpdump packet size defaults suffice, or should I be capturing larger amounts from each packet?

This is what I'll be doing for Labor Day weekend, so any help would really be appreciated. Cablevision/Optonline has told me there are no problems, but their tests aren't very thorough - if ping works and doesn't drop packets at that particular time, the link must be fine.

Thanks,
Alex

> - Kevin
>
> On Fri, Aug 31, 2018 at 5:36 PM John W.
Blue via bind-users wrote:
>>
>> tcpdump is your newest best friend to troubleshoot network issues. You need
>> to see what (if anything) is being placed on the wire and the responses (if
>> any). My goto syntax is:
>>
>> tcpdump -n -i eth0 port domain
>>
>> I like -n because it prevents a PTR lookup from happening. Why add extra
>> noise? As with anything troubleshooting related, it is a process of
>> elimination.
>>
>> Good hunting!
>>
>> John
>>
>> From: Alex
>> Sent: Friday, August 31, 2018 4:20 PM
>> To: bind-users@lists.isc.org
>> Subject: Frequent timeout
>>
>> Hi,
>>
>> Would someone please help me understand why I'm receiving so many
>> timeouts? This is on a fedora28 system with bind-9.11.4 acting as a
>> mail server and running on a cable modem.
>>
>> It appears to happen at all times, including when the link is
>> otherwise idle.
>>
>> 31-Aug-2018 16:52:57.297 query-errors: debug 2: fetch completed at
>> ../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in
>> 10.000171: timed out/success
>> [domain:support.coxbusiness.com,referral:2,restart:4,qrysent:5,timeout:4,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
>>
>> 31-Aug-2018 17:06:42.655 query-errors: debug 2: fetch completed at
>> ../../../lib/dns/resolver.c:3927 for dell.ns.cloudflare.com/A in
>> 10.000108: timed out/success
>> [domain:cloudflare.com,referral:0,restart:2,qrysent:13,timeout:12,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
>>
>> What more information can I provide to troubleshoot this?
>>
>> Is it possible that even though the link otherwise seems to be
>> operating okay that there could still be some problem that would
>> affect DNS traffic?
>>
>> I've also cleared all firewall rules, and it's not even all queries which fail.
>> Thanks,
>> Alex
Re: Frequent timeout
I'll second the use of tcpdump, and also add that DNS query traffic, using UDP by default, tends to be hypersensitive to packet loss. TCP will retry, and folks may not even notice a slight drop in performance, but DNS queries, under the same conditions, can fail completely. Thus, DNS is often the "canary in the coal mine" for conditions which lead to packet loss, sometimes even an early warning of developing WAN and/or configuration issues.

- Kevin

On Fri, Aug 31, 2018 at 5:36 PM John W. Blue via bind-users <bind-users@lists.isc.org> wrote:

> tcpdump is your newest best friend to troubleshoot network issues. You
> need to see what (if anything) is being placed on the wire and the
> responses (if any). My goto syntax is:
>
> tcpdump -n -i eth0 port domain
>
> I like -n because it prevents a PTR lookup from happening. Why add extra
> noise? As with anything troubleshooting related, it is a process of
> elimination.
>
> Good hunting!
>
> John
>
> From: Alex
> Sent: Friday, August 31, 2018 4:20 PM
> To: bind-users@lists.isc.org
> Subject: Frequent timeout
>
> Hi,
>
> Would someone please help me understand why I'm receiving so many
> timeouts? This is on a fedora28 system with bind-9.11.4 acting as a
> mail server and running on a cable modem.
>
> It appears to happen at all times, including when the link is
> otherwise idle.
> 31-Aug-2018 16:52:57.297 query-errors: debug 2: fetch completed at
> ../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in
> 10.000171: timed out/success
> [domain:support.coxbusiness.com,referral:2,restart:4,qrysent:5,timeout:4,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
>
> 31-Aug-2018 17:06:42.655 query-errors: debug 2: fetch completed at
> ../../../lib/dns/resolver.c:3927 for dell.ns.cloudflare.com/A in
> 10.000108: timed out/success
> [domain:cloudflare.com,referral:0,restart:2,qrysent:13,timeout:12,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
>
> What more information can I provide to troubleshoot this?
>
> Is it possible that even though the link otherwise seems to be
> operating okay that there could still be some problem that would
> affect DNS traffic?
>
> I've also cleared all firewall rules, and it's not even all queries which fail.
>
> Thanks,
> Alex
Re: Frequent timeout
tcpdump is your newest best friend to troubleshoot network issues. You need to see what (if anything) is being placed on the wire and the responses (if any). My goto syntax is:

tcpdump -n -i eth0 port domain

I like -n because it prevents a PTR lookup from happening. Why add extra noise? As with anything troubleshooting related, it is a process of elimination.

Good hunting!

John

From: Alex
Sent: Friday, August 31, 2018 4:20 PM
To: bind-users@lists.isc.org
Subject: Frequent timeout

Hi,

Would someone please help me understand why I'm receiving so many timeouts? This is on a fedora28 system with bind-9.11.4 acting as a mail server and running on a cable modem.

It appears to happen at all times, including when the link is otherwise idle.

31-Aug-2018 16:52:57.297 query-errors: debug 2: fetch completed at ../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in 10.000171: timed out/success [domain:support.coxbusiness.com,referral:2,restart:4,qrysent:5,timeout:4,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

31-Aug-2018 17:06:42.655 query-errors: debug 2: fetch completed at ../../../lib/dns/resolver.c:3927 for dell.ns.cloudflare.com/A in 10.000108: timed out/success [domain:cloudflare.com,referral:0,restart:2,qrysent:13,timeout:12,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

What more information can I provide to troubleshoot this?

Is it possible that even though the link otherwise seems to be operating okay that there could still be some problem that would affect DNS traffic?

I've also cleared all firewall rules, and it's not even all queries which fail.
Thanks,
Alex
Frequent timeout
Hi,

Would someone please help me understand why I'm receiving so many timeouts? This is on a fedora28 system with bind-9.11.4 acting as a mail server and running on a cable modem.

It appears to happen at all times, including when the link is otherwise idle.

31-Aug-2018 16:52:57.297 query-errors: debug 2: fetch completed at ../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in 10.000171: timed out/success [domain:support.coxbusiness.com,referral:2,restart:4,qrysent:5,timeout:4,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

31-Aug-2018 17:06:42.655 query-errors: debug 2: fetch completed at ../../../lib/dns/resolver.c:3927 for dell.ns.cloudflare.com/A in 10.000108: timed out/success [domain:cloudflare.com,referral:0,restart:2,qrysent:13,timeout:12,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

What more information can I provide to troubleshoot this?

Is it possible that even though the link otherwise seems to be operating okay that there could still be some problem that would affect DNS traffic?

I've also cleared all firewall rules, and it's not even all queries which fail.

Thanks,
Alex
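The two query-errors entries above already quantify the failure: of the qrysent upstream queries BIND issued for each fetch, almost every one hit the timeout counter (4 of 5, then 12 of 13). A throwaway extraction of those counters, assuming the bracketed stats format shown in the log lines (the /tmp path is mine):

```shell
# Pull domain/qrysent/timeout out of BIND query-errors stats lines to
# see how many upstream queries went unanswered per fetch.
cat <<'EOF' > /tmp/qerr.log
[domain:support.coxbusiness.com,referral:2,restart:4,qrysent:5,timeout:4,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
[domain:cloudflare.com,referral:0,restart:2,qrysent:13,timeout:12,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
EOF
awk '{
  match($0, /domain:[^,]+/);   d = substr($0, RSTART + 7, RLENGTH - 7)
  match($0, /qrysent:[0-9]+/); s = substr($0, RSTART + 8, RLENGTH - 8)
  match($0, /timeout:[0-9]+/); t = substr($0, RSTART + 8, RLENGTH - 8)
  printf "%s: %d of %d queries timed out\n", d, t, s
}' /tmp/qerr.log
```

Run against a whole named.debug.log, a summary like this makes it obvious whether timeouts are concentrated on a few zones (a remote-server problem) or spread across everything (a local link or resolver problem).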