RE: Frequent timeout

2018-09-11 Thread John W. Blue
I will walk back my previous comments and just say that bandwidth may be in 
play because anytime you soak a circuit it is not good.

Take a look at this query sequence:

dns.qry.type == 28 && dns.qry.name == concured.co

Packet 42356 shows an AAAA query for concurred.co.
Packets 42357/8 show 68.195.193.45 relaying the query to 62.138.132.21.
Packets 43015/16 show 62.138.132.21 replying with its query response to 
68.195.193.45.

And that's it.  Nothing is seen being sent back to 127.0.0.1, at least on the 
wire.  By way of comparison, packet 161 shows 127.0.0.1 answering itself, so I 
would consider the earlier lack of a response a clue.
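(For anyone following along without the Wireshark GUI, something along these 
lines should pull the same packets with tshark; the capture file name is just a 
placeholder for your pcap.)

tshark -r capture.pcap -Y 'dns.qry.type == 28 && dns.qry.name == "concured.co"'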

Moving on:

Packet 48874 shows 127.0.0.1 asking for an AAAA record again.
This time we don’t see any external communication.
Packet 87174 shows 127.0.0.1 replying with server failure.

It took nearly 25 seconds to decide upon a SERVFAIL and that is another clue.

That said, there are heaps of queries where DNS worked as expected.  I really had 
to dig for the above examples because for the vast majority of the server 
failure messages we either do not see a reply on localhost or do not see the 
routable adapter on the server attempting to reach out to get the answer.  
concurred.co is unique in that we actually see the attempt to reach out rather 
than no attempt at all.

If the traffic that 127.0.0.1 is putting on the wire does not go out, I am 
thinking firewall, but you may be dealing with bandwidth exhaustion exclusively 
and it is presenting itself in this manner.  Or you may have a server 
configuration issue or a server that is underpowered.

Sometimes pcaps are black and white and give you a "here is your problem" 
answer; other times, like this one, they do not give us anything conclusive to 
work with.  Since this server is sputtering around, I would first set about 
stabilizing traffic from 127.0.0.1 going out.  You need to see outbound traffic 
hit 127.0.0.1 and then hit your external adapter without missing.  
Boom, boom, boom on down the line.
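(A quick way to check that follow-through, sketched here with assumed interface 
names: watch both sides at once and trigger a single lookup.  If the query shows 
up on lo but never on the external adapter, that points at the host itself 
rather than the circuit.)

tcpdump -n -i lo port domain       # terminal 1: the loopback side
tcpdump -n -i eth0 port domain     # terminal 2: the routable adapter (or br0)
dig @127.0.0.1 example.com A       # terminal 3: trigger one lookup through the local resolver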

Hopefully others have better, more insightful suggestions.

Good hunting!

John

-Original Message-
From: Alex [mailto:mysqlstud...@gmail.com] 
Sent: Tuesday, September 11, 2018 1:57 PM
To: John W. Blue; bind-users@lists.isc.org
Subject: Re: Frequent timeout

Hi,

On Tue, Sep 11, 2018 at 2:47 PM John W. Blue  wrote:
>
> If you use wireshark to slice n dice the pcap .. "dns.flags.rcode == 2" shows 
> all of your SERVFAIL happens on localhost.
>
> If you switch to "dns.qry.name == storage.pardot.com" every single query is 
> localhost.
>
> Unless you have another NIC that you are sending traffic over this does not 
> look like a bandwidth issue at this particular point in time.

Thanks so much. I think I may also have confused things by suggesting it was 
related to bandwidth or utilization. I now see it happening more regularly too.

Can you ascertain why it is reporting these SERVFAILs?

The queries are on localhost because /etc/resolv.conf lists localhost as the 
nameserver. Is that why we can't diagnose this? This most recent packet trace 
was started with "-i any". Why would the ones on localhost be the ones which 
are failing? I'm assuming postfix and/or some other process is querying bind on 
localhost to cause these errors?


Re: Frequent timeout

2018-09-11 Thread Alex
Hi,

On Tue, Sep 11, 2018 at 2:47 PM John W. Blue  wrote:
>
> If you use wireshark to slice n dice the pcap .. "dns.flags.rcode == 2" shows 
> all of your SERVFAIL happens on localhost.
>
> If you switch to "dns.qry.name == storage.pardot.com" every single query is 
> localhost.
>
> Unless you have another NIC that you are sending traffic over this does not 
> look like a bandwidth issue at this particular point in time.

Thanks so much. I think I may also have confused things by suggesting
it was related to bandwidth or utilization. I now see it happening more
regularly too.

Can you ascertain why it is reporting these SERVFAILs?

The queries are on localhost because /etc/resolv.conf lists localhost
as the nameserver. Is that why we can't diagnose this? This most
recent packet trace was started with "-i any". Why would the ones on
localhost be the ones which are failing? I'm assuming postfix and/or
some other process is querying bind on localhost to cause these
errors?
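(One quick check along those lines, with an assumed outside resolver purely for 
comparison: query the local BIND directly and then the outside resolver for the 
same name, and see whether only the local one fails or times out.)

dig @127.0.0.1 storage.pardot.com A
dig @8.8.8.8 storage.pardot.com A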


RE: Frequent timeout

2018-09-11 Thread John W. Blue
If you use wireshark to slice n dice the pcap .. "dns.flags.rcode == 2" shows 
all of your SERVFAIL happens on localhost.

If you switch to "dns.qry.name == storage.pardot.com" every single query is 
localhost.

Unless you have another NIC that you are sending traffic over this does not 
look like a bandwidth issue at this particular point in time.

John

-Original Message-
From: Alex [mailto:mysqlstud...@gmail.com] 
Sent: Tuesday, September 11, 2018 1:19 PM
To: bind-users@lists.isc.org; John W. Blue
Subject: Re: Frequent timeout

Hi,

Here is a much more reasonable network capture during the period where there 
are numerous SERVFAIL errors from bind over a short period of high utilization.
https://drive.google.com/file/d/1UrzvB-pumVjPvlmd6ZSnHi-XVynI8y3y/view?usp=sharing

This is when our 20Mbps cable upstream link was saturated, which resulted in DNS 
query timeouts and these SERVFAIL messages.

The packet trace shows multiple TCP out-of-order and TCP Dup ACK packets. Would 
these retransmits cause enough of a delay for the queries to fail?

Would someone more knowledgeable look into these packet errors for me?

It might seem obvious that we should increase the bandwidth of our link, since 
it occurs during periods of high utilization, but it doesn't occur on our other 
10Mbps DIA links in the datacenter when the link is saturated.

11-Sep-2018 11:53:25.692 query-errors: info: client @0x7fc7ef343740
127.0.0.1#50821
(8cb54bfffc54eee06342d5619246d67166abc6cf.ebl.msbl.org): query failed
(SERVFAIL) for 8cb54bfffc54eee06342d5619246d67166abc6cf.ebl.msbl.org/IN/A
at ../../../bin/named/query.c:8580

11-Sep-2018 11:53:25.687 query-errors: debug 2: fetch completed at
../../../lib/dns/resolver.c:3927 for
ac949d5d947f8f5cad13e98c68bac6f284c367fd.ebl.msbl.org/A in 30.84:
timed out/success
[domain:ebl.msbl.org,referral:0,restart:6,qrysent:11,timeout:10,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

Thanks,
Alex

On Mon, Sep 10, 2018 at 12:11 PM Alex  wrote:
>
> Hi,
>
> > >> tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap
> > >>
> > >> You don't need all of the extra stuff because -s0 captures the full 
> > >> packet.
> >
> > On 06.09.18 18:42, Alex wrote:
> > >This is the command I ran to produce the pcap file I sent:
> > >
> > ># tcpdump -s0 -vv -i eth0 -nn -w domain-capture-eth0-090518.pcap 
> > >udp dst port domain
> >
> > and that is the problem. "dst port domain" captures packets going to 
> > DNS servers, not responses coming back.
> >
> > "-vv" and "-nn" are useless when producing packet capture and "-s0" 
> > is default for some time. I often add "-U" so file is flushed wich each 
> > packet.
> >
> > you can strip incoming queries by using filter
> >
> > "(src host 68.195.XXX.45 and dst port domain) or (src port domain and dst 
> > host 68.195.XXX.45)"
>
> I've generated a new tcpdump file using these criteria and uploaded it here:
> https://drive.google.com/file/d/1F0VML8yPZJbcDZTys2hXDhjzv1UaBHuV/view
> ?usp=sharing
>
> The SERVFAIL errors didn't really occur over the weekend. I believe it 
> has something to do with mail volume, link congestion/bandwidth 
> utilization.
>
> Thanks,
> Alex
>
>
>


Re: Frequent timeout

2018-09-11 Thread Carl Byington
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA512

On Tue, 2018-09-11 at 14:19 -0400, Alex wrote:
> This is when our 20Mbps cable upstream link was saturated, which resulted
> in DNS query timeouts and these SERVFAIL messages.

Not specific to dns, but this looks like a bufferbloat problem, which is
common with cable modems. When the upstream link is saturated, the
buffers in the interface device (cable modem or possibly a standalone
router) become full. If there is a lot of buffer space, the latency
becomes very large, and that will cause many problems, including issues
with dns. A partial fix is to prioritize small packets like dns queries
and tcp acks, so they don't wait behind a large queue of full size
packets. A more complete fix is switching to fq-codel queue discipline.

google for bufferbloat for more details.
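(A minimal sketch of that fq_codel switch using iproute2's tc; the interface 
name is assumed, and on a cable link you generally also want a shaper set just 
below the modem's upstream rate, otherwise the queue still builds inside the 
modem.)

tc qdisc show dev eth0                    # see the current queueing discipline
tc qdisc replace dev eth0 root fq_codel   # switch the root qdisc to fq_codel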


-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEAREKAAYFAluYDHMACgkQL6j7milTFsEqXwCffaR+fwcqpoEHPisw86Q49+Kw
o0cAn0Q5LV1FXk2r1fiTqYZIlsa9xH3s
=yp3H
-END PGP SIGNATURE-




Re: Frequent timeout

2018-09-11 Thread Alex
Hi,

Here is a much more reasonable network capture during the period where
there are numerous SERVFAIL errors from bind over a short period of
high utilization.
https://drive.google.com/file/d/1UrzvB-pumVjPvlmd6ZSnHi-XVynI8y3y/view?usp=sharing

This is when our 20Mbps cable upstream link was saturated, which resulted
in DNS query timeouts and these SERVFAIL messages.

The packet trace shows multiple TCP out-of-order and TCP Dup ACK
packets. Would these retransmits cause enough of a delay for the
queries to fail?

Would someone more knowledgeable look into these packet errors for me?

It might seem obvious that we should increase the bandwidth of our
link, since it occurs during periods of high utilization, but it
doesn't occur on our other 10Mbps DIA links in the datacenter when the
link is saturated.

11-Sep-2018 11:53:25.692 query-errors: info: client @0x7fc7ef343740
127.0.0.1#50821
(8cb54bfffc54eee06342d5619246d67166abc6cf.ebl.msbl.org): query failed
(SERVFAIL) for 8cb54bfffc54eee06342d5619246d67166abc6cf.ebl.msbl.org/IN/A
at ../../../bin/named/query.c:8580

11-Sep-2018 11:53:25.687 query-errors: debug 2: fetch completed at
../../../lib/dns/resolver.c:3927 for
ac949d5d947f8f5cad13e98c68bac6f284c367fd.ebl.msbl.org/A in 30.84:
timed out/success
[domain:ebl.msbl.org,referral:0,restart:6,qrysent:11,timeout:10,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

Thanks,
Alex

On Mon, Sep 10, 2018 at 12:11 PM Alex  wrote:
>
> Hi,
>
> > >> tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap
> > >>
> > >> You don't need all of the extra stuff because -s0 captures the full 
> > >> packet.
> >
> > On 06.09.18 18:42, Alex wrote:
> > >This is the command I ran to produce the pcap file I sent:
> > >
> > ># tcpdump -s0 -vv -i eth0 -nn -w domain-capture-eth0-090518.pcap udp
> > >dst port domain
> >
> > and that is the problem. "dst port domain" captures packets going to DNS
> > servers, not responses coming back.
> >
> > "-vv" and "-nn" are useless when producing packet capture and "-s0" is
> > default for some time. I often add "-U" so file is flushed wich each packet.
> >
> > you can strip incoming queries by using filter
> >
> > "(src host 68.195.XXX.45 and dst port domain) or (src port domain and dst 
> > host 68.195.XXX.45)"
>
> I've generated a new tcpdump file using these criteria and uploaded it here:
> https://drive.google.com/file/d/1F0VML8yPZJbcDZTys2hXDhjzv1UaBHuV/view?usp=sharing
>
> The SERVFAIL errors didn't really occur over the weekend. I believe it
> has something to do with mail volume, link congestion/bandwidth
> utilization.
>
> Thanks,
> Alex
>
>
>


Re: Frequent timeout

2018-09-10 Thread Alex
Hi,

> >> tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap
> >>
> >> You don't need all of the extra stuff because -s0 captures the full packet.
>
> On 06.09.18 18:42, Alex wrote:
> >This is the command I ran to produce the pcap file I sent:
> >
> ># tcpdump -s0 -vv -i eth0 -nn -w domain-capture-eth0-090518.pcap udp
> >dst port domain
>
> and that is the problem. "dst port domain" captures packets going to DNS
> servers, not responses coming back.
>
> "-vv" and "-nn" are useless when producing packet capture and "-s0" is
> default for some time. I often add "-U" so file is flushed wich each packet.
>
> you can strip incoming queries by using filter
>
> "(src host 68.195.XXX.45 and dst port domain) or (src port domain and dst 
> host 68.195.XXX.45)"

I've generated a new tcpdump file using these criteria and uploaded it here:
https://drive.google.com/file/d/1F0VML8yPZJbcDZTys2hXDhjzv1UaBHuV/view?usp=sharing

The SERVFAIL errors didn't really occur over the weekend. I believe it
has something to do with mail volume, link congestion/bandwidth
utilization.

Thanks,
Alex



>
> >I should also mention that, while eth0 is the physical device, there
> >is a bridge set up to support virtual machines (none of which were
> >active). Hopefully that's not the reason! (real IP obscured).
>
> not the reason, but using "-i br0" could be safer then.
>
> Note that the IP was seen in the packet capture you published, so there is no
> need to hide it now.
>
> --
> Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
> Warning: I wish NOT to receive e-mail advertising to this address.
> Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
> They that can give up essential liberty to obtain a little temporary
> safety deserve neither liberty nor safety. -- Benjamin Franklin, 1759


Re: Frequent timeout

2018-09-07 Thread Matus UHLAR - fantomas

On Thu, Sep 6, 2018 at 5:56 PM John W. Blue  wrote:

So that file is full of nothing but queries and no responses which, sadly, is 
useless.

Run:

tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap

You don't need all of the extra stuff because -s0 captures the full packet.


On 06.09.18 18:42, Alex wrote:

This is the command I ran to produce the pcap file I sent:

# tcpdump -s0 -vv -i eth0 -nn -w domain-capture-eth0-090518.pcap udp
dst port domain


and that is the problem. "dst port domain" captures packets going to DNS
servers, not responses coming back.

"-vv" and "-nn" are useless when producing packet capture and "-s0" is
default for some time. I often add "-U" so file is flushed wich each packet.

you can strip incoming queries by using filter

"(src host 68.195.XXX.45 and dst port domain) or (src port domain and dst host 
68.195.XXX.45)"
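(Putting those pieces together into one command, using the bridge interface 
mentioned below and the address already visible earlier in the thread; the 
output path is only an example.)

tcpdump -s0 -n -U -i br0 -w /tmp/domaincapture.pcap \
  '(src host 68.195.193.45 and dst port domain) or (src port domain and dst host 68.195.193.45)'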


I should also mention that, while eth0 is the physical device, there
is a bridge set up to support virtual machines (none of which were
active). Hopefully that's not the reason! (real IP obscured).


not the reason, but using "-i br0" could be safer then.

Note that the IP was seen in the packet capture you published, so there is no need
to hide it now.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety. -- Benjamin Franklin, 1759


Re: Frequent timeout

2018-09-06 Thread Alex
On Thu, Sep 6, 2018 at 5:56 PM John W. Blue  wrote:
>
> So that file is full of nothing but queries and no responses which, sadly, is 
> useless.
>
> Run:
>
> tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap
>
> You don't need all of the extra stuff because -s0 captures the full packet.

This is the command I ran to produce the pcap file I sent:

# tcpdump -s0 -vv -i eth0 -nn -w domain-capture-eth0-090518.pcap udp
dst port domain

I have a few other pcap files here. Can you tell me the query you ran
in wireshark to search for the SERVFAIL packets? Perhaps I can find
them here. I have another that I just realized was running for quite a
while and has grown to 1.5GB until I just stopped it. I also have
another that was run with "-i any", but it's also quite large.

I'd otherwise probably have to wait until tomorrow to run it again, as
it appears to happen during periods of high traffic.

I should also mention that, while eth0 is the physical device, there
is a bridge set up to support virtual machines (none of which were
active). Hopefully that's not the reason! (real IP obscured).

br0: flags=4163  mtu 1500
inet 68.195.XXX.45  netmask 255.255.255.248  broadcast 68.195.XXX.47
inet6 fe80::16da:e9ff:fe97:ab71  prefixlen 64  scopeid 0x20
inet6 ::16da:e9ff:fe97:ab71  prefixlen 64  scopeid 0x0
ether 14:da:e9:97:ab:71  txqueuelen 1000  (Ethernet)
RX packets 54953236  bytes 45182800578 (42.0 GiB)
RX errors 0  dropped 231612  overruns 0  frame 0
TX packets 68345276  bytes 33687959055 (31.3 GiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163  mtu 1500
inet6 fe80::16da:e9ff:fe97:ab71  prefixlen 64  scopeid 0x20
ether 14:da:e9:97:ab:71  txqueuelen 1000  (Ethernet)
RX packets 61078845  bytes 46596159121 (43.3 GiB)
RX errors 0  dropped 0  overruns 0  frame 0
TX packets 68733541  bytes 34028363069 (31.6 GiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
device interrupt 16  memory 0xdf20-df22

Thanks,
Alex


>
> John
>
> -Original Message-
> From: bind-users [mailto:bind-users-boun...@lists.isc.org] On Behalf Of Alex
> Sent: Thursday, September 06, 2018 2:54 PM
> To: bind-users@lists.isc.org
> Subject: Re: Frequent timeout
>
> On Thu, Sep 6, 2018 at 3:05 PM John W. Blue  wrote:
> >
> > Alex,
> >
> > Have you uploaded this pcap with the SERVFAIL's?  I didn't have time to 
> > look at your first upload but can review this one.
>
> Thanks very much. I've uploaded the pcap file here. It's about ~100MB 
> compressed, and represents about 4hrs of data, I believe.
> https://drive.google.com/file/d/1KUpDoQ2zuz5ITeKuO0BhlK7JvWSUAG3B/view?usp=sharing
>
> Thanks,
> Alex


RE: Frequent timeout

2018-09-06 Thread John W. Blue
So that file is full of nothing but queries and no responses which, sadly, is 
useless.

Run:

tcpdump -s0 -n -i eth0 port domain -w /tmp/domaincapture.pcap

You don't need all of the extra stuff because -s0 captures the full packet.

John

-Original Message-
From: bind-users [mailto:bind-users-boun...@lists.isc.org] On Behalf Of Alex
Sent: Thursday, September 06, 2018 2:54 PM
To: bind-users@lists.isc.org
Subject: Re: Frequent timeout

On Thu, Sep 6, 2018 at 3:05 PM John W. Blue  wrote:
>
> Alex,
>
> Have you uploaded this pcap with the SERVFAIL's?  I didn't have time to look 
> at your first upload but can review this one.

Thanks very much. I've uploaded the pcap file here. It's about ~100MB 
compressed, and represents about 4hrs of data, I believe.
https://drive.google.com/file/d/1KUpDoQ2zuz5ITeKuO0BhlK7JvWSUAG3B/view?usp=sharing

Thanks,
Alex


Re: Frequent timeout

2018-09-06 Thread Alex
On Thu, Sep 6, 2018 at 3:05 PM John W. Blue  wrote:
>
> Alex,
>
> Have you uploaded this pcap with the SERVFAIL's?  I didn't have time to look 
> at your first upload but can review this one.

Thanks very much. I've uploaded the pcap file here. It's about ~100MB
compressed, and represents about 4hrs of data, I believe.
https://drive.google.com/file/d/1KUpDoQ2zuz5ITeKuO0BhlK7JvWSUAG3B/view?usp=sharing

Thanks,
Alex



>
> John
>
> -Original Message-
> From: bind-users [mailto:bind-users-boun...@lists.isc.org] On Behalf Of Alex
> Sent: Thursday, September 06, 2018 1:49 PM
> To: c...@byington.org; bind-users@lists.isc.org
> Subject: Re: Frequent timeout
>
> Hi,
>
> On Mon, Sep 3, 2018 at 12:45 PM Carl Byington  wrote:
> >
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA512
> >
> > On Sun, 2018-09-02 at 21:54 -0400, Alex wrote:
> > > Do you have any other ideas on how I can isolate this problem?
> >
> > Run tcpdump on the external ethernet connection.
> >
> > tcpdump -s0 -vv -i %s -nn -w /tmp/outputfile udp dst port domain
>
> I've captured some packets that I believe include the packets relating to the 
> SERVFAIL errors I've been receiving. Now I have to figure out how to go 
> through them.
>
> In the meantime, I've configured /etc/resolv.conf to send queries to a remote 
> system of ours, and the errors have (mostly) stopped.
>
> I also notice some traces take an abnormal amount of time. Ping times to 
> google.com are less than 20ms, but this trace shows reaching the root servers 
> takes 104ms:
>
> # dig +trace +nodnssec google.com
>
> ; <<>> DiG 9.11.4-P1-RedHat-9.11.4-5.P1.fc28 <<>> +trace +nodnssec google.com 
> ;; global options: +cmd
> .   3451IN  NS  g.root-servers.net.
> .   3451IN  NS  k.root-servers.net.
> .   3451IN  NS  j.root-servers.net.
> .   3451IN  NS  c.root-servers.net.
> .   3451IN  NS  i.root-servers.net.
> .   3451IN  NS  e.root-servers.net.
> .   3451IN  NS  m.root-servers.net.
> .   3451IN  NS  l.root-servers.net.
> .   3451IN  NS  a.root-servers.net.
> .   3451IN  NS  h.root-servers.net.
> .   3451IN  NS  b.root-servers.net.
> .   3451IN  NS  d.root-servers.net.
> .   3451IN  NS  f.root-servers.net.
> ;; Received 839 bytes from 127.0.0.1#53(127.0.0.1) in 0 ms
>
> com.172800  IN  NS  h.gtld-servers.net.
> com.172800  IN  NS  g.gtld-servers.net.
> com.172800  IN  NS  b.gtld-servers.net.
> com.172800  IN  NS  j.gtld-servers.net.
> com.172800  IN  NS  f.gtld-servers.net.
> com.172800  IN  NS  m.gtld-servers.net.
> com.172800  IN  NS  c.gtld-servers.net.
> com.172800  IN  NS  d.gtld-servers.net.
> com.172800  IN  NS  k.gtld-servers.net.
> com.172800  IN  NS  i.gtld-servers.net.
> com.172800  IN  NS  l.gtld-servers.net.
> com.172800  IN  NS  a.gtld-servers.net.
> com.172800  IN  NS  e.gtld-servers.net.
> ;; Received 835 bytes from 202.12.27.33#53(m.root-servers.net) in 104 ms
>
> google.com. 172800  IN  NS  ns2.google.com.
> google.com. 172800  IN  NS  ns1.google.com.
> google.com. 172800  IN  NS  ns3.google.com.
> google.com. 172800  IN  NS  ns4.google.com.
> ;; Received 287 bytes from 192.33.14.30#53(b.gtld-servers.net) in 44 ms
>
> ;; expected opt record in response
> google.com. 300 IN  A   172.217.10.14
> ;; Received 44 bytes from 216.239.36.10#53(ns3.google.com) in 29 ms
>
> Running the same trace again showed 129ms.
>
> I also located this warning:
> 06-Sep-2018 12:03:33.304 client: warning: client @0x7f502c1d3d50
> 127.0.0.1#60968 (cmail20.com.multi.surbl.org): recursive-clients soft limit 
> exceeded (901/900/1000), aborting oldest query
>
> I've increased recursive-clients to 2500 but the SERVFAIL errors continue.
>
> There are also a ton of lame-server entries, many of which are related to one 
> RBL or another, as part of my postscreen config:
> 06-Se

RE: Frequent timeout

2018-09-06 Thread John W. Blue
Alex,

Have you uploaded this pcap with the SERVFAIL's?  I didn't have time to look at 
your first upload but can review this one.

John

-Original Message-
From: bind-users [mailto:bind-users-boun...@lists.isc.org] On Behalf Of Alex
Sent: Thursday, September 06, 2018 1:49 PM
To: c...@byington.org; bind-users@lists.isc.org
Subject: Re: Frequent timeout

Hi,

On Mon, Sep 3, 2018 at 12:45 PM Carl Byington  wrote:
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA512
>
> On Sun, 2018-09-02 at 21:54 -0400, Alex wrote:
> > Do you have any other ideas on how I can isolate this problem?
>
> Run tcpdump on the external ethernet connection.
>
> tcpdump -s0 -vv -i %s -nn -w /tmp/outputfile udp dst port domain

I've captured some packets that I believe include the packets relating to the 
SERVFAIL errors I've been receiving. Now I have to figure out how to go through 
them.

In the meantime, I've configured /etc/resolv.conf to send queries to a remote 
system of ours, and the errors have (mostly) stopped.

I also notice some traces take an abnormal amount of time. Ping times to 
google.com are less than 20ms, but this trace shows reaching the root servers 
takes 104ms:

# dig +trace +nodnssec google.com

; <<>> DiG 9.11.4-P1-RedHat-9.11.4-5.P1.fc28 <<>> +trace +nodnssec google.com 
;; global options: +cmd
.   3451IN  NS  g.root-servers.net.
.   3451IN  NS  k.root-servers.net.
.   3451IN  NS  j.root-servers.net.
.   3451IN  NS  c.root-servers.net.
.   3451IN  NS  i.root-servers.net.
.   3451IN  NS  e.root-servers.net.
.   3451IN  NS  m.root-servers.net.
.   3451IN  NS  l.root-servers.net.
.   3451IN  NS  a.root-servers.net.
.   3451IN  NS  h.root-servers.net.
.   3451IN  NS  b.root-servers.net.
.   3451IN  NS  d.root-servers.net.
.   3451IN  NS  f.root-servers.net.
;; Received 839 bytes from 127.0.0.1#53(127.0.0.1) in 0 ms

com.172800  IN  NS  h.gtld-servers.net.
com.172800  IN  NS  g.gtld-servers.net.
com.172800  IN  NS  b.gtld-servers.net.
com.172800  IN  NS  j.gtld-servers.net.
com.172800  IN  NS  f.gtld-servers.net.
com.172800  IN  NS  m.gtld-servers.net.
com.172800  IN  NS  c.gtld-servers.net.
com.172800  IN  NS  d.gtld-servers.net.
com.172800  IN  NS  k.gtld-servers.net.
com.172800  IN  NS  i.gtld-servers.net.
com.172800  IN  NS  l.gtld-servers.net.
com.172800  IN  NS  a.gtld-servers.net.
com.172800  IN  NS  e.gtld-servers.net.
;; Received 835 bytes from 202.12.27.33#53(m.root-servers.net) in 104 ms

google.com. 172800  IN  NS  ns2.google.com.
google.com. 172800  IN  NS  ns1.google.com.
google.com. 172800  IN  NS  ns3.google.com.
google.com. 172800  IN  NS  ns4.google.com.
;; Received 287 bytes from 192.33.14.30#53(b.gtld-servers.net) in 44 ms

;; expected opt record in response
google.com. 300 IN  A   172.217.10.14
;; Received 44 bytes from 216.239.36.10#53(ns3.google.com) in 29 ms

Running the same trace again showed 129ms.

I also located this warning:
06-Sep-2018 12:03:33.304 client: warning: client @0x7f502c1d3d50
127.0.0.1#60968 (cmail20.com.multi.surbl.org): recursive-clients soft limit 
exceeded (901/900/1000), aborting oldest query

I've increased recursive-clients to 2500 but the SERVFAIL errors continue.

There are also a ton of lame-server entries, many of which are related to one 
RBL or another, as part of my postscreen config:
06-Sep-2018 13:16:50.686 lame-servers: info: connection refused resolving 
'48.167.85.209.zz.countries.nerd.dk/A/IN': 195.182.36.121#53
06-Sep-2018 13:16:50.706 lame-servers: info: connection refused resolving 
'48.167.85.209.bb.barracudacentral.org/A/IN':
64.235.154.72#53
06-Sep-2018 13:16:51.308 lame-servers: info: connection refused resolving 
'48.167.85.209.bl.blocklist.de/A/IN': 185.21.103.31#53
06-Sep-2018 13:16:54.798 lame-servers: info: connection refused resolving 
'e51dd24f684d212a7da1119b23603b0f.generic.ixhash.net/A/IN':
178.254.39.16#53
06-Sep-2018 13:16:54.799 lame-servers: info: connection refused resolving 
'f4d997d8949e6dbd30f6a418ad364589.generic.ixhash.net/A/IN':
178.254.39.16#53
06-Sep-2018 13:16:55.762 lame-se

Re: Frequent timeout

2018-09-06 Thread Alex
Hi,

On Mon, Sep 3, 2018 at 12:45 PM Carl Byington  wrote:
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA512
>
> On Sun, 2018-09-02 at 21:54 -0400, Alex wrote:
> > Do you have any other ideas on how I can isolate this problem?
>
> Run tcpdump on the external ethernet connection.
>
> tcpdump -s0 -vv -i %s -nn -w /tmp/outputfile udp dst port domain

I've captured some packets that I believe include the packets relating
to the SERVFAIL errors I've been receiving. Now I have to figure out
how to go through them.

In the meantime, I've configured /etc/resolv.conf to send queries to a
remote system of ours, and the errors have (mostly) stopped.

I also notice some traces take an abnormal amount of time. Ping times
to google.com are less than 20ms, but this trace shows reaching the
root servers takes 104ms:

# dig +trace +nodnssec google.com

; <<>> DiG 9.11.4-P1-RedHat-9.11.4-5.P1.fc28 <<>> +trace +nodnssec google.com
;; global options: +cmd
.   3451IN  NS  g.root-servers.net.
.   3451IN  NS  k.root-servers.net.
.   3451IN  NS  j.root-servers.net.
.   3451IN  NS  c.root-servers.net.
.   3451IN  NS  i.root-servers.net.
.   3451IN  NS  e.root-servers.net.
.   3451IN  NS  m.root-servers.net.
.   3451IN  NS  l.root-servers.net.
.   3451IN  NS  a.root-servers.net.
.   3451IN  NS  h.root-servers.net.
.   3451IN  NS  b.root-servers.net.
.   3451IN  NS  d.root-servers.net.
.   3451IN  NS  f.root-servers.net.
;; Received 839 bytes from 127.0.0.1#53(127.0.0.1) in 0 ms

com.172800  IN  NS  h.gtld-servers.net.
com.172800  IN  NS  g.gtld-servers.net.
com.172800  IN  NS  b.gtld-servers.net.
com.172800  IN  NS  j.gtld-servers.net.
com.172800  IN  NS  f.gtld-servers.net.
com.172800  IN  NS  m.gtld-servers.net.
com.172800  IN  NS  c.gtld-servers.net.
com.172800  IN  NS  d.gtld-servers.net.
com.172800  IN  NS  k.gtld-servers.net.
com.172800  IN  NS  i.gtld-servers.net.
com.172800  IN  NS  l.gtld-servers.net.
com.172800  IN  NS  a.gtld-servers.net.
com.172800  IN  NS  e.gtld-servers.net.
;; Received 835 bytes from 202.12.27.33#53(m.root-servers.net) in 104 ms

google.com. 172800  IN  NS  ns2.google.com.
google.com. 172800  IN  NS  ns1.google.com.
google.com. 172800  IN  NS  ns3.google.com.
google.com. 172800  IN  NS  ns4.google.com.
;; Received 287 bytes from 192.33.14.30#53(b.gtld-servers.net) in 44 ms

;; expected opt record in response
google.com. 300 IN  A   172.217.10.14
;; Received 44 bytes from 216.239.36.10#53(ns3.google.com) in 29 ms

Running the same trace again showed 129ms.
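(A quick way to tell whether that latency is the path or the resolver, sketched 
with mtr against the same root server the trace hit; plain ping works too if mtr 
is not installed.)

ping -c 20 202.12.27.33
mtr -n -c 50 --report 202.12.27.33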

I also located this warning:
06-Sep-2018 12:03:33.304 client: warning: client @0x7f502c1d3d50
127.0.0.1#60968 (cmail20.com.multi.surbl.org): recursive-clients soft
limit exceeded (901/900/1000), aborting oldest query

I've increased recursive-clients to 2500 but the SERVFAIL errors continue.
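(For reference, that knob lives in the options block of named.conf; a minimal 
sketch of the change, which raises the limits reported in the soft-limit warning 
above.)

options {
    recursive-clients 2500;
};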

There are also a ton of lame-server entries, many of which are related
to one RBL or another, as part of my postscreen config:
06-Sep-2018 13:16:50.686 lame-servers: info: connection refused
resolving '48.167.85.209.zz.countries.nerd.dk/A/IN': 195.182.36.121#53
06-Sep-2018 13:16:50.706 lame-servers: info: connection refused
resolving '48.167.85.209.bb.barracudacentral.org/A/IN':
64.235.154.72#53
06-Sep-2018 13:16:51.308 lame-servers: info: connection refused
resolving '48.167.85.209.bl.blocklist.de/A/IN': 185.21.103.31#53
06-Sep-2018 13:16:54.798 lame-servers: info: connection refused
resolving 'e51dd24f684d212a7da1119b23603b0f.generic.ixhash.net/A/IN':
178.254.39.16#53
06-Sep-2018 13:16:54.799 lame-servers: info: connection refused
resolving 'f4d997d8949e6dbd30f6a418ad364589.generic.ixhash.net/A/IN':
178.254.39.16#53
06-Sep-2018 13:16:55.762 lame-servers: info: connection refused
resolving '2.164.177.209.bb.barracudacentral.org/A/IN':
64.235.145.15#53
06-Sep-2018 13:16:55.845 lame-servers: info: connection refused
resolving '2.164.177.209.bb.barracudacentral.org/A/IN':
64.235.154.72#53

What would be a cause of such a significant delay in reaching the root servers?

Thanks,
Alex

Re: Frequent timeout

2018-09-03 Thread Carl Byington
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA512

On Sun, 2018-09-02 at 21:54 -0400, Alex wrote:
> Do you have any other ideas on how I can isolate this problem?

Run tcpdump on the external ethernet connection.

tcpdump -s0 -vv -i %s -nn -w /tmp/outputfile udp dst port domain


-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEAREKAAYFAluNZQ0ACgkQL6j7milTFsHM0QCfTT9yW9h1IyxI2esJxg5DA3Oh
2XIAn2Td8+gFoNYspGlup+kwHCd0irlV
=0+d4
-END PGP SIGNATURE-




Re: Frequent timeout

2018-09-02 Thread Alex
Hi,

> > When trying to resolve any of these manually, it just returns
> > NXDOMAIN.
>
> What does
>dig -4 71.161.85.209.hostkarma.junkemailfilter.com +trace +nodnssec
> show, and is it consistently NXDOMAIN? That ends here with:
>
> 71.161.85.209.hostkarma.junkemailfilter.com. 2100 IN A 127.0.0.3
> 71.161.85.209.hostkarma.junkemailfilter.com. 2100 IN A 127.0.1.1
> ;; Received 93 bytes from 184.105.182.249#53(rbl1.junkemailfilter.com)
> in 20 ms

It shows the same here now, at least for the ones which resolve.
Others still return NXDOMAIN. I was previously just using "host", but
I suppose it's also possible that's one I didn't do. It's also
possible they're no longer blacklisted by these RBLs.

My point was that none of them returned SERVFAIL. I thought using dig
or host to try and resolve the hosts would return the same SERVFAIL
when run manually as they did via the bind resolver. What could be
different that resulted in what appeared to be the majority of queries
returning SERVFAIL in the named.debug.log at the time the mail was
received?

Would high network utilization cause that? I assume that would cause
the timeout, but how can I be sure? Isn't ethernet designed to
communicate that at the lower levels to prevent that kind of thing
from occurring?

Is there a bind configuration that would make it more resilient?

> > I also isolated a packet with the "server failure" information, but
> > I'm unable to figure out what the data means. Would someone be
> > interested in evaluating it for me? It's a 146-byte pcap file.
> > https://drive.google.com/open?id=1Ui893Lg61psZCR8I_9SJtNqs-Sil_br
>
> That is just the reply from bind to some other process running on the
> same machine, reporting the server failure.

Oh, right, because it's over loopback. This is probably from postfix's
postscreen that's doing the querying.

This is not the same as one of the SERVFAIL entries from named.debug.log?

Do you have any other ideas on how I can isolate this problem?


Re: Frequent timeout

2018-09-02 Thread Carl Byington
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA512

On Sat, 2018-09-01 at 23:45 -0400, Alex wrote:


> (71.161.85.209.hostkarma.junkemailfilter.com): query failed (SERVFAIL)
> (71.161.85.209.bl.score.senderscore.com): query failed (SERVFAIL)

> When trying to resolve any of these manually, it just returns
> NXDOMAIN.

What does
   dig -4 71.161.85.209.hostkarma.junkemailfilter.com +trace +nodnssec
show, and is it consistently NXDOMAIN? That ends here with:

71.161.85.209.hostkarma.junkemailfilter.com. 2100 IN A 127.0.0.3
71.161.85.209.hostkarma.junkemailfilter.com. 2100 IN A 127.0.1.1
;; Received 93 bytes from 184.105.182.249#53(rbl1.junkemailfilter.com)
in 20 ms



> I also isolated a packet with the "server failure" information, but
> I'm unable to figure out what the data means. Would someone be
> interested in evaluating it for me? It's a 146-byte pcap file.
> https://drive.google.com/open?id=1Ui893Lg61psZCR8I_9SJtNqs-Sil_br

That is just the reply from bind to some other process running on the
same machine, reporting the server failure.


-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEAREKAAYFAluMefIACgkQL6j7milTFsETsgCgiUbEZtaS2BnRHP4VPh4ycfhF
UvwAnitRg/6OCRXvZsj9EJTygjol7M+u
=2DAt
-END PGP SIGNATURE-




Re: Frequent timeout

2018-09-01 Thread Alex
Hi,
It was reported there was a permissions problem with my Google Drive
link to the pcap file only allowing access to Google users. This
should now be public:
https://drive.google.com/file/d/1Ui893Lg61psZCR8I_9SJtNqs-Sil_br5/view?usp=sharing

Thanks,
Alex

On Sat, Sep 1, 2018 at 11:45 PM Alex  wrote:
>
> On Sat, Sep 1, 2018 at 11:25 PM Carl Byington  wrote:
> >
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA512
> >
> > On Fri, 2018-08-31 at 17:18 -0400, Alex wrote:
> > > ../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in
> >
> > After 4 seconds, I get SERVFAIL on that name.
>
> Thank you for your help. Perhaps I picked a bad example?
>
> I happened to have a grep running against my current named.debug.log,
> and as I received your email, what I believe is a much more
> representative display of the problem occurred. I also have a packet
> capture below.
>
> It's probably mangled posting it here, so I'll also put it on
> pastebin, but it's a rapid-fire display of a series of failed queries
> at once. I've cut out much of the info preceding and following to make
> it more clear here. These all occurred in less than 20ms of each
> other.
>
> (71.161.85.209.ubl.unsubscore.com): query failed (SERVFAIL)
> (71.161.85.209.dnsbl-2.uceprotect.net): query failed (SERVFAIL)
> (71.161.85.209.dnsbl.sorbs.net): query failed (SERVFAIL)
> (71.161.85.209.bad.psky.me): query failed (SERVFAIL)
> (71.161.85.209.score.senderscore.com): query failed (SERVFAIL)
> (71.161.85.209.list.dnswl.org): query failed (SERVFAIL)
> (71.161.85.209.zz.countries.nerd.dk): query failed (SERVFAIL)
> (71.161.85.209.cidr.bl.mcafee.com): query failed (SERVFAIL)
> (71.161.85.209.bl.mailspike.net): query failed (SERVFAIL)
> (71.161.85.209.wl.mailspike.net): query failed (SERVFAIL)
> (71.161.85.209.db.wpbl.info): query failed (SERVFAIL)
> (71.161.85.209.sip.helpfulblacklist.xyz): query failed (SERVFAIL)
> (71.161.85.209.dnsbl-3.uceprotect.net): query failed (SERVFAIL)
> (71.161.85.209.backscatter.spameatingmonkey.net): query failed (SERVFAIL)
> (71.161.85.209.hostkarma.junkemailfilter.com): query failed (SERVFAIL)
> (71.161.85.209.bl.score.senderscore.com): query failed (SERVFAIL)
>
> When trying to resolve any of these manually, it just returns NXDOMAIN.
>
> See the entirety of the log here:
> https://pastebin.com/JpHCDdQs
>
> Each of the lines above also has a corresponding entry like this:
>
> 01-Sep-2018 23:31:06.701 query-errors: debug 2: fetch completed at
> ../../../lib/dns/resolver.c:3927 for 71.161.85.209.bad.psky.me/A in
> 10.78: timed out/success
> [domain:psky.me,referral:0,restart:4,qrysent:8,timeout:7,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
>
> I also isolated a packet with the "server failure" information, but
> I'm unable to figure out what the data means. Would someone be
> interested in evaluating it for me? It's a 146-byte pcap file.
> https://drive.google.com/open?id=1Ui893Lg61psZCR8I_9SJtNqs-Sil_br
>
> Thanks for any ideas.
> Alex


Re: Frequent timeout

2018-09-01 Thread Alex
On Sat, Sep 1, 2018 at 11:25 PM Carl Byington  wrote:
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA512
>
> On Fri, 2018-08-31 at 17:18 -0400, Alex wrote:
> > ../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in
>
> After 4 seconds, I get SERVFAIL on that name.

Thank you for your help. Perhaps I picked a bad example?

I happened to have a grep running against my current named.debug.log,
and as I received your email, what I believe is a much more
representative display of the problem occurred. I also have a packet
capture below.

It's probably mangled posting it here, so I'll also put it on
pastebin, but it's a rapid-fire display of a series of failed queries
at once. I've cut out much of the info preceding and following to make
it more clear here. These all occurred in less than 20ms of each
other.

(71.161.85.209.ubl.unsubscore.com): query failed (SERVFAIL)
(71.161.85.209.dnsbl-2.uceprotect.net): query failed (SERVFAIL)
(71.161.85.209.dnsbl.sorbs.net): query failed (SERVFAIL)
(71.161.85.209.bad.psky.me): query failed (SERVFAIL)
(71.161.85.209.score.senderscore.com): query failed (SERVFAIL)
(71.161.85.209.list.dnswl.org): query failed (SERVFAIL)
(71.161.85.209.zz.countries.nerd.dk): query failed (SERVFAIL)
(71.161.85.209.cidr.bl.mcafee.com): query failed (SERVFAIL)
(71.161.85.209.bl.mailspike.net): query failed (SERVFAIL)
(71.161.85.209.wl.mailspike.net): query failed (SERVFAIL)
(71.161.85.209.db.wpbl.info): query failed (SERVFAIL)
(71.161.85.209.sip.helpfulblacklist.xyz): query failed (SERVFAIL)
(71.161.85.209.dnsbl-3.uceprotect.net): query failed (SERVFAIL)
(71.161.85.209.backscatter.spameatingmonkey.net): query failed (SERVFAIL)
(71.161.85.209.hostkarma.junkemailfilter.com): query failed (SERVFAIL)
(71.161.85.209.bl.score.senderscore.com): query failed (SERVFAIL)

When trying to resolve any of these manually, it just returns NXDOMAIN.

See the entirety of the log here:
https://pastebin.com/JpHCDdQs

Each of the lines above also has a corresponding entry like this:

01-Sep-2018 23:31:06.701 query-errors: debug 2: fetch completed at
../../../lib/dns/resolver.c:3927 for 71.161.85.209.bad.psky.me/A in
10.78: timed out/success
[domain:psky.me,referral:0,restart:4,qrysent:8,timeout:7,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
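(For anyone who wants to pull the same pairs of lines out of the log, something 
like this works; run it against wherever your named.debug.log lives.)

grep -E 'query failed \(SERVFAIL\)|timed out/success' named.debug.log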

I also isolated a packet with the "server failure" information, but
I'm unable to figure out what the data means. Would someone be
interested in evaluating it for me? It's a 146-byte pcap file.
https://drive.google.com/open?id=1Ui893Lg61psZCR8I_9SJtNqs-Sil_br

Thanks for any ideas.
Alex


Re: Frequent timeout

2018-09-01 Thread Carl Byington
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA512

On Fri, 2018-08-31 at 17:18 -0400, Alex wrote:
> ../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in

After 4 seconds, I get SERVFAIL on that name.


> ../../../lib/dns/resolver.c:3927 for dell.ns.cloudflare.com/A in

That name resolves here very quickly.


-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEAREKAAYFAluLV+AACgkQL6j7milTFsGAhwCfYmXS+l5XK0dl8oMDniz/eVIn
MXcAn0Com++6PPkec7Cb7GS6qvBjai8b
=AnFC
-END PGP SIGNATURE-





Re: Frequent timeout

2018-08-31 Thread Chuck Swiger via bind-users
Hi, Alex--

On Aug 31, 2018, at 3:49 PM, Alex  wrote:
> The interface does show some packet loss:
> 
> br0: flags=4163  mtu 1500
> [ ... ]
>RX packets 1610535  bytes 963148307 (918.5 MiB)
>RX errors 0  dropped 5066  overruns 0  frame 0
> 
> Is some packet loss such as the above to be expected? I recall doing
> some network tests some time ago and found much of it was IPv6
> traffic, which is not being used.

0.3% dropped packets is a bit unusual for a NIC running against a switch;
it would be quite normal for a hub.  However, Linux tends to also count
various things like unknown VLAN tags, unknown protocols (ie, IPv6 traffic
on an IPv4-only host), etc as dropped RX packets.

Supposedly ethtool -S helps distinguish between actual interface errors
and traffic that your machine chooses to drop.
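(Something like this narrows the -S output to the interesting counters, assuming 
the NIC driver exposes per-counter statistics:)

ethtool -S eth0 | grep -Ei 'drop|err|miss|fifo'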

Regards,
-- 
-Chuck



Re: Frequent timeout

2018-08-31 Thread Alex
Hi,

On Fri, Aug 31, 2018 at 5:54 PM Darcy, Kevin  wrote:
>
> I'll second the use of tcpdump, and also add that DNS query traffic, using 
> UDP by default, tends to be hypersensitive to packet loss. TCP will retry and 
> folks may not even notice a slight drop in performance, but DNS queries, 
> under the same conditions, can fail completely. Thus, DNS is often the 
> "canary in the coal mine" for conditions which lead to packet loss, sometimes 
> even an early warning of developing WAN and/or configuration issues.

Thanks so much for your help. I have some familiarity with tcpdump and
will investigate.

The interface does show some packet loss:

br0: flags=4163  mtu 1500
inet 68.195.193.45  netmask 255.255.255.248  broadcast 68.195.193.47
inet6 fe80::16da:e9ff:fe97:ab71  prefixlen 64  scopeid 0x20
inet6 ::16da:e9ff:fe97:ab71  prefixlen 64  scopeid 0x0
ether 14:da:e9:97:ab:71  txqueuelen 1000  (Ethernet)
RX packets 1610535  bytes 963148307 (918.5 MiB)
RX errors 0  dropped 5066  overruns 0  frame 0
TX packets 1958053  bytes 1243814299 (1.1 GiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

# uptime
 18:45:08 up  2:49,  1 user,  load average: 0.46, 0.53, 0.66

Is some packet loss such as the above to be expected? I recall doing
some network tests some time ago and found much of it was IPv6
traffic, which is not being used.

bind is running on localhost, so I will trace packets there, but what
am I looking for, to suspect it's a network problem? Will the normal
tcpdump packet size defaults suffice, or should I be capturing larger
amounts from each packet?

This is what I'll be doing for Labor Day weekend, so any help would
really be appreciated. Cablevision/Optonline has told me there are no
problems, but their tests aren't very thorough - if ping works and
doesn't drop packets at that particular time, the link must be fine.

Thanks,
Alex





>
>   
>  - Kevin
>
> On Fri, Aug 31, 2018 at 5:36 PM John W. Blue via bind-users 
>  wrote:
>>
>> tcpdump is your newest best friend to troubleshoot network issues.  You need 
>> to see what (if anything) is being placed on the wire and the responses (if 
>> any).  My goto syntax is:
>>
>> tcpdump -n -i eth0 port domain
>>
>> I like -n because it prevents a PTR lookup from happening.  Why add extra 
>> noise?  As with anything troubleshooting related it is a process of 
>> elimination.
>>
>> Good hunting!
>>
>> John
>>
>> Sent from Nine
>> 
>> From: Alex 
>> Sent: Friday, August 31, 2018 4:20 PM
>> To: bind-users@lists.isc.org
>> Subject: Frequent timeout
>>
>> Hi,
>>
>> Would someone please help me understand why I'm receiving so many
>> timeouts? This is on a fedora28 system with bind-9.11.4 acting as a
>> mail server and running on a cable modem.
>>
>> It appears to happen during all times, including when the link is
>> otherwise idle.
>>
>> 31-Aug-2018 16:52:57.297 query-errors: debug 2: fetch completed at
>> ../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in
>> 10.000171: timed out/success
>> [domain:support.coxbusiness.com,referral:2,restart:4,qrysent:5,timeout:4,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
>>
>> 31-Aug-2018 17:06:42.655 query-errors: debug 2: fetch completed at
>> ../../../lib/dns/resolver.c:3927 for dell.ns.cloudflare.com/A in
>> 10.000108: timed out/success
>> [domain:cloudflare.com,referral:0,restart:2,qrysent:13,timeout:12,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
>>
>> What more information can I provide to troubleshoot this?
>>
>> Is it possible that even though the link otherwise seems to be
>> operating okay that there could still be some problem that would
>> affect DNS traffic?
>>
>> I've also cleared all firewall rules, and it's not even all queries which fail.
>>
>> Thanks,
>> Alex

Re: Frequent timeout

2018-08-31 Thread Darcy, Kevin
I'll second the use of tcpdump, and also add that DNS query traffic, using
UDP by default, tends to be hypersensitive to packet loss. TCP will retry
and folks may not even notice a slight drop in performance, but DNS
queries, under the same conditions, can fail completely. Thus, DNS is often
the "canary in the coal mine" for conditions which lead to packet loss,
sometimes even an early warning of developing WAN and/or configuration
issues.
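(One way to see that difference in practice, with an assumed test name: run the 
same query over UDP and then force TCP with +tcp.  If the +tcp query succeeds 
consistently while the plain UDP query times out, packet loss on the UDP path 
is the likely culprit.)

dig @127.0.0.1 example.com A
dig @127.0.0.1 example.com A +tcp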


   - Kevin

On Fri, Aug 31, 2018 at 5:36 PM John W. Blue via bind-users <
bind-users@lists.isc.org> wrote:

> tcpdump is your newest best friend to troubleshoot network issues.  You
> need to see what (if anything) is being placed on the wire and the
> responses (if any).  My goto syntax is:
>
> tcpdump -n -i eth0 port domain
>
> I like -n because it prevents a PTR lookup from happening.  Why add extra
> noise?  As with anything troubleshooting related it is a process of
> elimination.
>
> Good hunting!
>
> John
>
> Sent from Nine <http://www.9folders.com/>
> --
> *From:* Alex 
> *Sent:* Friday, August 31, 2018 4:20 PM
> *To:* bind-users@lists.isc.org
> *Subject:* Frequent timeout
>
> Hi,
>
> Would someone please help me understand why I'm receiving so many
> timeouts? This is on a fedora28 system with bind-9.11.4 acting as a
> mail server and running on a cable modem.
>
> It appears to happen during all times, including when the link is
> otherwise idle.
>
> 31-Aug-2018 16:52:57.297 query-errors: debug 2: fetch completed at
> ../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in
> 10.000171: timed out/success
> [domain:support.coxbusiness.com
> ,referral:2,restart:4,qrysent:5,timeout:4,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
>
> 31-Aug-2018 17:06:42.655 query-errors: debug 2: fetch completed at
> ../../../lib/dns/resolver.c:3927 for dell.ns.cloudflare.com/A in
> 10.000108: timed out/success
> [domain:cloudflare.com
> ,referral:0,restart:2,qrysent:13,timeout:12,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
>
> What more information can I provide to troubleshoot this?
>
> Is it possible that even though the link otherwise seems to be
> operating okay that there could still be some problem that would
> affect DNS traffic?
>
> I've also cleared all firewall rules, and it's not even all queries which
> fail.
>
> Thanks,
> Alex


Re: Frequent timeout

2018-08-31 Thread John W. Blue via bind-users
tcpdump is your newest best friend to troubleshoot network issues.  You need to 
see what (if anything) is being placed on the wire and the responses (if any).  
My goto syntax is:

tcpdump -n -i eth0 port domain

I like -n because it prevents a PTR lookup from happening.  Why add extra noise?  
As with anything troubleshooting related, it is a process of elimination.

Good hunting!

John

Sent from Nine<http://www.9folders.com/>

From: Alex 
Sent: Friday, August 31, 2018 4:20 PM
To: bind-users@lists.isc.org
Subject: Frequent timeout

Hi,

Would someone please help me understand why I'm receiving so many
timeouts? This is on a fedora28 system with bind-9.11.4 acting as a
mail server and running on a cable modem.

It appears to happen during all times, including when the link is
otherwise idle.

31-Aug-2018 16:52:57.297 query-errors: debug 2: fetch completed at
../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in
10.000171: timed out/success
[domain:support.coxbusiness.com,referral:2,restart:4,qrysent:5,timeout:4,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

31-Aug-2018 17:06:42.655 query-errors: debug 2: fetch completed at
../../../lib/dns/resolver.c:3927 for dell.ns.cloudflare.com/A in
10.000108: timed out/success
[domain:cloudflare.com,referral:0,restart:2,qrysent:13,timeout:12,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

What more information can I provide to troubleshoot this?

Is it possible that even though the link otherwise seems to be
operating okay that there could still be some problem that would
affect DNS traffic?

I've also cleared all firewall rules, and it's not even all queries which fail.

Thanks,
Alex


Frequent timeout

2018-08-31 Thread Alex
Hi,

Would someone please help me understand why I'm receiving so many
timeouts? This is on a fedora28 system with bind-9.11.4 acting as a
mail server and running on a cable modem.

It appears to happen during all times, including when the link is
otherwise idle.

31-Aug-2018 16:52:57.297 query-errors: debug 2: fetch completed at
../../../lib/dns/resolver.c:3927 for support.coxbusiness.com/A in
10.000171: timed out/success
[domain:support.coxbusiness.com,referral:2,restart:4,qrysent:5,timeout:4,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

31-Aug-2018 17:06:42.655 query-errors: debug 2: fetch completed at
../../../lib/dns/resolver.c:3927 for dell.ns.cloudflare.com/A in
10.000108: timed out/success
[domain:cloudflare.com,referral:0,restart:2,qrysent:13,timeout:12,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

What more information can I provide to troubleshoot this?

Is it possible that even though the link otherwise seems to be
operating okay that there could still be some problem that would
affect DNS traffic?

I've also cleared all firewall rules, and it's not even all queries which fail.

Thanks,
Alex