I got a trouble ticket on this too.
From the looks of things, Cisco is using GSSes (Global Site Selectors)
to load-balance this site. A GSS returns SERVFAIL if all of the
resources behind the load-balancer are down (which it determines via a
heartbeat mechanism). So I think this is a "simple" case of a website
(or cluster) going down. It was down earlier today, then came back up;
as of this writing, it is down again.
DNS doesn't really have a response code of "requested resource not
available", so SERVFAIL is Cisco's closest approximation. It has the
drawback, however, of often making other sorts of problems appear to be
DNS problems. That's just a cross that we DNS admins have to bear...
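
To make the response-code point concrete, here is a minimal Python sketch that decodes the RCODE field of a raw DNS message. The table holds the six response codes from RFC 1035; none of them means "resource not available". The helper names (RCODES, encode_name, rcode_of) are made up for illustration, and the sample reply is hand-built to mirror the SERVFAIL answer in the quoted dig output below (id 5165, flags qr rd, one question, no records); it is not a captured packet.

```python
import struct

# RCODE values defined by RFC 1035 (low 4 bits of the header flags word).
RCODES = {
    0: "NOERROR",
    1: "FORMERR",
    2: "SERVFAIL",
    3: "NXDOMAIN",
    4: "NOTIMP",
    5: "REFUSED",
}

def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

def rcode_of(message):
    """Return the symbolic response code carried in a raw DNS message."""
    _ident, flags = struct.unpack("!HH", message[:4])
    code = flags & 0x000F
    return RCODES.get(code, "RCODE%d" % code)

# Hand-built reply: QR and RD set, RCODE 2, one question, no records.
header = struct.pack("!HHHHHH", 5165, 0x8102, 1, 0, 0, 0)
question = encode_name("tools.cisco.com") + struct.pack("!HH", 1, 1)  # A, IN
reply = header + question

print(len(reply), rcode_of(reply))  # prints: 33 SERVFAIL
```

The 33-byte length matches the "MSG SIZE rcvd: 33" reported by dig: 12 bytes of header, 17 for the encoded name, and 4 for type and class.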
- Kevin
On 3/1/2011 4:08 PM, Mike Bernhardt wrote:
I should add that tools.cisco.com was resolvable at one time, so either
Cisco's behavior has changed, or our firewall's behavior has changed. We
obviously haven't upgraded our BIND version in a while (9.4.3-P3), so I don't
think the problem is BIND.
-----Original Message-----
From: Mike Bernhardt [mailto:bernha...@bart.gov]
Sent: Tuesday, March 01, 2011 12:40 PM
To: bind-users@lists.isc.org
Subject: Help with unresolvable domain (subdomain, actually)
For some reason, we can no longer resolve tools.cisco.com. There are several
clues to the problem but I can't put them together. Here is some dig output.
I know that the time stamps don't all match up below, but the results are
typical:
[root@ns1 ~]# dig +trace -b 148.165.3.10 tools.cisco.com
;<<>> DiG 9.4.3-P3<<>> +trace -b 148.165.3.10 tools.cisco.com
;; global options: printcmd
. 90550 IN NS i.root-servers.net.
. 90550 IN NS h.root-servers.net.
. 90550 IN NS e.root-servers.net.
. 90550 IN NS d.root-servers.net.
. 90550 IN NS j.root-servers.net.
. 90550 IN NS k.root-servers.net.
. 90550 IN NS l.root-servers.net.
. 90550 IN NS g.root-servers.net.
. 90550 IN NS f.root-servers.net.
. 90550 IN NS a.root-servers.net.
. 90550 IN NS m.root-servers.net.
. 90550 IN NS c.root-servers.net.
. 90550 IN NS b.root-servers.net.
;; Received 512 bytes from 148.165.3.10#53(148.165.3.10) in 0 ms
com. 172800 IN NS l.gtld-servers.net.
com. 172800 IN NS e.gtld-servers.net.
com. 172800 IN NS k.gtld-servers.net.
com. 172800 IN NS i.gtld-servers.net.
com. 172800 IN NS m.gtld-servers.net.
com. 172800 IN NS j.gtld-servers.net.
com. 172800 IN NS a.gtld-servers.net.
com. 172800 IN NS g.gtld-servers.net.
com. 172800 IN NS c.gtld-servers.net.
com. 172800 IN NS f.gtld-servers.net.
com. 172800 IN NS b.gtld-servers.net.
com. 172800 IN NS d.gtld-servers.net.
com. 172800 IN NS h.gtld-servers.net.
;; Received 505 bytes from 198.41.0.4#53(a.root-servers.net) in 13 ms
cisco.com. 172800 IN NS ns1.cisco.com.
cisco.com. 172800 IN NS ns2.cisco.com.
;; Received 101 bytes from 192.54.112.30#53(h.gtld-servers.net) in 154 ms
tools.cisco.com. 86400 IN NS rcdn9-14p-dcz05n-gss1.cisco.com.
tools.cisco.com. 86400 IN NS rtp5-dmz-gss1.cisco.com.
tools.cisco.com. 86400 IN NS sjck-dmz-gss1.cisco.com.
tools.cisco.com. 86400 IN NS cax01-bb14-dcz01n-gss1.cisco.com.
;; Received 226 bytes from 64.102.255.44#53(ns2.cisco.com) in 75 ms
;; Received 33 bytes from 72.163.4.28#53(rcdn9-14p-dcz05n-gss1.cisco.com) in 47 ms
Now, focusing in on rtp5-dmz-gss1.cisco.com for further analysis (just
picked it out of the group):
[root@ns1 ~]# dig -b 148.165.3.10 @rtp5-dmz-gss1.cisco.com tools.cisco.com
;<<>> DiG 9.4.3-P3<<>> -b 148.165.3.10 @rtp5-dmz-gss1.cisco.com tools.cisco.com
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5165
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;tools.cisco.com. IN A
;; Query time: 75 msec
;; SERVER: 64.102.246.5#53(64.102.246.5)
;; WHEN: Tue Mar 1 12:22:57 2011
;; MSG SIZE rcvd: 33
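
Only one of the four delegated servers was queried above. A rough Python sketch for asking each of them directly and checking whether they all answer SERVFAIL might look like the following. The server names come from the trace; the function names (build_query, rcode_of, probe, all_servfail), the query id, and the 3-second timeout are assumptions made for illustration. Unlike the dig -b invocation, this does not bind to a specific source address.

```python
import socket
import struct

# Name servers delegated for tools.cisco.com, as listed in the trace above.
GSS_SERVERS = [
    "rcdn9-14p-dcz05n-gss1.cisco.com",
    "rtp5-dmz-gss1.cisco.com",
    "sjck-dmz-gss1.cisco.com",
    "cax01-bb14-dcz01n-gss1.cisco.com",
]

SERVFAIL = 2  # RCODE 2 in the DNS header

def build_query(name, ident=0x1234):
    """Build an A/IN query with the RD flag set, as dig does by default."""
    header = struct.pack("!HHHHHH", ident, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)

def rcode_of(message):
    """Return the numeric RCODE (low 4 bits of the flags word)."""
    return struct.unpack("!H", message[2:4])[0] & 0x000F

def probe(server, name="tools.cisco.com", timeout=3.0):
    """Send one UDP query; return the RCODE, or None if no usable reply."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            data, _addr = sock.recvfrom(512)
        except OSError:  # covers timeouts and name-resolution failures
            return None
    if len(data) < 4 or data[:2] != query[:2]:  # too short or wrong id
        return None
    return rcode_of(data)

def all_servfail(results):
    """True only if every probed server answered with SERVFAIL."""
    return bool(results) and all(c == SERVFAIL for c in results.values())
```

Something like results = {s: probe(s) for s in GSS_SERVERS} followed by all_servfail(results) would then show whether the failure is common to all four servers or specific to one of them.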
Here is the output of tcpdump on my server, querying the same server via
nslookup elsewhere:
[root@ns1 ~]# tcpdump host -i bond0 64.102.246.5 -n -p -vvv
tcpdump: listening on bond0, link-type EN10MB (Ethernet), capture size 96 bytes
12:14:53.373614 IP (tos 0x0, ttl 64, id 45237, offset 0, flags [none], proto: UDP (17), length: 61) 148.165.3.10.18673 > 64.102.246.5.domain: [bad udp cksum a78b!] 26095 A? tools.cisco.com. (33)
12:14:53.455684 IP (tos 0x0, ttl 54, id 7623, offset 0, flags [DF], proto: UDP (17), length: 61) 64.102.246.5.domain > 148.165.3.10.18673: [udp sum ok] 26095 ServFail- q: A? tools.cisco.com. 0/0/0 (33)
Lastly, I see on our firewall log that we have a Checkpoint Smart Defense
log entry due to its belief that Cisco is sending us a malformed query
packet, and it's being dropped. I don't know why they're sending the query
in the first place.
Number: 2595791
Date: 1Mar2011
Time: 12:22:53
Type: Log
Action: Drop
Service: domain-udp (53)
Source Port: domain-udp
Source: rtp5-dmz-gss1.cisco.com
Destination: ns
Protocol: udp
Information: Packet info: Packet data size: 28
Attack: Malformed Packet
Attack Information: UDP length error
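
For reference, the kind of consistency check that a "UDP length error" suggests can be sketched in Python as below. This is an assumption about what such a check involves, not Check Point's actual logic; udp_length_ok is a made-up name, and the strict equality test is a simplification. The sample numbers come from the tcpdump capture above (IP length 61, DNS payload 33 bytes), which is a different packet from the one in this log entry; the 20-byte IPv4 header (no options) is also assumed.

```python
UDP_HEADER = 8  # bytes in a UDP header

def udp_length_ok(ip_total_length, ip_header_length, udp_length_field):
    """Check that a UDP length field agrees with the enclosing IPv4 packet."""
    if udp_length_field < UDP_HEADER:
        return False  # shorter than the UDP header itself
    return udp_length_field == ip_total_length - ip_header_length

# Values from the captured SERVFAIL reply: 61 = 20 (IP) + 8 (UDP) + 33 (DNS).
ip_total, ip_header, dns_payload = 61, 20, 33
print(udp_length_ok(ip_total, ip_header, UDP_HEADER + dns_payload))  # True
```

The reply seen by tcpdump is consistent by this measure, so whatever the firewall dropped at 12:22:53 was evidently a different packet.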
Any ideas as to where the problem lies so I can pursue it further?
_______________________________________________
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users