RE: Help with unresolvable domain (subdomain, actually)

2011-03-01 Thread Mike Bernhardt
I should add that tools.cisco.com was resolvable at one time, so either
Cisco's behavior has changed, or our firewall's behavior has changed. We
obviously haven't upgraded our BIND version in a while (9.4.3P3), so I don't
think the problem is BIND.

-Original Message-
From: Mike Bernhardt [mailto:bernha...@bart.gov] 
Sent: Tuesday, March 01, 2011 12:40 PM
To: bind-users@lists.isc.org
Subject: Help with unresolvable domain (subdomain, actually)

For some reason, we can no longer resolve tools.cisco.com. There are several
clues to the problem but I can't put them together. Here is some dig output.
I know that the time stamps don't all match up below, but the results are
typical:

[root@ns1 ~]# dig +trace -b 148.165.3.10 tools.cisco.com

; <<>> DiG 9.4.3-P3 <<>> +trace -b 148.165.3.10 tools.cisco.com
;; global options:  printcmd
.   90550   IN  NS  i.root-servers.net.
.   90550   IN  NS  h.root-servers.net.
.   90550   IN  NS  e.root-servers.net.
.   90550   IN  NS  d.root-servers.net.
.   90550   IN  NS  j.root-servers.net.
.   90550   IN  NS  k.root-servers.net.
.   90550   IN  NS  l.root-servers.net.
.   90550   IN  NS  g.root-servers.net.
.   90550   IN  NS  f.root-servers.net.
.   90550   IN  NS  a.root-servers.net.
.   90550   IN  NS  m.root-servers.net.
.   90550   IN  NS  c.root-servers.net.
.   90550   IN  NS  b.root-servers.net.
;; Received 512 bytes from 148.165.3.10#53(148.165.3.10) in 0 ms

com.    172800  IN  NS  l.gtld-servers.net.
com.    172800  IN  NS  e.gtld-servers.net.
com.    172800  IN  NS  k.gtld-servers.net.
com.    172800  IN  NS  i.gtld-servers.net.
com.    172800  IN  NS  m.gtld-servers.net.
com.    172800  IN  NS  j.gtld-servers.net.
com.    172800  IN  NS  a.gtld-servers.net.
com.    172800  IN  NS  g.gtld-servers.net.
com.    172800  IN  NS  c.gtld-servers.net.
com.    172800  IN  NS  f.gtld-servers.net.
com.    172800  IN  NS  b.gtld-servers.net.
com.    172800  IN  NS  d.gtld-servers.net.
com.    172800  IN  NS  h.gtld-servers.net.
;; Received 505 bytes from 198.41.0.4#53(a.root-servers.net) in 13 ms

cisco.com.  172800  IN  NS  ns1.cisco.com.
cisco.com.  172800  IN  NS  ns2.cisco.com.
;; Received 101 bytes from 192.54.112.30#53(h.gtld-servers.net) in 154 ms

tools.cisco.com. 86400  IN  NS  rcdn9-14p-dcz05n-gss1.cisco.com.
tools.cisco.com. 86400  IN  NS  rtp5-dmz-gss1.cisco.com.
tools.cisco.com. 86400  IN  NS  sjck-dmz-gss1.cisco.com.
tools.cisco.com. 86400  IN  NS  cax01-bb14-dcz01n-gss1.cisco.com.
;; Received 226 bytes from 64.102.255.44#53(ns2.cisco.com) in 75 ms

;; Received 33 bytes from 72.163.4.28#53(rcdn9-14p-dcz05n-gss1.cisco.com) in 47 ms

Now, focusing in on rtp5-dmz-gss1.cisco.com for further analysis (just
picked it out of the group):
[root@ns1 ~]# dig -b 148.165.3.10 @rtp5-dmz-gss1.cisco.com tools.cisco.com

; <<>> DiG 9.4.3-P3 <<>> -b 148.165.3.10 @rtp5-dmz-gss1.cisco.com
tools.cisco.com
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5165
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;tools.cisco.com.   IN  A

;; Query time: 75 msec
;; SERVER: 64.102.246.5#53(64.102.246.5)
;; WHEN: Tue Mar  1 12:22:57 2011
;; MSG SIZE  rcvd: 33


Here is the output of tcpdump on my server, querying the same server via
nslookup elsewhere:
[root@ns1 ~]# tcpdump host -i bond0 64.102.246.5 -n -p -vvv
tcpdump: listening on bond0, link-type EN10MB (Ethernet), capture size 96 bytes
12:14:53.373614 IP (tos 0x0, ttl  64, id 45237, offset 0, flags [none], proto: UDP (17), length: 61) 148.165.3.10.18673 > 64.102.246.5.domain: [bad udp cksum a78b!]  26095 A? tools.cisco.com. (33)
12:14:53.455684 IP (tos 0x0, ttl  54, id 7623, offset 0, flags [DF], proto: UDP (17), length: 61) 64.102.246.5.domain > 148.165.3.10.18673: [udp sum ok] 26095 ServFail- q: A? tools.cisco.com. 0/0/0 (33)

Lastly, I see on our firewall log that we have a Checkpoint Smart Defense
log entry due to its belief that Cisco is sending us a malformed query
packet, and it's being dropped. I don't know why they're sending the query
in the first place.
Number: 2595791
Date: 

Re: Help with unresolvable domain (subdomain, actually)

2011-03-01 Thread Mark Andrews

Ring Cisco and complain that their nameservers are broken for the
zone.

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 13389
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;tools.cisco.com.   IN  A

;; Query time: 204 msec
;; SERVER: 72.163.4.28#53(rcdn9-14p-dcz05n-gss1.cisco.com)
;; WHEN: Wed Mar  2 08:23:59 2011
;; MSG SIZE  rcvd: 33

-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org
___
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Help with unresolvable domain (subdomain, actually)

2011-03-01 Thread Shaoquan Lin

At first I was not able to resolve it, and got the same result as you:

$ dig +trace tools.cisco.com

; <<>> DiG 9.6.1-P3 <<>> +trace tools.cisco.com
;; global options: +cmd
.   63808   IN  NS  a.root-servers.net.
.   63808   IN  NS  l.root-servers.net.
.   63808   IN  NS  d.root-servers.net.
.   63808   IN  NS  b.root-servers.net.
.   63808   IN  NS  m.root-servers.net.
.   63808   IN  NS  e.root-servers.net.
.   63808   IN  NS  h.root-servers.net.
.   63808   IN  NS  g.root-servers.net.
.   63808   IN  NS  c.root-servers.net.
.   63808   IN  NS  f.root-servers.net.
.   63808   IN  NS  k.root-servers.net.
.   63808   IN  NS  j.root-servers.net.
.   63808   IN  NS  i.root-servers.net.
;; Received 460 bytes from 134.74.14.2#53(134.74.14.2) in 8 ms

com.    172800  IN  NS  l.gtld-servers.net.
com.    172800  IN  NS  c.gtld-servers.net.
com.    172800  IN  NS  k.gtld-servers.net.
com.    172800  IN  NS  f.gtld-servers.net.
com.    172800  IN  NS  d.gtld-servers.net.
com.    172800  IN  NS  i.gtld-servers.net.
com.    172800  IN  NS  j.gtld-servers.net.
com.    172800  IN  NS  g.gtld-servers.net.
com.    172800  IN  NS  h.gtld-servers.net.
com.    172800  IN  NS  m.gtld-servers.net.
com.    172800  IN  NS  e.gtld-servers.net.
com.    172800  IN  NS  b.gtld-servers.net.
com.    172800  IN  NS  a.gtld-servers.net.
;; Received 493 bytes from 192.5.5.241#53(f.root-servers.net) in 77 ms

cisco.com.  172800  IN  NS  ns1.cisco.com.
cisco.com.  172800  IN  NS  ns2.cisco.com.
;; Received 101 bytes from 192.43.172.30#53(i.gtld-servers.net) in 79 ms

tools.cisco.com. 86400  IN  NS  sjck-dmz-gss1.cisco.com.
tools.cisco.com. 86400  IN  NS  rtp5-dmz-gss1.cisco.com.
tools.cisco.com. 86400  IN  NS  rcdn9-14p-dcz05n-gss1.cisco.com.
tools.cisco.com. 86400  IN  NS  cax01-bb14-dcz01n-gss1.cisco.com.
;; Received 226 bytes from 128.107.241.185#53(ns1.cisco.com) in 80 ms

;; Received 33 bytes from 173.37.144.100#53(cax01-bb14-dcz01n-gss1.cisco.com) in 45 ms


But a few minutes later, without any change on my site, I was able to
resolve it:

$ host tools.cisco.com.
tools.cisco.com has address 128.107.242.16

$ dig +trace tools.cisco.com

; <<>> DiG 9.6.1-P3 <<>> +trace tools.cisco.com
;; global options: +cmd
.   63242   IN  NS  l.root-servers.net.
.   63242   IN  NS  m.root-servers.net.
.   63242   IN  NS  f.root-servers.net.
.   63242   IN  NS  k.root-servers.net.
.   63242   IN  NS  j.root-servers.net.
.   63242   IN  NS  d.root-servers.net.
.   63242   IN  NS  g.root-servers.net.
.   63242   IN  NS  h.root-servers.net.
.   63242   IN  NS  i.root-servers.net.
.   63242   IN  NS  e.root-servers.net.
.   63242   IN  NS  c.root-servers.net.
.   63242   IN  NS  a.root-servers.net.
.   63242   IN  NS  b.root-servers.net.
;; Received 488 bytes from 134.74.14.2#53(134.74.14.2) in 7 ms

com.    172800  IN  NS  d.gtld-servers.net.
com.    172800  IN  NS  k.gtld-servers.net.
com.    172800  IN  NS  a.gtld-servers.net.
com.    172800  IN  NS  j.gtld-servers.net.
com.    172800  IN  NS  m.gtld-servers.net.
com.    172800  IN  NS  h.gtld-servers.net.
com.    172800  IN  NS  f.gtld-servers.net.
com.    172800  IN  NS  b.gtld-servers.net.
com.    172800  IN  NS  e.gtld-servers.net.
com.    172800  IN  NS  g.gtld-servers.net.
com.    172800  IN  NS  l.gtld-servers.net.
com.    172800  IN  NS  i.gtld-servers.net.
com.    172800  IN  NS  c.gtld-servers.net.
;; Received 505 bytes from 198.41.0.4#53(a.root-servers.net) in 13 ms

cisco.com.  172800  IN  NS  ns1.cisco.com.
cisco.com.  172800  IN  NS  

Re: Help with unresolvable domain (subdomain, actually)

2011-03-01 Thread Kevin Darcy

I got a trouble ticket on this too.

From the looks of things, Cisco is using GSSes to load-balance this
site. GSSes return SERVFAIL if all of the resources behind the
load-balancer are down (which they determine via a heartbeat mechanism).
So I think this is a "simple" case of a website (or cluster) going down.
It was down earlier today, then up again; as of this writing, it is down
again.


DNS doesn't really have a response code of "requested resource not 
available", so SERVFAIL is Cisco's closest approximation. It has the 
drawback, however, of often making other sorts of problems appear to be 
DNS problems. That's just a cross that we DNS admins have to bear...
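
For reference, here is a minimal Python sketch of where the RCODE sits in a DNS message. It hand-builds a header-plus-question SERVFAIL reply for tools.cisco.com, like the one in the dig output in this thread, and decodes the low four bits of the flags word. The helper names (encode_name, build_reply, rcode_of) are made up for illustration; the RCODE numbers are those of RFC 1035, plus NOTAUTH from RFC 2136.

```python
import struct

# RCODE values from RFC 1035 (0-5) and RFC 2136 (NOTAUTH).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED", 9: "NOTAUTH"}

def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

def build_reply(msg_id, rcode, qname, rd=True):
    """Build a reply holding only a header and a question (illustrative)."""
    flags = 0x8000 | (0x0100 if rd else 0) | rcode   # QR=1, optional RD, RCODE
    header = struct.pack("!HHHHHH", msg_id, flags, 1, 0, 0, 0)
    question = encode_name(qname) + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def rcode_of(message):
    """Return the RCODE mnemonic from the low 4 bits of the flags word."""
    (flags,) = struct.unpack("!H", message[2:4])
    return RCODES.get(flags & 0x000F, "UNKNOWN")

reply = build_reply(5165, 2, "tools.cisco.com")
print(len(reply), rcode_of(reply))
```

The message comes out at 33 bytes (12 for the header, 21 for the question), which matches the "MSG SIZE rcvd: 33" in the SERVFAIL replies: they carry nothing beyond the echoed question and the RCODE.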




- Kevin



Re: Help with unresolvable domain (subdomain, actually)

2011-03-01 Thread Kevin Darcy
See my other post. This is designed-in behavior for Cisco GSSes, since 
there is no "service unavailable, try again later" RCODE.




- Kevin


On 3/1/2011 4:25 PM, Mark Andrews wrote:

Ring Cisco and complain that their nameservers are broken for the
zone.

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 13389
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;tools.cisco.com.   IN  A

;; Query time: 204 msec
;; SERVER: 72.163.4.28#53(rcdn9-14p-dcz05n-gss1.cisco.com)
;; WHEN: Wed Mar  2 08:23:59 2011
;; MSG SIZE  rcvd: 33






Re: Help with unresolvable domain (subdomain, actually)

2011-03-01 Thread Mark Andrews

In message <4d6d7268.1080...@chrysler.com>, Kevin Darcy writes:
> I got a trouble ticket on this too.
> 
>  From the looks of things, Cisco is using GSSes to load-balance this 
> site. GSSes return SERVFAIL if all of the resources behind the 
> load-balancer are down (which it determines via a heartbeat mechanism). 
> So I think this is a "simple" case of a website (or cluster) going down. 
> It was down earlier today, then up again, as of this writing, it is down 
> again.
> 
> DNS doesn't really have a response code of "requested resource not 
> available", so SERVFAIL is Cisco's closest approximation. It has the 
> drawback, however, of often making other sorts of problems appear to be 
> DNS problems. That's just a cross that we DNS admins have to bear...
>  
>  - Kevin

Then the load balancer should return default records or 0.0.0.0/:: to
indicate the name is good but doesn't currently have an address.
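
A rough sketch of how a client could act on such a placeholder answer, using Python's standard ipaddress module. The helper names are hypothetical, and nothing in this thread says any existing client or resolver treats 0.0.0.0 or :: this way; it only illustrates that the two addresses are easy to recognise.

```python
import ipaddress

def is_placeholder(addr):
    """True if addr is the unspecified address (0.0.0.0 or ::)."""
    return ipaddress.ip_address(addr).is_unspecified

def usable_addresses(answers):
    """Drop placeholder addresses from a list of A/AAAA answer strings."""
    return [a for a in answers if not is_placeholder(a)]

# An empty result would mean "the name exists but has no address right now".
print(usable_addresses(["0.0.0.0"]))
print(usable_addresses(["::", "128.107.242.16"]))
```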

Mark
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org


Re: Help with unresolvable domain (subdomain, actually)

2011-03-02 Thread David Sparro



On 3/1/2011 5:27 PM, Kevin Darcy wrote:

See my other post. This is designed-in behavior for Cisco GSSes, since
there is no "service unavailable, try again later" RCODE.

- Kevin



When the question is "what is the ip address of 'foo'" an answer of "the
web server is down" is nonsensical.


--
Dave


Re: Help with unresolvable domain (subdomain, actually)

2011-03-02 Thread Warren Kumari


On Mar 1, 2011, at 5:27 PM, Kevin Darcy wrote:

See my other post. This is designed-in behavior for Cisco GSSes,  
since there is no "service unavailable, try again later" RCODE.


Yes[0].

W

[0]:  there is no "service unavailable, try again later" RCODE.






   - Kevin

On 3/1/2011 4:25 PM, Mark Andrews wrote:

Ring Cisco and complain that their nameservers are broken for the
zone.

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 13389
;; flags: qr; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;tools.cisco.com.   IN  A

;; Query time: 204 msec
;; SERVER: 72.163.4.28#53(rcdn9-14p-dcz05n-gss1.cisco.com)
;; WHEN: Wed Mar  2 08:23:59 2011
;; MSG SIZE  rcvd: 33







--
There are only 10 types of people in this world -- those who  
understand binary arithmetic and those who don't.





Re: Help with unresolvable domain (subdomain, actually)

2011-03-02 Thread Kevin Darcy

On 3/2/2011 10:34 AM, David Sparro wrote:



On 3/1/2011 5:27 PM, Kevin Darcy wrote:

See my other post. This is designed-in behavior for Cisco GSSes, since
there is no "service unavailable, try again later" RCODE.



When the question is "what is the ip address of 'foo'" an answer of
"the web server is down" is nonsensical.


Hmmm... matter of perspective I suppose. Load-balancer architecture sees 
DNS as just the externally-visible portion of a whole subsystem. The 
SERVFAIL, in their view, does not communicate a DNS problem _per_se_, 
but a problem with the whole subsystem. It's more of a "what you're 
trying to get to is unavailable right now" message, communicated, in 
their view, _through_ DNS (as a sort of conduit), not necessarily 
_about_ DNS. They don't see it as specifically meaning "I've got a DNS 
problem".


I'm not saying I agree with this perspective, only that I've dealt with 
load-balancer vendors enough (Cisco in particular) to understand that 
this is where they're coming from.


Besides, what alternative is there? If the load-balancer returns an 
address that it knows to not be working, then it's purposely causing the 
client to go into a relatively-slow connection-timeout failure mode. Is 
that responsible behavior? If it gives a "normal" response that is 
lacking answer information (NODATA, NXDOMAIN), then this response gets 
negatively cached, and the negative cache entry may delay clients from 
re-trying the resource even after it recovers. So, what's left? NOTIMP? 
FORMERR? REFUSED? NOTAUTH? Those aren't any better than SERVFAIL from a 
strictly functional perspective, and are even more misleading and 
confusing with respect to the real source of the problem.
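
The caching argument above can be put as a small, deliberately simplified Python model. It assumes a resolver that caches positive answers for the record TTL, caches NXDOMAIN/NODATA for a negative TTL taken from the zone's SOA, and does not cache server failures at all. The function name and the 3600-second negative TTL are assumptions for illustration; the 20-second answer TTL is the one seen on tools.cisco.com in this thread.

```python
# Outcomes that a resolver negatively caches (NODATA is NOERROR with no answer).
NEGATIVE_OUTCOMES = {"NXDOMAIN", "NODATA"}

def cache_seconds(outcome, answer_ttl=0, negative_ttl=0):
    """Seconds a simplified resolver would hold on to a response.

    outcome      -- "NOERROR", "NODATA", "NXDOMAIN", "SERVFAIL", ...
    answer_ttl   -- TTL on the answer records (positive answers)
    negative_ttl -- negative-caching TTL derived from the zone's SOA
    """
    if outcome == "NOERROR":
        return answer_ttl
    if outcome in NEGATIVE_OUTCOMES:
        return negative_ttl
    # Assumption: failures such as SERVFAIL are not cached, so the very
    # next client query goes back to the authoritative server.
    return 0

for outcome in ("NOERROR", "NXDOMAIN", "SERVFAIL"):
    print(outcome, cache_seconds(outcome, answer_ttl=20, negative_ttl=3600))
```

Under these assumptions a negative answer can keep clients away for as long as the negative TTL, while SERVFAIL costs nothing in the cache but does cost a fresh query every time.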




- Kevin





RE: Help with unresolvable domain (subdomain, actually)

2011-03-02 Thread Mike Bernhardt
What's really strange is that when we attempt a query, be it DIG or an
attempt to browse tools.cisco.com, they send some sort of query back to us
from/to UDP 53. We drop it at the firewall due to some sort of "sanity
check" so I can't see the contents. This is in addition to the SERVFAIL
message.

Although I get SERVFAIL, Kloth.net does not, even if we DIG the same server:
cax01-bb14-dcz01n-gss1.cisco.com
From Kloth
; <<>> DiG 9.3.2 <<>> @cax01-bb14-dcz01n-gss1.cisco.com tools.cisco.com A
 ; (1 server found)
 ;; global options:  printcmd
 ;; Got answer:
 ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41388
 ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
 
 ;; QUESTION SECTION:
 ;tools.cisco.com.  IN  A
 
 ;; ANSWER SECTION:
 tools.cisco.com.   20  IN  A   72.163.4.38
 
 ;; Query time: 131 msec
 ;; SERVER: 173.37.144.100#53(173.37.144.100)
 ;; WHEN: Wed Mar  2 19:15:04 2011
 ;; MSG SIZE  rcvd: 49

From Us
[root@ns1 ~]# dig -b 148.165.3.10 @cax01-bb14-dcz01n-gss1.cisco.com
tools.cisco.com 

; <<>> DiG 9.4.3-P3 <<>> -b 148.165.3.10 @cax01-bb14-dcz01n-gss1.cisco.com
tools.cisco.com
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 26463
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;tools.cisco.com.   IN  A

;; Query time: 45 msec
;; SERVER: 173.37.144.100#53(173.37.144.100)
;; WHEN: Wed Mar  2 10:15:31 2011
;; MSG SIZE  rcvd: 33


So I wonder if the query they make is some kind of authentication attempt?




Re: Help with unresolvable domain (subdomain, actually)

2011-03-02 Thread David Sparro

On 3/2/2011 1:20 PM, Kevin Darcy wrote:


I'm not saying I agree with this perspective, only that I've dealt with
load-balancer vendors enough (Cisco in particular) to understand that
this is where they're coming from.

Besides, what alternative is there? If the load-balancer returns an
address that it knows to not be working, then it's purposely causing the
client to go into a relatively-slow connection-timeout failure mode. Is
that responsible behavior?


Short answer: yes.  The DNS side of the load-balancer doesn't know
why it got the query.  Maybe I was trying to ping the endpoint, I could 
have been trying to make an FTP connection, or HTTPS, etc.  In order for 
it to be consistent, it would have to be able to figure out that a 
SERVFAIL should be returned for the query from  my gopher:// connection, 
but an IP should be returned for http://.



If it gives a "normal" response that is
lacking answer information (NODATA, NXDOMAIN), then this response gets
negatively cached, and the negative cache entry may delay clients from
re-trying the resource even after it recovers. So, what's left? NOTIMP?
FORMERR? REFUSED? NOTAUTH? Those aren't any better than SERVFAIL from a
strictly functional perspective, and are even more misleading and
confusing with respect to the real source of the problem.


SERVFAIL caching is coming to a BIND server release this year.  (I 
listened to the BIND 9.8 features webinar this morning.  I don't 
remember which version (9.9 or 9.10) had this attached to it on the 
What's Next slide.)


--
Dave


Re: Help with unresolvable domain (subdomain, actually)

2011-03-02 Thread Warren Kumari


On Mar 2, 2011, at 1:20 PM, Kevin Darcy wrote:


On 3/2/2011 10:34 AM, David Sparro wrote:



On 3/1/2011 5:27 PM, Kevin Darcy wrote:
See my other post. This is designed-in behavior for Cisco GSSes,
since there is no "service unavailable, try again later" RCODE.



When the question is "what is the ip address of 'foo'" an answer of
"the web server is down" is nonsensical.


Hmmm... matter of perspective I suppose. Load-balancer architecture  
sees DNS as just the externally-visible portion of a whole  
subsystem. The SERVFAIL, in their view, does not communicate a DNS  
problem _per_se_, but a problem with the whole subsystem. It's more  
of a "what you're trying to get to is unavailable right now"  
message, communicated, in their view, _through_ DNS (as a sort of  
conduit), not necessarily _about_ DNS. They don't see it as  
specifically meaning "I've got a DNS problem".


But, everyone else *will*.



I'm not saying I agree with this perspective, only that I've dealt  
with load-balancer vendors enough (Cisco in particular) to  
understand that this is where they're coming from.


Besides, what alternative is there? If the load-balancer returns an  
address that it knows to not be working, then it's purposely causing  
the client to go into a relatively-slow connection-timeout failure  
mode. Is that responsible behavior? If it gives a "normal" response  
that is lacking answer information (NODATA, NXDOMAIN), then this  
response gets negatively cached, and the negative cache entry may  
delay clients from re-trying the resource even after it recovers.
So, what's left? NOTIMP? FORMERR? REFUSED? NOTAUTH? Those aren't any  
better than SERVFAIL from a strictly functional perspective, and are  
even more misleading and confusing with respect to the real source  
of the problem.


A few options:
1: once the LB knows that all back-ends are down, it can continue to  
answer with the correct A, but drop the TTL to be much shorter -- this  
allows things to recover faster.
2: have the LB itself serve a 'sorry' page -- the ability to serve  
static content locally should be simple, but if it is not able to do so
it can always return a set of 'sorry' servers optimized for this  
purpose.
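
The two options above could be sketched roughly as follows, in Python. This is not how a GSS or any other product works; the function name, the SHORT_TTL value and the 192.0.2.x / 198.51.100.x documentation addresses are all assumptions chosen for illustration.

```python
NORMAL_TTL = 20   # TTL seen on tools.cisco.com answers in this thread
SHORT_TTL = 5     # assumed "much shorter" TTL; the value is arbitrary

def choose_answer(backends, healthy, sorry_servers=()):
    """Pick (addresses, ttl) for an A query.

    backends      -- all configured backend addresses
    healthy       -- backends currently passing health checks
    sorry_servers -- optional addresses that serve a static 'sorry' page
    """
    up = [b for b in backends if b in healthy]
    if up:
        return up, NORMAL_TTL
    if sorry_servers:
        # Option 2: point clients at servers that can explain the outage.
        return list(sorry_servers), SHORT_TTL
    # Option 1: keep answering with the real addresses, but with a short
    # TTL so clients re-query soon after the backends recover.
    return list(backends), SHORT_TTL

backends = ["192.0.2.10", "192.0.2.11"]
print(choose_answer(backends, healthy={"192.0.2.11"}))
print(choose_answer(backends, healthy=set()))
print(choose_answer(backends, healthy=set(), sorry_servers=["198.51.100.1"]))
```

Either way the resolver gets a normal NOERROR answer, so nobody is sent off to debug DNS for what is really a backend outage.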


You shouldn't be breaking both your serving *and* 'sorry' backends  
often enough for there to be special handling needed (and, if you are,  
you shouldn't make things worse by making other folk waste their time  
debugging your problem).


W





   - Kevin





--
I had no shoes and wept.  Then I met a man who had no feet.  So I  
said, "Hey man, got any shoes you're not using?"





Re: Help with unresolvable domain (subdomain, actually)

2011-03-02 Thread Kevin Darcy

On 3/1/2011 6:30 PM, Mark Andrews wrote:

In message<4d6d7268.1080...@chrysler.com>, Kevin Darcy writes:

I got a trouble ticket on this too.

  From the looks of things, Cisco is using GSSes to load-balance this
site. GSSes return SERVFAIL if all of the resources behind the
load-balancer are down (which it determines via a heartbeat mechanism).
So I think this is a "simple" case of a website (or cluster) going down.
It was down earlier today, then up again, as of this writing, it is down
again.

DNS doesn't really have a response code of "requested resource not
available", so SERVFAIL is Cisco's closest approximation. It has the
drawback, however, of often making other sorts of problems appear to be
DNS problems. That's just a cross that we DNS admins have to bear...

  - Kevin

Then the load balancer should return default records or 0.0.0.0/:: to
indicate the name is good but doesn't currently have an address.

I like that solution, actually. Even if the client doesn't recognize it
as a "special" address, hopefully if it tries to connect to it, the
packet won't make it past the first router or switch hop...


Has anyone proposed this to the load-balancer vendors?


- Kevin




RE: Help with unresolvable domain (subdomain, actually)

2011-03-02 Thread Mike Bernhardt
> A few options:
>1: once the LB knows that all back-ends are down, it can continue to answer
>with the correct A, but drop the TTL to be much shorter -- this allows
>things to recover faster.

This would work well because the actual web site wasn't down, at least not
yesterday. If I substituted the IP address for the domain name, it was
reachable and links maintained the domain portion of the URL in dotted
decimal format. It seems only DNS is hosed.



Re: Help with unresolvable domain (subdomain, actually)

2011-03-02 Thread Warren Kumari


On Mar 2, 2011, at 1:21 PM, Mike Bernhardt wrote:


What's really strange is that when we attempt a query, be it DIG or an
attempt to browse tools.cisco.com, they send some sort of query back to us
from/to UDP 53

Many GSLB solutions attempt to figure out what the best location to
serve from is by sending a query to the server that just queried
*them* -- this allows them to figure out latency and decide which
cluster might be closest.
I suspect (although I avoid Cisco LB like the plague and so am
not sure) that this is the cause.



The other possibility -- when I ran tcpdump to see what the query
might be, I found that I was getting a FormErr response to my
initial query, causing me to requery without DNSSEC / EDNS0 -- maybe
you are actually not seeing a query from them; maybe it's a FormErr
response that your FW is noting?


W

wkumari@vimes:~/src/perl/IODEF$ dig +edns=0 tools.cisco.com  
@128.107.227.197


; <<>> DiG 9.7.2-P3 <<>> +edns=0 tools.cisco.com @128.107.227.197
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 41568
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;tools.cisco.com.   IN  A

;; Query time: 75 msec
;; SERVER: 128.107.227.197#53(128.107.227.197)
;; WHEN: Wed Mar  2 14:17:38 2011
;; MSG SIZE  rcvd: 33

wkumari@vimes:~/src/perl/IODEF$ dig  tools.cisco.com @128.107.227.197

; <<>> DiG 9.7.2-P3 <<>> tools.cisco.com @128.107.227.197
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54960
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;tools.cisco.com.   IN  A

;; ANSWER SECTION:
tools.cisco.com. 20  IN  A   173.37.145.8

;; Query time: 75 msec
;; SERVER: 128.107.227.197#53(128.107.227.197)
;; WHEN: Wed Mar  2 14:17:45 2011
;; MSG SIZE  rcvd: 49
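
The FORMERR-then-retry pattern in the two dig runs above can be sketched in Python as follows. It builds the same A query with and without an EDNS0 OPT record in the additional section, and retries without EDNS0 when the first attempt comes back FORMERR. The `send` callback and the `picky_server` stand-in are hypothetical; no real network traffic is involved, and this is not a description of how named implements its fallback.

```python
import struct

FORMERR = 1

def build_query(msg_id, qname, edns=False, udp_size=4096):
    """Build an A/IN query; with edns=True, append an EDNS0 OPT record."""
    name = b"".join(bytes([len(label)]) + label.encode("ascii")
                    for label in qname.rstrip(".").split(".")) + b"\x00"
    arcount = 1 if edns else 0
    header = struct.pack("!HHHHHH", msg_id, 0x0100, 1, 0, 0, arcount)  # RD set
    msg = header + name + struct.pack("!HH", 1, 1)
    if edns:
        # OPT RR: root name, TYPE=41, CLASS=UDP payload size, TTL=0, RDLEN=0
        msg += b"\x00" + struct.pack("!HHIH", 41, udp_size, 0, 0)
    return msg

def resolve(qname, send):
    """Try with EDNS0 first; on FORMERR, retry once without it.

    `send` is a caller-supplied function taking query bytes and returning
    the reply's RCODE as an int (a stand-in for real network I/O).
    """
    rcode = send(build_query(1, qname, edns=True))
    if rcode == FORMERR:
        rcode = send(build_query(2, qname, edns=False))
    return rcode

def picky_server(query):
    """Fake server that rejects any query carrying an additional record."""
    (arcount,) = struct.unpack("!H", query[10:12])
    return FORMERR if arcount else 0

print(len(build_query(1, "tools.cisco.com")),
      len(build_query(1, "tools.cisco.com", edns=True)),
      resolve("tools.cisco.com", picky_server))
```

The OPT record adds 11 bytes to the 33-byte plain query. A firewall that inspects DNS payloads sees two different-looking exchanges here, which may be relevant to what the sanity check is flagging.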






. We drop it at the firewall due to some sort of "sanity
check" so I can't see the contents. This is in addition to the  
SERVFAIL

message.

Although I get SERVFAIL, Kloth.net does not, even if we DIG the same  
server:

cax01-bb14-dcz01n-gss1.cisco.com

From Kloth
; <<>> DiG 9.3.2 <<>> @cax01-bb14-dcz01n-gss1.cisco.com  
tools.cisco.com A

; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41388
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;tools.cisco.com.   IN  A

;; ANSWER SECTION:
tools.cisco.com.20  IN  A   72.163.4.38

;; Query time: 131 msec
;; SERVER: 173.37.144.100#53(173.37.144.100)
;; WHEN: Wed Mar  2 19:15:04 2011
;; MSG SIZE  rcvd: 49


From Us

[root@ns1 ~]# dig -b 148.165.3.10 @cax01-bb14-dcz01n-gss1.cisco.com
tools.cisco.com

; <<>> DiG 9.4.3-P3 <<>> -b 148.165.3.10 @cax01-bb14-dcz01n- 
gss1.cisco.com

tools.cisco.com
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 26463
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;tools.cisco.com.   IN  A

;; Query time: 45 msec
;; SERVER: 173.37.144.100#53(173.37.144.100)
;; WHEN: Wed Mar  2 10:15:31 2011
;; MSG SIZE  rcvd: 33


So I wonder if the query they make is some kind of authentication  
attempt?



-Original Message-
From: Mark Andrews [mailto:ma...@isc.org]
Sent: Tuesday, March 01, 2011 3:31 PM
To: Kevin Darcy
Cc: bind-us...@isc.org
Subject: Re: Help with unresolvable domain (subdomain, actually)


In message <4d6d7268.1080...@chrysler.com>, Kevin Darcy writes:

I got a trouble ticket on this too.

From the looks of things, Cisco is using GSSes to load-balance this
site. GSSes return SERVFAIL if all of the resources behind the
load-balancer are down (which they determine via a heartbeat mechanism).
So I think this is a "simple" case of a website (or cluster) going down.
It was down earlier today, then up again; as of this writing, it is down
again.

DNS doesn't really have a response code of "requested resource not
available", so SERVFAIL is Cisco's closest approximation. It has the
drawback, however, of often making other sorts of problems appear to be
DNS problems. That's just a cross that we DNS admins have to bear...

- Kevin


Then the load balancer should return default records or 0.0.0.0/:: to
indicate the name is good but doesn't currently have an address.

Mark
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Austra

Re: Help with unresolvable domain (subdomain, actually)

2011-03-02 Thread Kevin Darcy

On 3/2/2011 1:57 PM, David Sparro wrote:

On 3/2/2011 1:20 PM, Kevin Darcy wrote:


I'm not saying I agree with this perspective, only that I've dealt with
load-balancer vendors enough (Cisco in particular) to understand that
this is where they're coming from.

Besides, what alternative is there? If the load-balancer returns an
address that it knows to not be working, then it's purposely causing the
client to go into a relatively-slow connection-timeout failure mode. Is
that responsible behavior?


Short answer: yes.  The DNS side of the load-balancer doesn't know
why it got the query.  Maybe I was trying to ping the endpoint, I
could have been trying to make an FTP connection, or HTTPS, etc.  In
order for it to be consistent, it would have to be able to figure out
that a SERVFAIL should be returned for the query from my gopher://
connection, but an IP should be returned for http://.

That's an implementation decision. If an implementor decides to run a
bunch of disparate services under a single FQDN (as opposed to, say, 
www.example.com/ftp.example.com/gopher.example.com and so forth), then 
they'd need to come up with a reasonable way with their load-balancer 
keepalives to decide when the whole thing is "down" or not. If the vast 
majority of their traffic is web-based (typical), they may choose to 
call the whole thing "down" if the web part is down, and the other parts 
(FTP, gopher, whatever) will just have to suffer. That's the price to be 
paid for the convenience of having a single name for a bunch of 
different services -- lack of granularity.


Things would be better, of course, if clients used SRV records for 
accessing resources -- then a single "service" name could be 
differentiated by protocol. But for whatever reason client software 
authors have not, by and large, embraced this idea.
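
To make the SRV idea concrete, here is a small Python sketch of how a client
could tell services apart under one domain. It is an illustration only: the
record values, the example.com names, and the helper names srv_owner_name and
preferred_targets are all assumptions, and it covers only the priority rule
from RFC 2782 (lowest priority first), not weighted selection within a
priority group.

```python
from collections import namedtuple

# Fields of an SRV record's data, in the order RFC 2782 defines them.
SrvRecord = namedtuple("SrvRecord", "priority weight port target")


def srv_owner_name(service, proto, domain):
    """Build the name a client would query, e.g. _http._tcp.example.com."""
    return f"_{service}._{proto}.{domain}"


def preferred_targets(records):
    """Return the records in the lowest-numbered priority group."""
    if not records:
        return []
    best = min(r.priority for r in records)
    return [r for r in records if r.priority == best]


# Hypothetical answer set for _http._tcp.example.com.
records = [
    SrvRecord(10, 60, 80, "web1.example.com."),
    SrvRecord(10, 40, 80, "web2.example.com."),
    SrvRecord(20, 0, 80, "backup.example.com."),
]

print(srv_owner_name("http", "tcp", "example.com"))
print([r.target for r in preferred_targets(records)])
```

With names like these, a load-balancer could mark the HTTP service down
without affecting lookups for FTP or gopher under the same domain.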



If it gives a "normal" response that is
lacking answer information (NODATA, NXDOMAIN), then this response gets
negatively cached, and the negative cache entry may delay clients from
re-trying the resource even after it recovers. So, what's left? NOTIMP?
FORMERR? REFUSED? NOTAUTH? Those aren't any better than SERVFAIL from a
strictly functional perspective, and are even more misleading and
confusing with respect to the real source of the problem.
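
The negative-caching concern can be sketched in a few lines of Python. This
is a toy model, not how any particular resolver is implemented: the class
name NegativeCache, the example names, and the SOA values are made up. It
assumes the RFC 2308 rule that the negative TTL is the smaller of the SOA
record's TTL and its MINIMUM field, and it assumes a resolver that does not
cache SERVFAIL at all.

```python
class NegativeCache:
    """Toy negative cache: remembers NXDOMAIN/NODATA, ignores SERVFAIL."""

    def __init__(self):
        self._expiry = {}

    def store(self, name, rcode, soa_ttl, soa_minimum, now):
        # Only "normal" empty responses are cached in this sketch.
        if rcode in ("NXDOMAIN", "NODATA"):
            # RFC 2308: negative TTL = min(SOA TTL, SOA MINIMUM).
            self._expiry[name] = now + min(soa_ttl, soa_minimum)

    def is_blocked(self, name, now):
        """True while a cached negative answer suppresses a fresh query."""
        expiry = self._expiry.get(name)
        return expiry is not None and now < expiry


cache = NegativeCache()
cache.store("tools.example.com", "NXDOMAIN", soa_ttl=3600, soa_minimum=300, now=0)
cache.store("other.example.com", "SERVFAIL", soa_ttl=3600, soa_minimum=300, now=0)

print(cache.is_blocked("tools.example.com", now=299))  # still negatively cached
print(cache.is_blocked("tools.example.com", now=300))  # entry has expired
print(cache.is_blocked("other.example.com", now=1))    # SERVFAIL was never cached
```

In this model a client behind the cache cannot see the NXDOMAIN name recover
for up to 300 seconds, while the SERVFAIL name is retried on the next query.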


SERVFAIL caching is coming to a BIND server release this year.  (I 
listened to the BIND 9.8 features webinar this morning.  I don't 
remember which version (9.9 or 9.10) had this attached to it on the 
What's Next slide.)


I think Mark has the right approach: return a "special" address (e.g. 
0.0.0.0 or the IPv6 equivalent) in this situation, instead of messing 
with the RCODE.
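
A client that wanted to recognize the "special" address could do so with
Python's standard ipaddress module. This is only a sketch of the idea: the
function names usable_addresses and describe are invented here, and no
existing client is known to treat 0.0.0.0 or :: this way.

```python
import ipaddress


def usable_addresses(answers):
    """Drop the unspecified addresses (0.0.0.0 and ::) from A/AAAA answers."""
    return [a for a in answers if not ipaddress.ip_address(a).is_unspecified]


def describe(answers):
    """Classify an answer set the way the proposal in this thread suggests."""
    if not answers:
        return "no answer"
    if not usable_addresses(answers):
        return "name exists, no address right now"
    return "connect"


print(describe(["0.0.0.0"]))
print(describe(["::"]))
print(describe(["72.163.4.38"]))
```

The RCODE stays NOERROR in this scheme, so the "down" state travels as
ordinary record data with the TTL of the record rather than as a server
failure.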




- Kevin



___
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Help with unresolvable domain (subdomain, actually)

2011-03-04 Thread John Wobus

>> Then the load balancer should return default records or 0.0.0.0/:: to
>> indicate the name is good but doesn't currently have an address.
>
> I like that solution, actually. Even if the client doesn't recognize it
> as a "special" address, hopefully if it tries to connect to it, the
> packet won't make it past the first router or switch hop...
>
> Has anyone proposed this to the load-balancer vendors?


Isn't this just a specific instance of configuring a load balancer's
fallback address?  E.g., when server A and B are both down, give the
address of server C.  Some load balancers allow configuration of a
server D to be used only if C is down as well.  Address C or D could be
configured to be 0.0.0.0 and configured with no test for "up-ness".
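
The fallback chain described above might look like this in Python. It is a
sketch under assumptions: the function name pick_answer, the tier layout, and
the 192.0.2.x documentation addresses are invented for illustration and do
not reflect any vendor's configuration model.

```python
def pick_answer(tiers, is_up):
    """Return the addresses of the first usable tier.

    tiers is a list of (addresses, health_checked) pairs in preference
    order. A tier with health_checked=False is returned unconditionally,
    which models a last-resort address with no test for "up-ness".
    """
    for addresses, health_checked in tiers:
        if not health_checked:
            return list(addresses)
        alive = [a for a in addresses if is_up(a)]
        if alive:
            return alive
    return []


# Hypothetical setup: A and B first, then C, then D, then 0.0.0.0.
tiers = [
    (["192.0.2.1", "192.0.2.2"], True),   # servers A and B
    (["192.0.2.3"], True),                # server C
    (["192.0.2.4"], True),                # server D
    (["0.0.0.0"], False),                 # last resort, never health-checked
]

print(pick_answer(tiers, is_up=lambda addr: False))
print(pick_answer(tiers, is_up=lambda addr: addr == "192.0.2.3"))
```

When every health check fails, the answer is 0.0.0.0 rather than SERVFAIL;
when only C is up, the answer is C's address.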

(Not that I'm completely happy with 0.0.0.0 or any other address that
local folks could conceivably have figured out some crazy use for.)

John


Re: Help with unresolvable domain (subdomain, actually)

2011-03-04 Thread Mark Andrews

In message , John Wobus writes:
> >> Then the load balancer should return default records or 0.0.0.0/:: to
> >> indicate the name is good but doesn't currently have an address.
> > I like that solution, actually. Even if the client doesn't recognize it
> > as a "special" address, hopefully if it tries to connect to it, the
> > packet won't make it past the first router or switch hop...
> >
> > Has anyone proposed this to the load-balancer vendors?
> 
> Isn't this just a specific instance of configuring a load balancer's
> fallback address?  E.g., when server A and B are both down, give
> address of server C.  Some load balancers allow configuration of a
> server D to be used only if C is down as well.  Address C or D could be
> configured to be 0.0.0.0 and configured with no test for "up-ness".
> 
> (Not that I'm completely happy with 0.0.0.0 or any other address that
> local folks could conceivably have figured out some crazy use for.)

0.0.0.0 means "I don't know my address".  If you see packets on the
wire with a source address of 0.0.0.0, which you do at boot time, the
machine that sent them doesn't know its IP address yet.
 
> John
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org