Re: Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-13 Thread Jonathan Kamens

On 07/13/2011 02:13 AM, Mark Andrews wrote:

No.  The fix is to correct the nameservers.  They are not correctly
following the DNS protocol and everything else is a fall out from
that.

You're right that everything else is fallout from that.

But that doesn't do me much good, does it? It's my system that keeps 
getting bogus name resolution errors. It's my RSS feed reader that keeps 
failing on an hourly basis when the cached records for en.wikipedia.org 
expire. It's all very well and good to say that the Wikipedia folks and 
other people with this problem should fix their nameservers -- I totally 
agree with that -- but it doesn't help me solve my problem /now/.


I'm a real user in the real world with a real problem. Yelling at 
Wikipedia to fix their DNS servers may feel good, but it doesn't make my 
DNS work. As far as I and all the other users who are being impacted 
/now/ by this problem are concerned, it's just pissing into the wind.

Well, all the prodding from people here prompted me to investigate
further exactly what's going on. The problem isn't what I thought it
was. It appears to be a bug in glibc, and I've filed a bug report and
found a workaround.

There is no bug in glibc.

To be blunt, that's bullshit.

If glibc makes an A query and an AAAA query, and it gets back a valid
response to the A query and an invalid response to the AAAA query, then
it should ignore the invalid response to the AAAA query and return the
valid A response to the user as the IP address for the host.


Please note, furthermore, that as I explained in detail in my bug report 
and in my last message, glibc behaves differently based on the /order/ 
in which the two responses are returned by the DNS server. Since there's 
nothing that says a DNS server has to respond to two queries in the 
order in which they were received, and that would be an impossible 
requirement to impose in any case, since the queries and responses are 
sent via UDP, which doesn't guarantee order, it's perfectly clear that
glibc needs to be prepared to function the same regardless of the order 
in which it receives the responses.
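
To make the point concrete, here is a minimal Python sketch of order-independent handling. It is not glibc's actual code: the `Response` type, the `merge_responses` name, the rcode strings and the 192.0.2.10 documentation address are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Response:
    """One DNS answer as seen by the stub resolver (illustrative type)."""
    qtype: str                      # "A" or "AAAA"
    rcode: str                      # e.g. "NOERROR", "SERVFAIL", "FORMERR"
    addresses: List[str] = field(default_factory=list)


def merge_responses(responses):
    """Collect addresses from every usable response, in any arrival order.

    A failed response for one query type is ignored as long as the other
    query type produced at least one address.
    """
    addresses = []
    for resp in responses:
        if resp.rcode == "NOERROR":
            addresses.extend(resp.addresses)
    if not addresses:
        raise LookupError("no usable response for any address family")
    return sorted(addresses)


good_a = Response("A", "NOERROR", ["192.0.2.10"])
bad_aaaa = Response("AAAA", "SERVFAIL")

# The result is the same whichever response arrives first.
assert merge_responses([good_a, bad_aaaa]) == merge_responses([bad_aaaa, good_a])
```

The lookup fails only when no response at all carries an address, which is the behavior argued for above.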


What's more, there's plenty of code in the glibc files I spent hours 
poring over which is clearly an attempt to do exactly that. The people 
who wrote the code just got it wrong. Which isn't surprising, given how 
god-awful the code is.


This is not an either/or situation. The broken nameservers should be 
fixed, /and/ glibc should be fixed to properly handle the case of when 
it sends two queries and gets back one valid response and one server 
error in reverse order.

In a nutshell, the getaddrinfo function in glibc sends both A and AAAA
queries to the DNS server at the same time and then deals with the
responses as they come in. Unfortunately, if the responses to the two
queries come back in reverse order, /and/ the first one to come back is
a server failure, both of which are the case when you try to resolve
en.wikipedia.org immediately after restarting your DNS server so nothing
is cached, the glibc code screws up and decides it didn't get back a
successful response even though it did.

There is *nothing* wrong with sending both queries at once.

I didn't say there was. You really don't seem to be paying very good 
attention.


Do you understand what the word /workaround/ means?

Note your fix won't help clients that only ask for AAAA records
because it is the authoritative servers that are broken, not the
resolver library or the recursive server.

I am aware of that. It is irrelevant, because it is not the problem I am 
trying to solve. I, and 99.99% of the users in the world, are /not/ 
"only ask[ing] for AAAA records". Nobody actually trying to use the 
internet for day-to-day work is doing that right now, because to say 
that IPv6 support is not yet ubiquitous would be a laughably momentous 
understatement.


You seem to have a really big chip on your shoulder about people who run 
broken DNS servers. I don't like them any more than you do. But I 
learned "Be generous in what you accept and conservative in what you 
generate" way back when I started playing with the Internet well over 
two decades ago. It holds up now as well as it did back then, and 
there's no good reason why it shouldn't apply in this case.


It's clear that this is a religious issue for you. I'm not here to 
debate religion, I'm here to get help making my DNS work, and to help 
other people, to whatever extent I can, make /their/ DNS work. If you 
continue to send religious screeds on this topic while making no effort 
to actually read and understand what I write, please do not expect me to 
respond further.


  Jonathan Kamens



smime.p7s
Description: S/MIME Cryptographic Signature
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users

RE: Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-13 Thread Jonathan Kamens
I agree that the order of the A/AAAA responses shouldn't matter to the
result. The whole getaddrinfo() call should fail regardless of whether the
failure is seen first or the valid response is seen first. Why? Because
getaddrinfo() should, if it isn't already, be using the RFC 3484 algorithm
(and/or whatever the successor to RFC 3484 ends up being) to sort the
addresses, and for that algorithm to work, one needs *both* the IPv4
address(es) *and* the IPv6 address(es) available, in order to compare their
scopes, prefixes, etc..

 

RFC 3484 tells you how to sort addresses you've got.

 

If you've only got one address, then bang! It's already sorted for you. You
don't need RFC 3484 to tell you how to sort it.

 

I have to say that some of the people on this list seem completely detached
from what real users in the real world want their computers to do.

 

If I am trying to connect to a site on the internet, then I want my computer
to do its best to try to connect to the site. I don't want it to throw up
its hands and say, "Oh, I'm sorry, one of my address lookups failed, so I'm
not going to let you use the other address lookup, the one that succeeded,
because some RFC somewhere could be interpreted as implying that's a bad
idea, if I wanted to do so." Please, that's ridiculous.

 

If one of the lookups fails, and this failure is presented to the RFC 3484
algorithm as NODATA for a particular address family, then the algorithm
could make a bad selection of the destination address, and this can lead to
other sorts of breakage, e.g. trying to use a tunneled connection where no
tunnel exists.

 

If the address the client gets doesn't work, then the address doesn't work.
How is being unable to connect because the address turned out to not be
routable different from being unable to connect because the computer refused
to even try?



Another possibility you're not considering is that the invoking application
itself may make independent IPv4-specific and IPv6-specific getaddrinfo()
lookups. Why would it do this? Why not? Maybe IPv6 capability is something
the user has to buy a separate license for, so the IPv6 part is a slightly
separate codepath, added in a later version, than the base product, which is
IPv4-only. When one of the getaddrinfo() calls returns address records and
the other returns garbage, your fix doesn't prevent such an application
from doing something unpredictable, possibly catastrophic. So it's really
not a general solution to the problem.

 

I have no idea what you're talking about. If the application makes
independent IPv4 and IPv6 getaddrinfo() lookups, then the change I'm
proposing to glibc is completely irrelevant and does not impact the existing
functionality in any way. The IPv4 lookup will succeed, the IPv6 lookup will
fail, and the application is then free to decide what to do.
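
As a rough illustration of that case, the following Python sketch does one lookup per address family and keeps the two outcomes separate. The function name `lookup_per_family` is hypothetical, and Python's `socket.getaddrinfo()` is only standing in for the C call under discussion.

```python
import socket


def lookup_per_family(host):
    """Resolve `host` once per address family, keeping the outcomes separate."""
    results = {}
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
            # info[4] is the sockaddr tuple; its first element is the address.
            results[family] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            # This family failed; the other family's result is unaffected.
            results[family] = None
    return results
```

A failure in one family shows up as `None` for that family only, so the caller can still use whatever the other lookup returned.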

 

In summary, getaddrinfo() with AF_UNSPEC has a very clear meaning: "Give me
whatever addresses you can." The man page says, and I am quoting, "The value
AF_UNSPEC indicates that getaddrinfo() should return socket addresses for
any address family (either IPv4 or IPv6, for example) that can be used with
node and service." I don't see how the language could be any more clear. To
suggest that it's reasonable and correct for it to refuse to return a
successfully fetched address is simply ludicrous.
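
For reference, a small Python sketch of an AF_UNSPEC lookup. The helper name `resolve_any` is made up for this example; the underlying `socket.getaddrinfo()` call is the standard library wrapper around the C function.

```python
import socket


def resolve_any(host, port):
    """Return (family_name, address) pairs for every family that resolves."""
    infos = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    return [(socket.AddressFamily(family).name, sockaddr[0])
            for family, _type, _proto, _canon, sockaddr in infos]
```

Called with a name that has both A and AAAA records, this would be expected to return entries for both families; with only one family available, it returns just that one.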

 

I hope and pray that people who maintain the glibc code have more common
sense about what users want and expect from their software.

 

In the meantime, it's clear that I don't belong on this mailing list, so I'm
out of here.

 

  Jonathan Kamens

 


Re: Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-12 Thread Jonathan Kamens
Well, all the prodding from people here prompted me to investigate 
further exactly what's going on. The problem isn't what I thought it 
was. It appears to be a bug in glibc, and I've filed a bug report and 
found a workaround.


In a nutshell, the getaddrinfo function in glibc sends both A and AAAA 
queries to the DNS server at the same time and then deals with the 
responses as they come in. Unfortunately, if the responses to the two 
queries come back in reverse order, /and/ the first one to come back is 
a server failure, both of which are the case when you try to resolve 
en.wikipedia.org immediately after restarting your DNS server so nothing 
is cached, the glibc code screws up and decides it didn't get back a 
successful response even though it did.


If you do the same lookup again, it works, because the CNAME that was 
sent in response to the A query is cached, so both the A and AAAA 
queries get back valid responses from the DNS server. And even if that 
weren't the case, since the CNAME is cached it gets returned first, 
since the server doesn't need to do a query to get it, whereas it does 
need to do another query to get the AAAA record (which recall isn't 
being cached because of the previously discussed FORMERR problem). It'll 
keep working until the cached records time out, at which point it'll 
happen again, and then be OK again until the records time out, etc.


The workaround is to put "options single-request" in /etc/resolv.conf to 
prevent the glibc innards from sending out both the A and AAAA queries 
at the same time.
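
A quick way to check whether the workaround is in place is sketched below in Python. The function name `has_single_request` and the sample file contents (including the 127.0.0.1 nameserver) are assumptions for illustration, and the comment handling is deliberately simplified.

```python
def has_single_request(resolv_conf_text):
    """Return True if an `options` line enables single-request."""
    for line in resolv_conf_text.splitlines():
        # Simplification: drop anything after a '#' or ';' comment character.
        line = line.split("#", 1)[0].split(";", 1)[0]
        words = line.split()
        if words and words[0] == "options" and "single-request" in words[1:]:
            return True
    return False


# Hypothetical file contents with the workaround applied.
sample = """\
nameserver 127.0.0.1
options single-request
"""
assert has_single_request(sample)
```

To check a real system, pass in the text read from /etc/resolv.conf.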


FYI, here's the glibc bug I filed about this:

http://sourceware.org/bugzilla/show_bug.cgi?id=12994

Thank you for telling me I was full of it and making me dig deeper into 
this until I located the actual cause of the issue. :-)


  jik


Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-11 Thread Jonathan Kamens
The number of DNS queries required for each address lookup requested by 
a client has gone up considerably because of IPV6. The problem is being 
exacerbated by the fact that many DNS servers on the net don't yet 
support IPV6 queries. The result is that address lookups are frequently 
taking so long that the client gives up before getting the result.


The example I am seeing this with most frequently is my RSS feed reader, 
rss2email, trying to read a feed from en.wikipedia.org in a cron job 
that runs every 15 minutes. I am regularly seeing this in the output of 
the cron job:


   W: Name or service not known [8]
   http://en.wikipedia.org/w/index.php?title=[elided]&feed=atom&action=history

The wikipedia.org domain has three DNS servers. Let's assume that the 
root and org. nameservers are cached already when rss2email does its 
query. If so, then it has to do the following queries:


   wikipedia.org DNS
   en.wikipedia.org AAAA
   en.wikipedia.org A

This is fine when the wikipedia.org nameservers are working, but let's 
postulate for the moment that two of them are down, unreachable, or 
responding slowly, which apparently happens pretty often. Then we end up 
doing:


   wikipedia.org DNS
   en.wikipedia.org AAAA /times out/
   en.wikipedia.org AAAA /times out/
   en.wikipedia.org AAAA
   en.wikipedia.org A /times out/
   en.wikipedia.org A /times out/
   en.wikipedia.org A

By the end of that sequence, the typical 30-second DNS request 
timeout has been exceeded, and the client gives up.
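
The arithmetic can be sketched in Python as follows. The 8-second per-attempt timeout is purely an assumed figure for illustration, not a measured or documented value; only the 30-second client timeout comes from the description above.

```python
CLIENT_TIMEOUT = 30          # seconds, the typical client timeout mentioned above
PER_ATTEMPT_TIMEOUT = 8      # seconds, ASSUMED for illustration only


def total_wait(attempts, per_attempt_timeout):
    """Sum the time lost to timed-out attempts.

    Attempts that get an answer are treated as taking no time.
    """
    return sum(per_attempt_timeout for _query, timed_out in attempts if timed_out)


# The failure sequence from the list above: (query, timed_out)
sequence = [
    ("wikipedia.org DNS", False),
    ("en.wikipedia.org AAAA", True),
    ("en.wikipedia.org AAAA", True),
    ("en.wikipedia.org AAAA", False),
    ("en.wikipedia.org A", True),
    ("en.wikipedia.org A", True),
    ("en.wikipedia.org A", False),
]

waited = total_wait(sequence, PER_ATTEMPT_TIMEOUT)
print(waited, waited > CLIENT_TIMEOUT)
```

Under that assumption, four timed-out attempts alone add up to more than the client is willing to wait.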


I said above that the problem is exacerbated by the fact that many DNS 
servers don't yet support IPV6 queries. This is because the AAAA queries 
don't get NXDOMAIN responses, which would be cached, but rather FORMERR 
responses, which are not cached. As a result, the scenario described 
above happens much more frequently because the DNS server has to redo 
the AAAA queries often.


One suggestion that I've seen on the net for how to mitigate this 
problem is to treat FORMERR responses as negative and cache them just 
like NXDOMAIN responses are cached. I took a look at the bind code in 
resolver.c briefly to see how easy it would be to do this, and 
although it doesn't look like it would be particularly difficult, I 
don't feel like I know the ins and outs of the DNS protocol and BIND 
implementation enough to be confident that I'd get it right.
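
To show what the suggestion amounts to, here is a toy Python model of a caching resolver. It is not BIND's code: `ToyCache`, `broken_server` and the rcode sets are hypothetical, and TTL expiry is left out for brevity.

```python
class ToyCache:
    """Toy caching resolver that remembers only selected response codes."""

    def __init__(self, upstream, cacheable_rcodes):
        self.upstream = upstream              # callable: (name, qtype) -> rcode
        self.cacheable = set(cacheable_rcodes)
        self.cache = {}
        self.upstream_queries = 0

    def lookup(self, name, qtype):
        key = (name, qtype)
        if key in self.cache:
            return self.cache[key]
        self.upstream_queries += 1
        rcode = self.upstream(name, qtype)
        if rcode in self.cacheable:
            self.cache[key] = rcode
        return rcode


def broken_server(name, qtype):
    """Hypothetical authoritative server that mishandles AAAA queries."""
    return "FORMERR" if qtype == "AAAA" else "NOERROR"


current = ToyCache(broken_server, {"NOERROR", "NXDOMAIN"})
proposed = ToyCache(broken_server, {"NOERROR", "NXDOMAIN", "FORMERR"})
for _ in range(3):
    current.lookup("en.wikipedia.org", "AAAA")
    proposed.lookup("en.wikipedia.org", "AAAA")
print(current.upstream_queries, proposed.upstream_queries)
```

With FORMERR left uncached, every client lookup triggers a fresh upstream query; caching it collapses the repeats into one.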


I'm interested to hear if other people are encountering this problem and 
if the developers who work on BIND have any thoughts about how to 
mitigate it, short of getting everyone on the internet to upgrade to 
nameservers that support IPV6.


Thanks,

Jonathan Kamens




Re: Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-11 Thread Jonathan Kamens

On 7/11/2011 3:10 PM, Tony Finch wrote:

Jonathan Kamens <j...@kamens.us> wrote:

I said above that the problem is exacerbated by the fact that many DNS servers
don't yet support IPV6 queries. This is because the AAAA queries don't get
NXDOMAIN responses, which would be cached, but rather FORMERR responses, which
are not cached. As a result, the scenario described above happens much more
frequently because the DNS server has to redo the AAAA queries often.

Your upstream resolver is broken if it returns FORMERR responses to AAAA
queries. The behaviour you describe is not normal.

There are people reporting all over the net that they're getting tons of 
messages like this in their logs with recent BIND versions:


Jul 11 12:00:06 jik2 named[31354]: error (FORMERR) resolving 'en.wikipedia.org/AAAA/IN': 208.80.152.130#53


I've got 397 of them in my logs for just the last 24 hours.
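
Counts like that can be pulled from syslog with a few lines of Python. This is only a sketch: the regular expression is derived from the single log line shown above, and the names `FORMERR_RE` and `count_formerr` are made up for the example.

```python
import re
from collections import Counter

# Matches the name and query type in lines like the one quoted above.
FORMERR_RE = re.compile(r"error \(FORMERR\) resolving '([^/']+)/([^/']*)/IN'")


def count_formerr(lines):
    """Count FORMERR log entries per (name, query type)."""
    counts = Counter()
    for line in lines:
        match = FORMERR_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "Jul 11 12:00:06 jik2 named[31354]: error (FORMERR) resolving "
    "'en.wikipedia.org/AAAA/IN': 208.80.152.130#53",
]
print(count_formerr(sample))
```

Feeding it the lines of the real log file gives a per-name tally, which also shows how many distinct domains are affected.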

I'm aware that this means the upstream DNS server is broken; isn't that 
what I said, i.e., that it isn't responding properly to AAAA queries?


The problem is that I have no control over the upstream resolver. All I 
have control over is my own name server.


I am not the only one who is going to encounter this problem. I've found 
several reports of it on the net with a minimal amount of searching. I 
think something more general has to be done than giving me advice about 
what to change in my named.conf. I appreciate the advice for how to fix 
the problem for myself, but I think it needs to be fixed for everyone.


Have a look at bind's filter-aaaa-on-v4 and deny-answer-addresses options
which should allow you to prevent applications from trying to use IPv6.

Neither of these options is documented in named.conf(5) or 
resolv.conf(5). Is this a problem that is specific to the Fedora 15 
versions of these man pages, or is the documentation distributed with 
BIND out-of-date?


I tried to use the option and I get "is not configured" in my log when 
named starts up and then "parsing failed", so I think my BIND must not 
be compiled with --enable-filter-aaaa, right? That makes it difficult to 
use this solution. Perhaps that's also why it isn't listed in the man page?


  jik




Re: Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-11 Thread Jonathan Kamens

On 7/11/2011 3:26 PM, Eivind Olsen wrote:

I think the main issue here is - why is your nameserver thinking it has
IPv6 connectivity?

No, this isn't the issue.

I see the FORMERR errors in syslog and the timeouts resolving host names 
even when I start named with -4.


Named is querying for AAAA records even when it is started with -4, and 
it is the querying, not the connectivity, that is the issue.


  jik




Re: Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-11 Thread Jonathan Kamens

On 7/11/2011 4:06 PM, Bill Owens wrote:

https://lists.isc.org/pipermail/bind-users/2011-March/083109.html
  in which the first sentence says it all: "The nameservers for wikipedia.org are 
broken."

It's not just wikipedia.org that's broken, obviously. I see this error 
in my logs for 19 domains since July 3:


Even if PowerDNS is the only source of this issue, and even if the new 
version of PowerDNS is released tomorrow, I'm sure there will still be 
sites running the old version a year from now. So just relying on a 
PowerDNS release to fix this problem seems unwise.


Users are experiencing this problem /now/ in the field, and more users 
will be experiencing it as BIND is upgraded in more and more places. 
Every single user relying on a Fedora 15 DNS server, for example, is 
going to see occasional unnecessary DNS timeouts when trying to resolve 
host names.


It seems clear to me that a generally available, generally applicable 
fix to BIND is needed to avoid this issue and perhaps others 
like it.


  jik


