Re: Clients get DNS timeouts because ipv6 means more queries for each lookup
On 07/13/2011 02:13 AM, Mark Andrews wrote:
> No. The fix is to correct the nameservers. They are not correctly following the DNS protocol and everything else is a fall out from that.

You're right that everything else is fallout from that. But that doesn't do me much good, does it? It's my system that keeps getting bogus name resolution errors. It's my RSS feed reader that keeps failing on an hourly basis when the cached records for en.wikipedia.org expire.

It's all very well and good to say that the Wikipedia folks and other people with this problem should fix their nameservers -- I totally agree with that -- but it doesn't help me solve my problem /now/. I'm a real user in the real world with a real problem. Yelling at Wikipedia to fix their DNS servers may feel good, but it doesn't make my DNS work. As far as I and all the other users who are being impacted /now/ by this problem are concerned, it's just pissing into the wind.

>> Well, all the prodding from people here prompted me to investigate further exactly what's going on. The problem isn't what I thought it was. It appears to be a bug in glibc, and I've filed a bug report and found a workaround.
> There is no bug in glibc.

To be blunt, that's bullshit. If glibc makes an A query and an AAAA query, and it gets back a valid response to the A query and an invalid response to the AAAA query, then it should ignore the invalid response to the AAAA query and return the valid A response to the user as the IP address for the host.

Please note, furthermore, that as I explained in detail in my bug report and in my last message, glibc behaves differently based on the /order/ in which the two responses are returned by the DNS server.
Since there's nothing that says a DNS server has to respond to two queries in the order in which they were received, and that would be an impossible requirement to impose in any case, since the queries and responses are sent via UDP, which doesn't guarantee order, it's perfectly clear that glibc needs to be prepared to function the same regardless of the order in which it receives the responses. What's more, there's plenty of code in the glibc files I spent hours poring over which is clearly an attempt to do exactly that. The people who wrote the code just got it wrong. Which isn't surprising, given how god-awful the code is.

This is not an either/or situation. The broken nameservers should be fixed, /and/ glibc should be fixed to properly handle the case where it sends two queries and gets back one valid response and one server error in reverse order.

>> In a nutshell, the getaddrinfo function in glibc sends both A and AAAA queries to the DNS server at the same time and then deals with the responses as they come in. Unfortunately, if the responses to the two queries come back in reverse order, /and/ the first one to come back is a server failure, both of which are the case when you try to resolve en.wikipedia.org immediately after restarting your DNS server so nothing is cached, the glibc code screws up and decides it didn't get back a successful response even though it did.
> There is *nothing* wrong with sending both queries at once.

I didn't say there was. You really don't seem to be paying very good attention. Do you understand what the word /workaround/ means?

> Note your fix won't help clients that only ask for AAAA records because it is the authoritative servers that are broken, not the resolver library or the recursive server.

I am aware of that. It is irrelevant, because it is not the problem I am trying to solve. I, and 99.99% of the users in the world, are /not/ "only ask[ing] for AAAA records."
Nobody actually trying to use the internet for day-to-day work is doing that right now, because to say that IPv6 support is not yet ubiquitous would be a laughably momentous understatement.

You seem to have a really big chip on your shoulder about people who run broken DNS servers. I don't like them any more than you do. But I learned "Be generous in what you accept and conservative in what you generate" way back when I started playing with the Internet well over two decades ago. It holds up now as well as it did back then, and there's no good reason why it shouldn't apply in this case.

It's clear that this is a religious issue for you. I'm not here to debate religion, I'm here to get help making my DNS work, and to help other people, to whatever extent I can, make /their/ DNS work. If you continue to send religious screeds on this topic while making no effort to actually read and understand what I write, please do not expect me to respond further.

Jonathan Kamens

smime.p7s Description: S/MIME Cryptographic Signature
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
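The order-independence argued for in this message can be illustrated with a toy model. This is only a sketch in Python under stated assumptions, not glibc's actual code: the Response class and the merge_responses function are hypothetical names invented for the illustration, and 192.0.2.10 is a documentation address, not a real answer for any host.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Response:
    """One DNS answer as a stub resolver might see it (hypothetical model)."""
    qtype: str                      # "A" or "AAAA"
    rcode: str                      # "NOERROR", "SERVFAIL", "FORMERR", ...
    addresses: List[str] = field(default_factory=list)


def merge_responses(responses: List[Response]) -> List[str]:
    """Combine parallel A/AAAA answers without depending on arrival order.

    Every successful answer contributes its addresses; an error for one
    record type does not discard a success for the other.
    """
    addresses: List[str] = []
    for resp in responses:
        if resp.rcode == "NOERROR":
            addresses.extend(resp.addresses)
    # Sort so the result is identical whichever answer arrived first.
    return sorted(addresses)


a_ok = Response("A", "NOERROR", ["192.0.2.10"])
aaaa_fail = Response("AAAA", "SERVFAIL")

in_order = merge_responses([a_ok, aaaa_fail])
reversed_order = merge_responses([aaaa_fail, a_ok])
print(in_order == reversed_order, in_order)
```

The point of the sketch is only that the failed AAAA answer is skipped rather than allowed to overwrite the state left by the successful A answer, so both arrival orders yield the same list.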
RE: Clients get DNS timeouts because ipv6 means more queries for each lookup
> I agree that the order of the A/AAAA responses shouldn't matter to the result. The whole getaddrinfo() call should fail regardless of whether the failure is seen first or the valid response is seen first. Why? Because getaddrinfo() should, if it isn't already, be using the RFC 3484 algorithm (and/or whatever the successor to RFC 3484 ends up being) to sort the addresses, and for that algorithm to work, one needs *both* the IPv4 address(es) *and* the IPv6 address(es) available, in order to compare their scopes, prefixes, etc.

RFC 3484 tells you how to sort addresses you've got. If you've only got one address, then bang! It's already sorted for you. You don't need RFC 3484 to tell you how to sort it.

I have to say that some of the people on this list seem completely detached from what real users in the real world want their computers to do. If I am trying to connect to a site on the internet, then I want my computer to do its best to try to connect to the site. I don't want it to throw up its hands and say, "Oh, I'm sorry, one of my address lookups failed, so I'm not going to let you use the other address lookup, the one that succeeded, because some RFC somewhere could be interpreted as implying that's a bad idea, if I wanted to do so." Please, that's ridiculous.

> If one of the lookups fails, and this failure is presented to the RFC 3484 algorithm as NODATA for a particular address family, then the algorithm could make a bad selection of the destination address, and this can lead to other sorts of breakage, e.g. trying to use a tunneled connection where no tunnel exists.

If the address the client gets doesn't work, then the address doesn't work. How is being unable to connect because the address turned out to not be routable different from being unable to connect because the computer refused to even try?

> Another possibility you're not considering is that the invoking application itself may make independent IPv4-specific and IPv6-specific getaddrinfo() lookups.
> Why would it do this? Why not? Maybe IPv6 capability is something the user has to buy a separate license for, so the IPv6 part is a slightly separate codepath, added in a later version, than the base product, which is IPv4-only. When one of the getaddrinfo() calls returns address records and the other returns garbage, your fix doesn't prevent such an application from doing something unpredictable, possibly catastrophic. So it's really not a general solution to the problem.

I have no idea what you're talking about. If the application makes independent IPv4 and IPv6 getaddrinfo() lookups, then the change I'm proposing to glibc is completely irrelevant and does not impact the existing functionality in any way. The IPv4 lookup will succeed, the IPv6 lookup will fail, and the application is then free to decide what to do.

In summary, getaddrinfo() with AF_UNSPEC has a very clear meaning - "Give me whatever addresses you can." The man page says, and I am quoting, "The value AF_UNSPEC indicates that getaddrinfo() should return socket addresses for any address family (either IPv4 or IPv6, for example) that can be used with node and service." I don't see how the language could be any more clear. To suggest that it's reasonable and correct for it to refuse to return a successfully fetched address is simply ludicrous.

I hope and pray that people who maintain the glibc code have more common sense about what users want and expect from their software. In the meantime, it's clear that I don't belong on this mailing list, so I'm out of here.

Jonathan Kamens
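For the case discussed above, where an application makes independent IPv4 and IPv6 lookups and then decides for itself what to do, a minimal Python sketch might look like the following. The function name lookup_per_family is hypothetical; socket.getaddrinfo and socket.gaierror are the standard library's wrappers around the C interface. A numeric IPv4 literal is used so that the example needs no DNS server at all.

```python
import socket


def lookup_per_family(host, port):
    """Run independent IPv4 and IPv6 lookups and keep whatever succeeds.

    Returns (addresses, errors): addresses is a list of (family, ip) pairs,
    errors maps each failed address family to the gaierror it raised.
    """
    addresses, errors = [], {}
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror as exc:
            errors[family] = exc      # remember the failure, keep going
            continue
        addresses.extend((fam, sockaddr[0]) for fam, _, _, _, sockaddr in infos)
    return addresses, errors


# The IPv4 lookup of an IPv4 literal succeeds; whether the IPv6 lookup
# fails depends on the platform's resolver, but either way the caller
# still ends up with a usable address.
addrs, errs = lookup_per_family("127.0.0.1", 80)
print(addrs, errs)
```

Nothing here depends on how a combined AF_UNSPEC lookup behaves, which is the point being made: the two per-family results are reported separately and the caller chooses what to do with a partial success.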
Re: Clients get DNS timeouts because ipv6 means more queries for each lookup
Well, all the prodding from people here prompted me to investigate further exactly what's going on. The problem isn't what I thought it was. It appears to be a bug in glibc, and I've filed a bug report and found a workaround.

In a nutshell, the getaddrinfo function in glibc sends both A and AAAA queries to the DNS server at the same time and then deals with the responses as they come in. Unfortunately, if the responses to the two queries come back in reverse order, /and/ the first one to come back is a server failure, both of which are the case when you try to resolve en.wikipedia.org immediately after restarting your DNS server so nothing is cached, the glibc code screws up and decides it didn't get back a successful response even though it did.

If you do the same lookup again, it works, because the CNAME that was sent in response to the A query is cached, so both the A and AAAA queries get back valid responses from the DNS server. And even if that weren't the case, since the CNAME is cached it gets returned first, since the server doesn't need to do a query to get it, whereas it does need to do another query to get the AAAA record (which, recall, isn't being cached because of the previously discussed FORMERR problem).

It'll keep working until the cached records time out, at which point it'll happen again, and then be OK again until the records time out, etc.

The workaround is to put "options single-request" in /etc/resolv.conf to prevent the glibc innards from sending out both the A and AAAA queries at the same time.

FYI, here's the glibc bug I filed about this:

http://sourceware.org/bugzilla/show_bug.cgi?id=12994

Thank you for telling me I was full of it and making me dig deeper into this until I located the actual cause of the issue. :-)

jik
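To check whether a given resolv.conf already carries the workaround, something like the following Python sketch could be used. The function name has_single_request and the sample text are hypothetical; the only facts taken from the thread are the file name and the "options single-request" line, plus the resolv.conf convention that lines starting with "#" or ";" are comments.

```python
def has_single_request(resolv_conf_text):
    """Return True if an 'options' line lists the 'single-request' option."""
    for line in resolv_conf_text.splitlines():
        # Lines beginning with '#' or ';' are comments in resolv.conf.
        if line.startswith(("#", ";")):
            continue
        fields = line.split()
        # Exact match, so 'single-request-reopen' alone does not count.
        if fields and fields[0] == "options" and "single-request" in fields[1:]:
            return True
    return False


sample = "nameserver 127.0.0.1\noptions single-request\n"
print(has_single_request(sample))
```

In practice the text would come from reading /etc/resolv.conf; the sketch takes a string so that it can be tried without touching the real file.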
Clients get DNS timeouts because ipv6 means more queries for each lookup
The number of DNS queries required for each address lookup requested by a client has gone up considerably because of IPv6. The problem is being exacerbated by the fact that many DNS servers on the net don't yet support IPv6 queries. The result is that address lookups are frequently taking so long that the client gives up before getting the result.

The example I am seeing this with most frequently is my RSS feed reader, rss2email, trying to read a feed from en.wikipedia.org in a cron job that runs every 15 minutes. I am regularly seeing this in the output of the cron job:

    W: Name or service not known [8] http://en.wikipedia.org/w/index.php?title=[elided]&feed=atom&action=history

The wikipedia.org domain has three DNS servers. Let's assume that the root and org. nameservers are cached already when rss2email does its query. If so, then it has to do the following queries:

    wikipedia.org DNS
    en.wikipedia.org AAAA
    en.wikipedia.org A

This is fine when the wikipedia.org nameservers are working, but let's postulate for the moment that two of them are down, unreachable, or responding slowly, which apparently happens pretty often. Then we end up doing:

    wikipedia.org DNS
    en.wikipedia.org AAAA /times out/
    en.wikipedia.org AAAA /times out/
    en.wikipedia.org AAAA
    en.wikipedia.org A /times out/
    en.wikipedia.org A /times out/
    en.wikipedia.org A

By the end of that sequence, the typical 30-second DNS request timeout has been exceeded, and the client gives up.

I said above that the problem is exacerbated by the fact that many DNS servers don't yet support IPv6 queries. This is because the AAAA queries don't get NXDOMAIN responses, which would be cached, but rather FORMERR responses, which are not cached. As a result, the scenario described above happens much more frequently because the DNS server has to redo the AAAA queries often.

One suggestion that I've seen on the net for how to mitigate this problem is to treat FORMERR responses as negative and cache them just like NXDOMAIN responses are cached.
I took a look at the bind code in resolver.c briefly to see how easy it would be to do this, and although it doesn't look like it would be particularly difficult, I don't feel like I know the ins and outs of the DNS protocol and BIND implementation enough to be confident that I'd get it right.

I'm interested to hear if other people are encountering this problem and if the developers who work on BIND have any thoughts about how to mitigate it, short of getting everyone on the internet to upgrade to nameservers that support IPv6.

Thanks,

Jonathan Kamens
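The effect of the suggestion mentioned above, caching FORMERR the way NXDOMAIN is cached, can be modeled with a toy cache. This is a sketch in Python under loud assumptions: it is not BIND's resolver.c logic, the ToyResolverCache class, the broken_upstream function, and the 300-second negative TTL are all hypothetical, and real negative caching has more rules than this.

```python
import time


class ToyResolverCache:
    """Toy negative cache; a model for discussion, not BIND's behavior."""

    def __init__(self, cache_formerr=False, negative_ttl=300.0, clock=time.monotonic):
        self.cache_formerr = cache_formerr
        self.negative_ttl = negative_ttl   # hypothetical TTL in seconds
        self.clock = clock
        self.upstream_queries = 0
        self._negative = {}                # (name, qtype) -> expiry time

    def resolve(self, name, qtype, upstream):
        key = (name, qtype)
        expiry = self._negative.get(key)
        if expiry is not None and expiry > self.clock():
            return "CACHED-NEGATIVE"       # answered without going upstream
        self.upstream_queries += 1
        rcode = upstream(name, qtype)
        cacheable = {"NXDOMAIN"} | ({"FORMERR"} if self.cache_formerr else set())
        if rcode in cacheable:
            self._negative[key] = self.clock() + self.negative_ttl
        return rcode


def broken_upstream(name, qtype):
    """Stand-in for a server that answers AAAA queries with FORMERR."""
    return "FORMERR" if qtype == "AAAA" else "NOERROR"


for cache_formerr in (False, True):
    cache = ToyResolverCache(cache_formerr=cache_formerr)
    for _ in range(5):
        cache.resolve("en.wikipedia.org", "AAAA", broken_upstream)
    print(cache_formerr, cache.upstream_queries)
```

With FORMERR uncached, every one of the five lookups goes upstream; with it cached, only the first does, which is the reduction in repeated AAAA queries that the suggestion is after.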
Re: Clients get DNS timeouts because ipv6 means more queries for each lookup
On 7/11/2011 3:10 PM, Tony Finch wrote:
> Jonathan Kamens <j...@kamens.us> wrote:
>> I said above that the problem is exacerbated by the fact that many DNS servers don't yet support IPV6 queries. This is because the AAAA queries don't get NXDOMAIN responses, which would be cached, but rather FORMERR responses, which are not cached. As a result, the scenario described above happens much more frequently because the DNS server has to redo the AAAA queries often.
> Your upstream resolver is broken if it returns FORMERR responses to AAAA queries. The behaviour you describe is not normal.

There are people reporting all over the net that they're getting tons of messages like this in their logs with recent BIND versions:

    Jul 11 12:00:06 jik2 named[31354]: error (FORMERR) resolving 'en.wikipedia.org/AAAA/IN': 208.80.152.130#53

I've got 397 of them in my logs for just the last 24 hours.

I'm aware that this means the upstream DNS server is broken; isn't that what I said, i.e., that it isn't responding properly to AAAA queries?

The problem is that I have no control over the upstream resolver. All I have control over is my own name server.

I am not the only one who is going to encounter this problem. I've found several reports of it on the net with a minimal amount of searching. I think something more general has to be done than giving me advice about what to change in my named.conf. I appreciate the advice for how to fix the problem for myself, but I think it needs to be fixed for everyone.

> Have a look at bind's filter-aaaa-on-v4 and deny-answer-addresses options which should allow you to prevent applications from trying to use IPv6.

Neither of these options is documented in named.conf(5) or resolv.conf(5). Is this a problem that is specific to the Fedora 15 versions of these man pages, or is the documentation distributed with BIND out-of-date?
I tried to use the option and I get "is not configured" in my log when named starts up and then "parsing failed", so I think my BIND must not be compiled with --enable-filter-aaaa, right? That makes it difficult to use this solution. Perhaps that's also why it isn't listed in the man page?

jik
Re: Clients get DNS timeouts because ipv6 means more queries for each lookup
On 7/11/2011 3:26 PM, Eivind Olsen wrote:
> I think the main issue here is - why is your nameserver thinking it has IPv6 connectivity?

No, this isn't the issue. I see the FORMERR errors in syslog and the timeouts resolving host names even when I start named with -4. Named is querying for AAAA records even when it is started with -4, and it is the querying, not the connectivity, that is the issue.

jik
Re: Clients get DNS timeouts because ipv6 means more queries for each lookup
On 7/11/2011 4:06 PM, Bill Owens wrote:
> https://lists.isc.org/pipermail/bind-users/2011-March/083109.html in which the first sentence says it all: "The nameservers for wikipedia.org are broken."

It's not just wikipedia.org that's broken, obviously. I see this error in my logs for 19 domains since July 3:

Even if PowerDNS is the only source of this issue, and even if the new version of PowerDNS is released tomorrow, I'm sure there will still be sites running the old version a year from now. So just relying on a PowerDNS release to fix this problem seems unwise.

Users are experiencing this problem /now/ in the field, and more users will be experiencing it as BIND is upgraded in more and more places. Every single user relying on a Fedora 15 DNS server, for example, is going to see occasional unnecessary DNS timeouts when trying to resolve host names.

It seems clear to me that a generally available, generally applicable fix to BIND is needed to avoid this issue and perhaps similar issues like it.

jik
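A per-domain tally of these errors, like the count of affected domains mentioned above, can be produced with a short script. The following Python sketch assumes log lines shaped like the named FORMERR entry quoted earlier in this thread; the function name formerr_counts is hypothetical, and the second sample line is invented purely to show a non-matching entry.

```python
import re
from collections import Counter

# Captures the queried name from entries such as:
#   ... named[31354]: error (FORMERR) resolving 'en.wikipedia.org/AAAA/IN': ...
FORMERR_RE = re.compile(r"error \(FORMERR\) resolving '([^'/]+)/")


def formerr_counts(log_lines):
    """Count FORMERR resolution errors per queried name."""
    counts = Counter()
    for line in log_lines:
        match = FORMERR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Jul 11 12:00:06 jik2 named[31354]: error (FORMERR) resolving "
    "'en.wikipedia.org/AAAA/IN': 208.80.152.130#53",
    "Jul 11 12:00:07 jik2 named[31354]: some unrelated message",  # invented line
]
print(formerr_counts(sample))
```

Feeding it the lines of a real syslog file and taking the length of the result would give the number of distinct names affected.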