On Tue, 9 May 2017 20:37:16 +0200, Lennart Poettering <lenn...@poettering.net> wrote:
> On Tue, 09.05.17 00:42, Kai Krakow (hurikha...@gmail.com) wrote:
>
> > On Sat, 6 May 2017 14:22:21 +0200, Kai Krakow <hurikha...@gmail.com> wrote:
> >
> > > On Fri, 5 May 2017 20:18:41 +0200, Lennart Poettering <lenn...@poettering.net> wrote:
> > > > [...]
> [...]
> [...]
> > >
> > > It looks like this all has to do with timeouts:
> >
> > Fixed by restarting the router. The cable modem seems to be buggy
> > with UDP packets after a lot of uptime: it simply silently drops UDP
> > packets at regular intervals, and the WebUI was also very slow, probably a
> > CPU issue.
> >
> > I'll follow up on this with the cable provider.
> >
> > When the problem starts to show up, systemd-resolved is affected
> > more by this than direct resolving. I don't know if there's
> > something that could be optimized in systemd-resolved to handle
> > such issues better, but I don't consider it a bug in
> > systemd-resolved; it was a local problem.
>
> Normally configured DNS servers should be equivalent, and hence
> switching them for each retry should not come at any cost, hence,
> besides the extra log output, do you experience any real issues?

Since I restarted the router, there are no longer any such logs except maybe a few per day (fewer than 4). But back when those logs were being spammed to the journal, the real problem was the DNS resolver taking 10 seconds, about once per minute, to resolve a website address - which really was a pain.

But then, what could systemd-resolved have done about it when the real problem was some piece of network equipment? I just wonder why the problem was less visible when using those DNS servers directly. Since DNS must have been designed with occasional packet loss in mind (because it uses UDP), there must be a way to handle this better. So I read a bit of https://www.ietf.org/rfc/rfc1035.txt.

RFC 1035 section 4.2.1 suggests that the retransmission interval for queries should be 2-5 seconds, depending on statistics of previous queries.
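As a rough illustration of that suggestion, the retransmission interval could be derived from a smoothed average of previous response times and clamped to the RFC's 2-5 second window. This is only a sketch of one possible policy; the multiplier and the clamping are my own illustrative choices, not anything from the RFC or from systemd-resolved:

```python
# Illustrative sketch: derive a retransmission interval from per-server
# response-time statistics, clamped to the 2-5 s range that RFC 1035
# section 4.2.1 suggests. The factor of 4 is an arbitrary assumption.

RETRANS_MIN = 2.0  # seconds, lower bound from RFC 1035 s4.2.1
RETRANS_MAX = 5.0  # seconds, upper bound from RFC 1035 s4.2.1

def retransmission_interval(avg_rtt: float) -> float:
    """Wait a few multiples of the smoothed RTT before retransmitting,
    but never less than 2 s and never more than 5 s."""
    return min(max(RETRANS_MIN, 4.0 * avg_rtt), RETRANS_MAX)
```

For example, a server averaging 50 ms replies would still get the 2 second minimum, while a very slow or lossy server would be capped at 5 seconds.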
To me, "retransmission" means the primary DNS server should not be switched on every query timeout (while it is still fine to send the same request to the next available server in parallel).

RFC 1035 section 7 discusses a suggested resolver implementation and covers the retransmission and server-selection algorithms. It suggests recording the average response time for each server queried, so that the faster-responding servers are tried first. Without query history, the selection algorithm should assume a response time of 5-10 seconds. It also suggests switching the primary server only after some "bizarre" error or a server error reply. However, I don't think the resolver should actually remove servers from the list as suggested there, since we are a client resolver, not a server resolver that can refresh its peer list from neighbor servers. Instead, we could reset a server's query-time statistics to move it to the end of the list, and/or blacklist it for a while.

Somewhere else in that document I read that it is well permitted to interleave multiple parallel requests to multiple DNS servers in the list. So I guess it would be nice, and allowed, if systemd-resolved used more than one DNS server at the same time by alternating between them per request - maybe taking the two best according to the query-time statistics.

I also guess that it should use shorter timeouts per query when more than one DNS server is available: the initial query-time statistic should assume 5-10 seconds, while the retransmission interval suggests 2-5 seconds. I think it would work to use "10 seconds divided by server count" or 2 seconds, whichever is bigger, as the timeout before rotating to the next DNS server. A late reply should still be accepted, as pointed out in section 7.3, even when the query has already been rotated to the next DNS server. With only a single DNS server, all this logic can be skipped, as there is no rotation, and a timeout of 10 seconds would work.
That way, systemd-resolved would "learn" to use only the fastest DNS server, and when that server becomes too slow (5-10 seconds, based on the RFC), it would switch to the next one. If parallel requests come in, it would use more DNS servers from the list in parallel, auto-sorted by query response time. The startup order is the one given by the administrator (or by whatever provides the DNS server list).

Applied to my UDP packet loss (which seems to be single-packet loss, since an immediate retry would have gotten a reply), this would mean the resolver gives me an address after 2-3 seconds instead of 5 or 10, because I had 4 DNS servers on that link. This is more or less what I saw previously when I switched back to direct resolving instead of going through systemd-resolved.

What do you think? I could see this becoming an implementation-improvement project in the GitHub issue tracker. I would be willing to work on such an improvement, provided the existing code is not too difficult to understand, since I'm not a C pro (my last bigger project was about 10 years ago). Naturally, such an improvement could only be a spare-time project for me and would take some time.

Of course, all this only works when all DNS servers on the same link can resolve the same zones; otherwise you will get a lot of timeouts and switching. I know that some people prefer to list DNS server IPs that resolve different zones (or can resolve only some). I never understood why anyone would supply such a misconfiguration, and it is mostly seen in the Windows world, but well: it exists. A warning in the log when such a situation is detected could address this. And in the end, systemd-resolved is able to correctly handle per-link DNS servers.

--
Regards,
Kai

Replies to list-only preferred.

_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel