Hi Remi,

Thanks for your clarifications (see inline below).
> On 6 Mar 2020, at 14:26, Remi Gacogne via dnsdist
> <dnsdist@mailman.powerdns.com> wrote:
>
> Hi,
>
> On 3/6/20 8:09 AM, Fredrik Pettai via dnsdist wrote:
>>> On 6 Mar 2020, at 05:42, Michael Van Der Beek <michael....@antlabs.com>
>>> wrote:
>>> Have you noticed this setting on dnsdist?
>>> setUDPTimeout(num)
>>
>> Yes, I did, but I didn’t play around with that before I sent the email to
>> the mailing list.
>>
>>> Set the maximum time dnsdist will wait for a response from a backend over
>>> UDP, in seconds. Defaults to 2
>>> I'm not sure if timeouts are classified as drops. My guess is probably,
>>> because it didn't get a response in time.
>>
>> Yes they are.
>
> "Drops", as reported by dnsdist, are almost always caused by the backend
> not responding fast enough. On some setups, dealing with 100k+ qps, it
> might also be caused by dnsdist not processing the responses fast
> enough, but that's very easy to spot because at least one of the dnsdist
> threads will use ~100% of one core.
>
>>> Since your backend is a recursor, there are times when the recursor cannot
>>> reach, or encounters, a non-responsive authoritative server. Unbound has an
>>> exponential backoff when querying such servers. I think it starts with 10s.
>>> https://nlnetlabs.nl/documentation/unbound/info-timeout/
>>>
>>> I would suggest you set the dnsdist setUDPTimeout(10). Frankly, if Unbound
>>> cannot respond to you in < 10 seconds, most likely the target authoritative
>>> server is not responding.
>>
>> Good point. While I didn’t turn to the unbound documentation (thanks for the
>> pointer), I played around with the UDPTimeout setting yesterday,
>> first increasing to setUDPTimeout(5), which yielded better results in terms
>> of Drops (and increased the latency), and then later to 15, just to be sure
>> that unbound really should be done with queries, and noticed that the Drops
>> became a lot fewer (and latency increased again). But as you suggest,
>> setUDPTimeout(10) is probably the ultimate setting.
>
> OK so that settles it, your backends are not responding fast enough to
> some queries. I would really advise you to try to understand why the
> backend is taking so long to respond, instead of tuning dnsdist via
> setUDPTimeout(), because a latency greater than 2s is going to cause a
> lot of issues anyway.

Right. In this case the #1 reason for queries that don’t make it under 2s is
queries generated by some MX servers and the software running on them. A lot
of crappy stuff out on the Internet is in contact with those
servers/services, so broken reverse zones and badly set up domains that send
spam are what I see in topSlow() all the time.

This brings back one of the (last) questions in my original email, which was:
Is there a simple way to move those long-tail queries / DNS clients into a
“slow pool”?

Or maybe I should rephrase it: From a dnsdist point of view, would it be a
good idea to move clients that ask lots of questions about badly functioning
domains to their own worker pool?

I can’t seem to find any ready-to-use Rule/Action for assigning clients that
are causing X amount of SERVFAILs (or Timeouts) to a PoolAction.
(Although I see there's a possibility to block clients with such a query
pattern (SERVFAIL/s), that’s not the right solution or service in this case.)
(I’m guessing “anything can be done” with some clever Lua scripting, but
that’s not really the same as “simple”.)

I thought of using an NMG to statically map such clients (the MX servers)
into their own worker pool, but I didn’t get that to work :(
(Perhaps I did it wrong, or I misinterpret the function of an NMG.)

Re,
Fredrik
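
For reference, the static mapping described above (sending a fixed set of MX
clients to their own pool) might be sketched in dnsdist's Lua configuration
roughly as follows. The backend addresses, the client netmasks and the pool
name "slow" are placeholders invented for illustration, not values from this
thread. newNMG(), NetmaskGroupRule() and PoolAction() are existing dnsdist
calls, but this is an untested sketch under those assumptions, not a verified
fix for the setup discussed here.

```lua
-- Backends: one in the default pool, one in a dedicated pool.
-- All addresses below are placeholders (documentation ranges).
newServer({address="192.0.2.1:53"})               -- default pool
newServer({address="192.0.2.2:53", pool="slow"})  -- hypothetical "slow" pool

-- NetmaskGroup holding the source addresses of the MX servers.
local slowClients = newNMG()
slowClients:addMask("198.51.100.10/32")
slowClients:addMask("2001:db8::/64")

-- NetmaskGroupRule matches on the query's source address by default;
-- matching clients are sent to the "slow" pool.
addAction(NetmaskGroupRule(slowClients), PoolAction("slow"))
```

Since rules are evaluated in order, this rule would presumably need to come
before any other rule that selects a pool for the same clients.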
_______________________________________________
dnsdist mailing list
dnsdist@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/dnsdist