Hi Remi,

Thanks for your clarifications (see inline below).

> On 6 Mar 2020, at 14:26, Remi Gacogne via dnsdist 
> <dnsdist@mailman.powerdns.com> wrote:
> 
> Hi,
> 
> On 3/6/20 8:09 AM, Fredrik Pettai via dnsdist wrote:
>>> On 6 Mar 2020, at 05:42, Michael Van Der Beek <michael....@antlabs.com> 
>>> wrote:
>>> Have you noticed this setting on dnsdist.
>>> setUDPTimeout(num)
>> 
>> Yes, I did, but I didn’t play around with that before I sent the email to 
>> the mailing list
>> 
>>> Set the maximum time dnsdist will wait for a response from a backend over 
>>> UDP, in seconds. Defaults to 2
>>> I'm not sure if timeouts are classified as drops. My guess probably, 
>>> because it didn't get a response in time.
>> 
>> Yes they are.
> 
> "Drops", as reported by dnsdist, are almost always caused by the backend
> not responding fast enough. On some setups, dealing with 100k+ qps, it
> might also be caused by dnsdist not processing the responses fast
> enough, but that's very easy to spot because at least one of the dnsdist
> threads will use ~100% of one core.
> 
>>> Since your backend is a recursor. There are times that the recursor cannot 
>>> reach or encounters a non-responsive authoritative server.  Unbound has an 
>>> exponential backoff when querying such servers. I think it starts with 10s.
>>> https://nlnetlabs.nl/documentation/unbound/info-timeout/
>>> 
>>> I would suggest you set the dnsdist setUDPTimeout(10), frankly, if Unbound 
>>> cannot respond to you in < 10 seconds, most likely the target authoritative 
>>> server is not responding.
>> 
>> Good point. While I didn’t turn to the unbound documentation (thanks for the 
>> pointer), I played around with the UDPTimeout setting yesterday,
>> first increasing to setUDPTimeout(5), which yielded better results in terms 
>> of Drops (and increased the latency), and then later to 15, just to be sure 
>> that unbound really should be done with queries, and noticed that the Drops 
>> became a lot fewer (and latency increased again). But as you suggest, 
>> setUDPTimeout(10) is probably the ultimate setting.
> 
> OK so that settles it, your backends are not responding fast enough to
> some queries. I would really advise you to try to understand why the
> backend is taking so long to respond, instead of tuning dnsdist via
> setUDPTimeout(), because a latency greater than 2s is going to cause a
> lot of issues anyway.

Right. In this case the #1 source of queries that don’t make it under 2s is 
a set of MX servers and the software running on them.
A lot of crappy stuff out on the Internet is in contact with those 
servers/services, so broken reverse zones and badly set up domains that send 
spam are what I see in topSlow() all the time.

This brings back one of the (last) questions in my original email, which was:
Is there a simple way to move those long-tail queries / DNS clients into a 
“slow pool”?
Or maybe I should rephrase it:
From a dnsdist PoV, would it be a good idea to move clients that ask lots 
of questions about badly functioning domains to their own worker pool?

I don’t seem to find any ready-to-use Rule/Action for sending clients that 
cause X amount of SERVFAILs (or timeouts) to a PoolAction.
(Although I see there's a possibility to block clients with such a query 
pattern (SERVFAIL/s), that’s not the right solution or service in this case.)
(I’m guessing “anything can be done” with some clever Lua scripting, but that’s 
not really the same as “simple”.)

I thought of using an NMG to statically map such clients (the MX servers) into 
their own worker pool, but I didn’t get that to work :(
(perhaps I did it wrong, or I misunderstand what an NMG is for)
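
A static mapping of this kind could be sketched roughly as follows. The 
backend addresses, the client netmasks and the pool name "slow" are made-up 
placeholders, and this is an illustration rather than a tested configuration:

```lua
-- Regular recursor, serving the default pool (placeholder address).
newServer({address="192.0.2.1:53"})

-- Recursor reserved for the slow clients; it is only a member of the
-- pool named "slow" (placeholder address and pool name).
newServer({address="192.0.2.2:53", pool="slow"})

-- Netmask group holding the source addresses of the MX servers
-- (placeholder addresses from the documentation ranges).
local slowClients = newNMG()
slowClients:addMask("198.51.100.25/32")
slowClients:addMask("2001:db8::25/128")

-- Queries whose source address matches the group go to the "slow" pool;
-- everything else falls through to the default pool.
addAction(NetmaskGroupRule(slowClients), PoolAction("slow"))
```

Rules are evaluated in order, so such a rule would need to come before any 
other rule that already picks a pool for the same clients.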

Re,
Fredrik
