Re: [dnsdist] Latency based routing

Lucas Rolff via dnsdist Thu, 26 Jun 2025 05:25:56 -0700

Hello Remi,

Thanks for your reply!


> Not without patching the code. It would only take a few lines of C++ to make 
> it available, though.

I'll take a look at the code and see if it's something I could do myself and 
submit a PR for. Without having looked yet, I'd assume it's effectively the 
same as the :getDrops() one, just using a different counter.
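
For context, this is roughly how a per-server query count would be used in 
the selection logic. It is a minimal Python model rather than dnsdist Lua: 
`Backend`, `pick_lowest_latency` and `MIN_QUERIES` are made-up names, the 
`queries` field stands in for the counter that is not exposed yet, and the 
threshold of 128 is only an assumption taken from the 128-query window 
mentioned in the quoted message.

```python
from dataclasses import dataclass

# Assumed warm-up threshold, borrowed from the 128-query window in the thread.
MIN_QUERIES = 128


@dataclass
class Backend:
    """Hypothetical stand-in for a dnsdist server object."""
    name: str
    up: bool
    latency_usec: float  # what :getLatency() would report
    queries: int         # what a future :getQueries() might report


def pick_lowest_latency(backends, min_queries=MIN_QUERIES):
    """Return the up backend with the lowest latency, or None if none is up.

    Backends that have answered fewer than min_queries queries are treated
    as still warming up: they are only considered when no warmed-up backend
    is available.
    """
    up = [b for b in backends if b.up]
    if not up:
        return None
    warm = [b for b in up if b.queries >= min_queries]
    candidates = warm or up
    return min(candidates, key=lambda b: b.latency_usec)
```

Under this rule alone a freshly started backend never receives traffic and so 
never warms up, which is the second problem from the original message.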

> The reason I'm not very fond of this idea is that health-check queries are 
> often not representative of actual traffic, and would thus skew the latency 
> metrics in many cases. But I get your point, while most deployments get a lot 
> of traffic and therefore don't really care about the short time it takes to 
> get useful metrics, it might be different for low-traffic deployments or for 
> backup servers. Do you think it might work if dnsdist were to update the 
> latency from health-check queries if, and only if, there was no "regular" 
> query processed by the server in a fixed interval (let's say 60 seconds? I 
> have not really thought about it). The first health-check query would then of 
> course automatically update the latency unless a "regular" query was 
> processed before the health-check succeeded.
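
Read literally, the proposal above amounts to the rule below. This is a 
minimal Python model, not dnsdist code: `LatencyTracker` and its method names 
are hypothetical, timestamps are passed in explicitly, and the tracker keeps 
only the most recent sample instead of an average to keep the sketch short.

```python
IDLE_INTERVAL = 60.0  # seconds; the "let's say 60 seconds" from the quote


class LatencyTracker:
    """Hypothetical per-server tracker; all names are illustrative only."""

    def __init__(self, idle_interval=IDLE_INTERVAL):
        self.idle_interval = idle_interval
        self.latency_usec = None        # no measurement yet
        self.last_regular_query = None  # time of the last regular query

    def on_regular_response(self, now, latency_usec):
        # Regular traffic always updates the latency.
        self.last_regular_query = now
        self.latency_usec = latency_usec

    def on_health_check_response(self, now, latency_usec):
        # A health check only counts when no regular query was processed
        # within the idle interval (or none was ever processed).
        idle = (
            self.last_regular_query is None
            or now - self.last_regular_query >= self.idle_interval
        )
        if idle:
            self.latency_usec = latency_usec
        return idle
```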

I think this could work. In high(er)-traffic environments you'd still see 
plenty of regular downstream traffic, so the data would already be available. 
I could also see it being beneficial to "sample" the health checks into the 
measurement (even if that means setting some special flag): e.g. if we do 
health checks every second, 1 in 20 checks could count towards the latency 
measurements. That way there's still a periodic measurement for somewhat idle 
downstreams.
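
The sampling idea could be as simple as a counter per downstream. A small 
Python sketch of the logic only; `HealthCheckSampler` and `sample_every` are 
made-up names, and no such flag exists in dnsdist as far as this thread 
establishes.

```python
class HealthCheckSampler:
    """Lets every Nth health check feed the latency metric."""

    def __init__(self, sample_every=20):
        if sample_every < 1:
            raise ValueError("sample_every must be >= 1")
        self.sample_every = sample_every
        self.seen = 0

    def should_count(self):
        # True for health checks number N, 2N, 3N, ... and False otherwise.
        self.seen += 1
        return self.seen % self.sample_every == 0
```

With one health check per second and `sample_every=20`, an idle downstream 
would contribute one latency sample every 20 seconds.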

Best Regards,
Lucas Rolff


> On 26 Jun 2025, at 14:18, Remi Gacogne via dnsdist 
> <[email protected]> wrote:
> 
> Hi Lucas,
> 
> On 6/26/25 13:02, Lucas Rolff via dnsdist wrote:
>> dnsdist by default uses leastOutstanding load balancing policy which in 
>> certain cases takes the lowest measured latency into account based on the 
>> last 128 queries answered by the downstream
> 
> Correct, if several servers have the same number of outstanding queries their 
> latency is used to break the tie.
> 
>> My first (somewhat simple) question is, is there a way to make health checks 
>> count towards the latency measurements, currently it doesn't seem to take 
>> the health check queries into account in the latency metric. While I 
>> understand not everyone may want this, I wonder if there's some way (even if 
>> custom Lua) to make that happen.
> 
> Not without patching the code, I'm afraid.
> 
>> My second question, is more about a custom policy in Lua
>> Since latency based load balancing isn't currently a thing, this can be 
>> implemented into Lua, so that the selected downstream server will be the 
>> lowest latency (online) server.
>> This can be done by looping over the servers available, checking if the 
>> server is up using :isUp() and then using the :getLatency() to figure out 
>> the latency, this works great most of the time, however:
>> 1: If dnsdist restarts, the latency across all nodes will be super low, 
>> because it seems to use a fixed size list, where every "empty" value is `0`. 
>> As a result when the average is calculated across 128 values (many of which 
>> are zero initially), this may cause some weird routing.
> 
> True, it takes a few queries for the value to become useful.
> 
>> I wonder if there's a way to get (currently in Lua) the number of downstream 
>> queries (e.g. as exposed in `showServers()` for each individual server. I 
>> see there's a :getDrops() method available, but seemingly no :getQueries() - 
>> is there another way we can somehow get these, while still being fast enough 
>> to execute on every upstream query (when the load balancing takes place).
> 
> Not without patching the code. It would only take a few lines of C++ to make 
> it available, though.
> 
>> 2: A bit related to the first question, if we then decide to select the 
>> lowest latency server, because the other downstreams no longer get queries, 
>> we also don't get updated latency metrics, as you know sometimes routing on 
>> the interwebs change, and this may affect the latency. Thus if we could e.g. 
>> take the health checking measurements into account, this would at the same 
>> time be resolved, since we'd always have fresh data effectively.
> 
> The reason I'm not very fond of this idea is that health-check queries are 
> often not representative of actual traffic, and would thus skew the latency 
> metrics in many cases. But I get your point, while most deployments get a lot 
> of traffic and therefore don't really care about the short time it takes to 
> get useful metrics, it might be different for low-traffic deployments or for 
> backup servers. Do you think it might work if dnsdist were to update the 
> latency from health-check queries if, and only if, there was no "regular" 
> query processed by the server in a fixed interval (let's say 60 seconds? I 
> have not really thought about it). The first health-check query would then of 
> course automatically update the latency unless a "regular" query was 
> processed before the health-check succeeded.
> 
> Best regards,
> -- 
> Remi Gacogne
> PowerDNS.COM BV - https://www.powerdns.com/
> _______________________________________________
> dnsdist mailing list
> [email protected]
> https://mailman.powerdns.com/mailman/listinfo/dnsdist
