Hello Remi,

Thanks for your reply!
> Not without patching the code. It would only take a few lines of C++ to make
> it available, though.

I'll take a look at the code and see if it's something I could do myself and submit a PR for. Without having looked yet, I'd assume it's effectively the same as the `:getDrops()` one, just using a different counter.

> The reason I'm not very fond of this idea is that health-check queries are
> often not representative of actual traffic, and would thus skew the latency
> metrics in many cases. But I get your point, while most deployments get a lot
> of traffic and therefore don't really care about the short time it takes to
> get useful metrics, it might be different for low-traffic deployments or for
> backup servers. Do you think it might work if dnsdist were to update the
> latency from health-check queries if, and only if, there was no "regular"
> query processed by the server in a fixed interval (let's say 60 seconds? I
> have not really thought about it). The first health-check query would then of
> course automatically update the latency unless a "regular" query was
> processed before the health-check succeeded.

I think this could work. In higher-traffic environments you'd obviously still see much more downstream traffic, so the data would be available anyway, but I could see it being beneficial to "sample" the health checks as part of the measurement (even if that would mean setting some special flag). For example,
if we do health checks every second, 1 in 20 checks could count towards the latency measurements. That way there would still be periodic checking of somewhat idle downstreams.

Best Regards,
Lucas Rolff

> On 26 Jun 2025, at 14:18, Remi Gacogne via dnsdist
> <[email protected]> wrote:
>
> Hi Lucas,
>
> On 6/26/25 13:02, Lucas Rolff via dnsdist wrote:
>> dnsdist by default uses leastOutstanding load balancing policy which in
>> certain cases takes the lowest measured latency into account based on the
>> last 128 queries answered by the downstream
>
> Correct, if several servers have the same number of outstanding queries their
> latency is used to break the tie.
>
>> My first (somewhat simple) question is, is there a way to make health checks
>> count towards the latency measurements, currently it doesn't seem to take
>> the health check queries into account in the latency metric. While I
>> understand not everyone may want this, I wonder if there's some way (even if
>> custom Lua) to make that happen.
>
> Not without patching the code, I'm afraid.
>> My second question, is more about a custom policy in Lua
>> Since latency based load balancing isn't currently a thing, this can be
>> implemented into Lua, so that the selected downstream server will be the
>> lowest latency (online) server.
>> This can be done by looping over the servers available, checking if the
>> server is up using :isUp() and then using the :getLatency() to figure out
>> the latency, this works great most of the time, however:
>> 1: If dnsdist restarts, the latency across all nodes will be super low,
>> because it seems to use a fixed size list, where every "empty" value is `0`.
>> As a result when the average is calculated across 128 values (many of which
>> are zero initially), this may cause some weird routing.
>
> True, it takes a few queries for the value to become useful.
>
>> I wonder if there's a way to get (currently in Lua) the number of downstream
>> queries (e.g.
>> as exposed in `showServers()` for each individual server. I
>> see there's a :getDrops() method available, but seemingly no :getQueries() -
>> is there another way we can somehow get these, while still being fast enough
>> to execute on every upstream query (when the load balancing takes place).
>
> Not without patching the code. It would only take a few lines of C++ to make
> it available, though.
>> 2: A bit related to the first question, if we then decide to select the
>> lowest latency server, because the other downstreams no longer get queries,
>> we also don't get updated latency metrics, as you know sometimes routing on
>> the interwebs change, and this may affect the latency. Thus if we could e.g.
>> take the health checking measurements into account, this would at the same
>> time be resolved, since we'd always have fresh data effectively.
>
> The reason I'm not very fond of this idea is that health-check queries are
> often not representative of actual traffic, and would thus skew the latency
> metrics in many cases. But I get your point, while most deployments get a lot
> of traffic and therefore don't really care about the short time it takes to
> get useful metrics, it might be different for low-traffic deployments or for
> backup servers. Do you think it might work if dnsdist were to update the
> latency from health-check queries if, and only if, there was no "regular"
> query processed by the server in a fixed interval (let's say 60 seconds? I
> have not really thought about it). The first health-check query would then of
> course automatically update the latency unless a "regular" query was
> processed before the health-check succeeded.
>
> Best regards,
> --
> Remi Gacogne
> PowerDNS.COM BV - https://www.powerdns.com/
> _______________________________________________
> dnsdist mailing list
> [email protected]
> https://mailman.powerdns.com/mailman/listinfo/dnsdist
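A minimal C++ sketch of Remi's proposal, combined with the restart problem Lucas describes (C++ because that is what dnsdist itself is written in). It is not dnsdist's internal code: `LatencyTracker`, its methods, and the microsecond unit are hypothetical, and the 128-sample window and 60-second idle interval are simply the figures mentioned in this thread. It averages only over slots that actually hold a sample, so a freshly restarted backend reports "no data" instead of an artificially low average of zeros, and it records a health-check latency only when no regular query has been answered within the idle interval (or ever, which covers the first health check after a restart).

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <optional>

// Hypothetical per-backend latency tracker, sketched for illustration only;
// it is NOT dnsdist's actual implementation or API.
class LatencyTracker {
public:
  using Clock = std::chrono::steady_clock;
  static constexpr std::size_t kWindow = 128; // window size mentioned in the thread

  explicit LatencyTracker(std::chrono::seconds idleInterval = std::chrono::seconds(60))
      : idleInterval_(idleInterval) {}

  // Latency of a regular (client-triggered) query, in microseconds (assumed unit).
  void recordQuery(double latencyUsec, Clock::time_point now) {
    push(latencyUsec);
    lastRegularQuery_ = now;
    ++queries_;
  }

  // A health-check latency is only recorded if no regular query has been
  // answered within idleInterval_ (or none at all yet).
  // Returns true when the sample was recorded.
  bool recordHealthCheck(double latencyUsec, Clock::time_point now) {
    if (lastRegularQuery_ && now - *lastRegularQuery_ < idleInterval_) {
      return false; // recent real traffic: keep the metric traffic-based
    }
    push(latencyUsec);
    return true;
  }

  // Mean over the slots actually filled, so unused slots are not counted as 0.
  // Returns std::nullopt when there is no sample yet (e.g. right after a restart).
  std::optional<double> averageUsec() const {
    if (filled_ == 0) {
      return std::nullopt;
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < filled_; ++i) {
      sum += samples_[i];
    }
    return sum / static_cast<double>(filled_);
  }

  std::size_t sampleCount() const { return filled_; }
  std::size_t queryCount() const { return queries_; } // regular queries only

private:
  // Ring-buffer insert: overwrite the oldest slot once the window is full.
  void push(double latencyUsec) {
    samples_[next_] = latencyUsec;
    next_ = (next_ + 1) % kWindow;
    if (filled_ < kWindow) {
      ++filled_;
    }
  }

  std::array<double, kWindow> samples_{};
  std::size_t next_ = 0;
  std::size_t filled_ = 0; // slots holding a real sample
  std::size_t queries_ = 0;
  std::optional<Clock::time_point> lastRegularQuery_;
  std::chrono::seconds idleInterval_;
};
```

A caller would feed `recordQuery()` from the response path and `recordHealthCheck()` from the health-check path, then compare `averageUsec()` across backends that are up, skipping those that return no value. `queryCount()` stands in for the per-server query counter that, per the thread, is not currently exposed to Lua. Lucas's alternative of counting only 1 in 20 health checks would replace the idle-interval test in `recordHealthCheck()` with a simple modulo counter.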
