On 12/06/18 12:21, Mark Fermor, HolidayExtras.com wrote: > Hello, > > Running dnsmasq with these options: > /usr/sbin/dnsmasq -k --cache-size=50 --log-facility=- --user=nobody > --group=nobody --no-hosts --neg-ttl=60 --max-ttl=240 --max-cache-ttl=300 > > No local dnsmasq config file so that's literally all the config other > than defaults applied by dnsmasq > > dnsmasq -v > Dnsmasq version 2.78 Copyright (c) 2000-2017 Simon Kelley > Compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 > no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify > > This is something running running in Kubernetes, they run as sidekick > containers to the main application container. I have multiple of the > same deployment running in the cluster, so they're all at the same > versions and receiving equal amounts of traffic via load balancing. They > all talk to the same endpoints outbound and do the same work load etc. > I've sent sigusr1 signal to all of the pods individually (all pods have > been running for approx 48 hours bar pod4 which has been running less > hours): > > pod1 > I0608 15:11:34.998127 1 nanny.go:116] dnsmasq[19]: time 1528470694 > I0608 15:11:34.998169 1 nanny.go:116] dnsmasq[19]: cache size 50, > 0/2267416 cache insertions re-used unexpired cache entries. > I0608 15:11:34.998175 1 nanny.go:116] dnsmasq[19]: queries > forwarded 3218560, queries answered locally 3182486 > I0608 15:11:34.998180 1 nanny.go:116] dnsmasq[19]: queries for > authoritative zones 0 > I0608 15:11:34.998184 1 nanny.go:116] dnsmasq[19]: server > 10.227.240.10#53: queries sent 3218560, retried or failed 16 > > pod2 > I0608 15:11:35.909168 1 nanny.go:116] dnsmasq[18]: time 1528470695 > I0608 15:11:35.909206 1 nanny.go:116] dnsmasq[18]: cache size 50, > 0/197465 cache insertions re-used unexpired cache entries. > I0608 15:11:35.909211 1 nanny.go:116] dnsmasq[18]: queries > forwarded 240843, queries answered locally 6159789 > I0608 15:11:35.909216 1 nanny.go:116] dnsmasq[18]: queries for > authoritative zones 0 > I0608 15:11:35.909219 1 nanny.go:116] dnsmasq[18]: server > 10.227.240.10#53: queries sent 240843, retried or failed 4 > > pod3 > I0608 15:11:36.948015 1 nanny.go:116] dnsmasq[20]: time 1528470696 > I0608 15:11:36.948083 1 nanny.go:116] dnsmasq[20]: cache size 50, > 0/63648 cache insertions re-used unexpired cache entries. > I0608 15:11:36.948138 1 nanny.go:116] dnsmasq[20]: queries > forwarded 46004, queries answered locally 6347223 > I0608 15:11:36.948188 1 nanny.go:116] dnsmasq[20]: queries for > authoritative zones 0 > I0608 15:11:36.948219 1 nanny.go:116] dnsmasq[20]: server > 10.227.240.10#53: queries sent 46004, retried or failed 1 > > pod4 > I0608 15:11:38.032330 1 nanny.go:116] dnsmasq[24]: time 1528470698 > I0608 15:11:38.032374 1 nanny.go:116] dnsmasq[24]: cache size 50, > 0/1359727 cache insertions re-used unexpired cache entries. > I0608 15:11:38.032382 1 nanny.go:116] dnsmasq[24]: queries > forwarded 1939395, queries answered locally 742411 > I0608 15:11:38.032388 1 nanny.go:116] dnsmasq[24]: queries for > authoritative zones 0 > I0608 15:11:38.032394 1 nanny.go:116] dnsmasq[24]: server > 10.227.240.10#53: queries sent 1939395, retried or failed 7 > > > The problem I notice, is some pods (pod1, pod2, pod4) are forwarding far > more requests than the other pods (an indication of what other is, would > be pod3). I'm not sure what's caused this seeing as the application is > doing the same across all pods. The only thing I notice here, is that > pods 1/2/4 all have a number of "retried or failed", which the other > pods don't. Therefore I wonder if the reason that those pods have sent > so many more requests upstream instead of hitting the cache, is because > of something to do with a "retried or failed", which then stops the > cache from working for a decent period of time? I've not been able to > find anything (google) relating to this scenario but it's the only thing > that makes sense to me right now. I can accept a couple of failures for > lookup here and there, but one failure (if i'm onto something that is), > seems to then cause no cache hits for a large period of time? >
There's no reason why a retry would affect the cache, so I think you're jumping to conclusions there. The "retried or failed" counts are in the noise: don't forget that DNS uses UDP transport by default, so those numbers arise from very rare dropped UDP packets - nothing to worry about. As for the differences in cache hit rates - it could be subtle synchronisation effects, especialy as all the dnsmasq instances point to the same upstream. If one pod hits a particular DNS name first, then that name will end up with a longer TTL in that pod than the others, which get the record later from the upstream after some of the TTL has run down. If something makes that happen a lot, that would affect the stats. TL;DR There doesn't appear to be a problem. Simon. _______________________________________________ Dnsmasq-discuss mailing list Dnsmasq-discuss@lists.thekelleys.org.uk http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss