On 2018-10-30 1:50 a.m., Nick Urbanik wrote:
Dear Marc,
Thank you for your reply.
On 29/10/18 10:14 -0400, Marc Branchaud via Unbound-users wrote:
On 2018-10-28 3:20 p.m., Nick Urbanik via Unbound-users wrote:
On 25/10/18 18:10 +1100, Nick Urbanik via Unbound-users wrote:
I am puzzled by the behaviour of our multi-level DNS system which
answered many queries for names having shorter TTLs with SERVFAIL.
I mean that SERVFAILs went up to 50% of replies, and current names
with TTLs of around 300 failed to be fetched by the resolvers, the last
DNS servers in the chain. What I mean is that adding these two
configuration options (serve-expired: "yes" and cache-min-ttl: 30)
caused an outage. I am trying to understand why.
Any ideas in understanding the mechanism would be very welcome.
We use 1.6.8 with both those settings, and observed prolonged SERVFAIL
periods.
In our case, the upstream server became inaccessible for a period of
time, but when contact resumed the SERVFAILs persisted.
This behaviour was quite catastrophic, and to me, unexpected.
Do you have any idea of the mechanism behind this failure?
Is there a way to deal better with zero TTL names?
We reduced the infra-host-ttl value to compensate.
(Sorry for my slow response -- this slipped through the cracks.)
Did that bring your system to a functioning condition?
Yes & no. We reduced infra-host-ttl to 30 seconds, which means that we
are only affected by this for (up to) 30 seconds after upstream access
returns. That is adequate for our purposes.
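
For illustration, that change might look like the unbound.conf fragment
below. This is a minimal sketch, assuming the option sits in the server:
clause next to the other settings mentioned in this thread. Only the
30-second value and the 900-second default come from the messages here.

```
server:
    # Lifetime of host cache entries (round-trip timing, lameness,
    # EDNS support). Lowered from the 900-second default to 30 so an
    # upstream that was unreachable is retried sooner.
    infra-host-ttl: 30
```

The trade-off is that a shorter lifetime means upstream hosts are
re-probed more often.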
So I think the mechanism is pretty clear, and I think it's good for
unbound to cache the upstream server's status for a period of time. I'm
just not convinced that 900 seconds is a reasonable default time.
(BTW, our case has nothing to do with zero TTL names: The IP address
configured as the zone's forward-addr became inaccessible. No names
involved. That said, I do not know how unbound deals with 0-TTL names.)
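
A forwarding setup of this kind might be sketched as follows. The zone
name "." and the address 192.0.2.53 are hypothetical placeholders, not
values from our configuration.

```
forward-zone:
    # Hypothetical zone and upstream address, for illustration only.
    name: "."
    forward-addr: 192.0.2.53
```

With a single forward-addr and no fallback configured, queries for the
zone have nowhere else to go while that address is considered down.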
I do not think our case is a bug. It also has nothing to do with
serve-expired or cache-min-ttl. But since we use those settings, I
wanted to relate our experience with a confusing SERVFAIL situation.
In your multi-level system, are you 100% sure that all the forward-addr
IPs are *always* accessible? If they are, then you may be seeing
SERVFAILs for a different reason.
M.
(Why is infra-host-ttl's default 900 seconds? That seems like a long
time to wait to retry the upstream server.)
M.
By multilevel, I mean clients talk to one server, which forwards to
another, and for some clients, there is a third level of caching.
So it was unwise to add:
    serve-expired: "yes"
    cache-min-ttl: 30
to the server section of these DNS servers running unbound 1.6.8 on
up to date RHEL 7? Please could anyone cast some light on why this
was so? I will be spending some time examining the cause.
If you need more information, please let me know.