Dear Marc and anyone else interested in why severe outages can be
caused by serve-expired: "yes" and cache-min-ttl: 30:
On 13/11/18 10:56 -0500, Marc Branchaud via Unbound-users wrote:
On 2018-10-30 1:50 a.m., Nick Urbanik wrote:
On 29/10/18 10:14 -0400, Marc Branchaud via Unbound-users wrote:
On 2018-10-28 3:20 p.m., Nick Urbanik via Unbound-users wrote:
On 25/10/18 18:10 +1100, Nick Urbanik via Unbound-users wrote:
I am puzzled by the behaviour of our multi-level DNS system, which
answered many queries for names with shorter TTLs with SERVFAIL.
SERVFAILs rose to 50% of replies, and current names with TTLs of
around 300 seconds failed to be fetched by the resolvers at the last
level of the chain. In short, adding these two configuration options
(serve-expired: "yes" and cache-min-ttl: 30) caused an outage, and I
am trying to understand why.
Any ideas about the mechanism would be very welcome.
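For clarity, the change amounted to the following in the server
section of unbound.conf; the comments are my reading of the 1.6.8 man
page, a sketch rather than a definitive description:

    server:
        # attempt to serve stale records from cache, with a TTL of 0
        # in the reply, rather than failing while resolution runs
        serve-expired: "yes"
        # never cache an RRset or message for less than 30 seconds,
        # even when the upstream TTL is shorter, including zero
        cache-min-ttl: 30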
We use 1.6.8 with both those settings, and observed prolonged SERVFAIL
periods.
In our case, the upstream server became inaccessible for a period of
time, but when contact resumed the SERVFAILs persisted.
This behaviour was quite catastrophic and, to me, unexpected. And
career-affecting.
Do you have any idea of the mechanism behind this failure?
Is there a way to deal better with zero TTL names?
We reduced the infra-host-ttl value to compensate.
(Sorry for my slow response -- this slipped through the cracks.)
Did that bring your system to a functioning condition?
Yes & no. We reduced infra-host-ttl to 30 seconds, which means that we
are only affected by this for (up to) 30 seconds after upstream access
returns. That is adequate for our purposes.
So I think the mechanism is pretty clear, and I think it's good for
unbound to cache the upstream server's status for a period of time. I'm
just not convinced that 900 seconds is a reasonable default time.
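If I have understood Marc's workaround correctly, it amounts to a
single line (my reconstruction, not his actual configuration):

    server:
        # entries in the host cache (round-trip timing, lameness,
        # EDNS support) live for 900 seconds by default; this retries
        # an apparently-down upstream after 30 seconds instead
        infra-host-ttl: 30

As I read unbound-control(8), flush_infra all would also drop that
cached state immediately once the upstream is reachable again.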
(BTW, our case has nothing to do with zero TTL names: The IP address
configured as the zone's forward-addr became inaccessible. No names
involved. That said, I do not know how unbound deals with 0-TTL names.)
I do not think our case is a bug. It also has nothing to do with
serve-expired or cache-min-ttl. But since we use those settings, I
wanted to relate our experience with a confusing SERVFAIL situation.
How busy are your systems?
In your multi-level system, are you 100% sure that all the
forward-addr IPs are *always* accessible? If they are, then you may
be seeing SERVFAILs for a different reason.
Absolutely; they are all in our local network. And when I removed
those two configuration values, everything came back to normal
behaviour almost immediately. Perhaps a distinguishing factor is that
some of these systems handle on the order of 50,000 mixed queries per
second.
The result was so unexpected and so severe that I categorise our
situation as the result of a very serious bug. Tomorrow there will be
repercussions for me personally.
Defining the bug is complicated by the fact that, before this
happened, I had chosen to change jobs within the same company, so I
no longer have access to these systems to test the effects of those
configuration values. I don't know whether it was one, the other, or a
combination of both that caused the problem. Perhaps no one but me
wants to find out.
(Why is infra-host-ttl's default 900 seconds? That seems like a long
time to wait to retry the upstream server.)
M.
By multi-level, I mean that clients talk to one server, which forwards
to another, and for some clients there is a third level of caching.
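As an illustration only (the address is from the documentation range,
not any real network), each forwarding level points at the next with
a stanza along these lines:

    server:
        # ordinary caching-resolver options for this level go here
    forward-zone:
        name: "."
        # hypothetical address of the next resolver in the chain
        forward-addr: 192.0.2.53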
So it was unwise to add:

    serve-expired: "yes"
    cache-min-ttl: 30

to the server section of these DNS servers running unbound 1.6.8 on
up-to-date RHEL 7?
Hint: the answer is an unreserved "YES!".
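If anyone is in a position to experiment (I no longer am), my guess at
a safer combination, based purely on Marc's experience above and
therefore untested by me, would be:

    server:
        serve-expired: "yes"
        # leave the minimum TTL at its default of 0, so upstream
        # TTLs, including zero TTLs, are honoured as-is
        cache-min-ttl: 0
        # retry an unresponsive upstream after 30 seconds rather
        # than the 900-second default
        infra-host-ttl: 30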
Please could anyone cast some light on why this
was so? I will be spending some time examining the cause.
If you need more information, please let me know.
--
Nick Urbanik http://nicku.org [email protected]
GPG: 7FFA CDC7 5A77 0558 DC7A 790A 16DF EC5B BB9D 2C24 ID: BB9D2C24