Re: [DNSOP] A question on values in draft-dnsop-caching-resolution-failures

2023-07-26 Thread Tim Wicinski
Duane/Evan/Mukund/All,

What do you feel is the consensus on lowering the value to 1 second?
From the previously suggested text:

Resolvers MUST cache resolution failures for at least 1 second.
The initial duration SHOULD be configurable by the operator.  A
longer cache duration for resolution failures will reduce the
processing burden from repeated queries, but will also lengthen
the recovery period from transitory issues.

It does sound like this paragraph works better.

Resolvers SHOULD employ an exponential or linear backoff algorithm to increase
the amount of time for subsequent resolution failures.  For example,
the initial time for negatively caching a resolution failure is set
to 5 seconds.  The time is increased after each retry that results in
another resolution failure.  Consistent with [RFC2308], resolution
failures MUST NOT be cached for longer than 5 minutes.
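
Taken together, the two paragraphs above amount to a small policy: a floor
of 1 second, an operator-configurable starting value, growth after each
repeated failure, and a ceiling of 5 minutes. A minimal sketch of that
policy follows. The class name, field names, and the `factor` multiplier
are hypothetical and do not come from the draft; the sketch also assumes
multiplicative growth, which is only one of the options the text allows.

```python
from dataclasses import dataclass

MIN_TTL = 1.0    # seconds: the "at least 1 second" floor in the suggested text
MAX_TTL = 300.0  # seconds: the 5-minute ceiling tied to [RFC2308]


@dataclass(frozen=True)
class FailureCachePolicy:
    """Hypothetical operator-facing settings for a resolution-failure cache."""
    initial: float = 1.0    # operator-configurable starting duration
    maximum: float = 300.0  # operator-configurable ceiling, never above MAX_TTL
    factor: float = 2.0     # assumed growth multiplier; 1.0 keeps the duration flat

    def __post_init__(self):
        if self.initial < MIN_TTL:
            raise ValueError("initial duration must be at least 1 second")
        if not (self.initial <= self.maximum <= MAX_TTL):
            raise ValueError("maximum must lie between initial and 300 seconds")
        if self.factor < 1.0:
            raise ValueError("factor must not shrink the duration")

    def duration(self, consecutive_failures: int) -> float:
        """Cache duration to apply after the given number of failures in a row."""
        if consecutive_failures < 1:
            raise ValueError("need at least one failure")
        d = self.initial
        for _ in range(consecutive_failures - 1):
            d = min(d * self.factor, self.maximum)
            if d == self.maximum:
                break  # already at the ceiling; further failures change nothing
        return d


policy = FailureCachePolicy()
print(policy.duration(1), policy.duration(3), policy.duration(10))
```

Rejecting out-of-range settings at construction time is a design choice in
this sketch; an implementation could equally clamp them silently.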

May we get some feedback on this?

thanks

tim


On Mon, Jul 24, 2023 at 11:41 AM Evan Hunt  wrote:

> On Mon, Jul 24, 2023 at 06:26:46PM +, Wessels, Duane wrote:
> > It was not our intention that “2” would be the only possible exponent in
> > the backoff algorithm.  Would this slightly revised text be more
> > agreeable?
> >
> >Resolvers SHOULD employ an exponential or linear backoff algorithm to
> >increase the amount of time for subsequent resolution failures.  For
> >example, the initial time for negatively caching a resolution failure
> >is set to 5 seconds.  The time is increased after each retry that
> >results in another resolution failure.  Consistent with [RFC2308],
> >resolution failures MUST NOT be cached for longer than 5 minutes.
>
> That's definitely an improvement, yes.
>
> --
> Evan Hunt -- e...@isc.org
> Internet Systems Consortium, Inc.
>
___
DNSOP mailing list
DNSOP@ietf.org
https://www.ietf.org/mailman/listinfo/dnsop


Re: [DNSOP] A question on values in draft-dnsop-caching-resolution-failures

2023-07-24 Thread Evan Hunt
On Mon, Jul 24, 2023 at 06:26:46PM +, Wessels, Duane wrote:
> It was not our intention that “2” would be the only possible exponent in
> the backoff algorithm.  Would this slightly revised text be more
> agreeable?
> 
>Resolvers SHOULD employ an exponential or linear backoff algorithm to
>increase the amount of time for subsequent resolution failures.  For
>example, the initial time for negatively caching a resolution failure
>is set to 5 seconds.  The time is increased after each retry that
>results in another resolution failure.  Consistent with [RFC2308],
>resolution failures MUST NOT be cached for longer than 5 minutes.

That's definitely an improvement, yes.

-- 
Evan Hunt -- e...@isc.org
Internet Systems Consortium, Inc.



Re: [DNSOP] A question on values in draft-dnsop-caching-resolution-failures

2023-07-24 Thread Wessels, Duane
Evan,

> On Jul 24, 2023, at 10:34 AM, Evan Hunt  wrote:
> 
> The original text says a series of seven resolution failures would increase
> the duration before a retry to five minutes: 5 seconds to 10 to 20 to 40 to
> 80 to 160 to 300. Lowering the starting value to one second means it would
> take nine failures to reach 300.
> 

It was not our intention that “2” would be the only possible exponent in the
backoff algorithm.  Would this slightly revised text be more agreeable?

   Resolvers SHOULD employ an exponential or linear backoff algorithm to increase
   the amount of time for subsequent resolution failures.  For example,
   the initial time for negatively caching a resolution failure is set
   to 5 seconds.  The time is increased after each retry that results in
   another resolution failure.  Consistent with [RFC2308], resolution
   failures MUST NOT be cached for longer than 5 minutes.


DW



Re: [DNSOP] A question on values in draft-dnsop-caching-resolution-failures

2023-07-24 Thread Evan Hunt
On Mon, Jul 24, 2023 at 10:00:37AM +0530, Mukund Sivaraman wrote:
> When seeing prescriptive text, implementors often want to know the
> rationale behind it. If the value of 5 is changed to 1, please mention
> and have the authors include in the document why the lower limit is
> 1s. Is it an arbitrary change? Is this change based on the default value
> of BIND's servfail-ttl named.conf option?

Yes, it is.

For background: BIND implemented a SERVFAIL cache in 2014 with a default
cache duration of 10 seconds; after a slew of complaints, in 2015 we
lowered it to 1 second, and also reduced the configurable maximum from
5 minutes to 30 seconds. The reason was that certain common failure
conditions are transitory, and it's not unreasonable to prioritize
rapid recovery.

Now, to be clear, the comparison isn't exactly apples to apples: the BIND
SERVFAIL cache is a somewhat stupider mechanism than the one outlined in
the draft. It caches *all* SERVFAIL responses, regardless of the reason
they were generated. For example: when the cache is cold, a query may time
out or hit DDoS mitigation limits before it's finished getting through the
whole iteration process; an immediate retry would start further along the
delegation chain and would succeed. Such problems weren't noticeable until
we implemented the 10-second cache, but became very noticeable afterward.
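
To make the "caches everything for a fixed time" behaviour concrete, here
is a minimal sketch of such a cache. It is not BIND's code: the class and
method names are invented for illustration, and the only details taken
from the description above are that every failure is cached regardless of
cause and that the duration is a single fixed value.

```python
import time


class SimpleFailureCache:
    """Illustrative fixed-duration failure cache keyed by (name, type)."""

    def __init__(self, ttl=1.0, clock=time.monotonic):
        self.ttl = ttl        # one duration for every failure, whatever its cause
        self.clock = clock    # injectable so the behaviour can be tested
        self._expiry = {}     # (qname, qtype) -> time at which the entry expires

    def record_failure(self, qname, qtype):
        """Remember that resolving (qname, qtype) just failed."""
        self._expiry[(qname.lower(), qtype)] = self.clock() + self.ttl

    def is_cached_failure(self, qname, qtype):
        """Return True while a recorded failure is still within its duration."""
        key = (qname.lower(), qtype)
        expiry = self._expiry.get(key)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._expiry[key]  # expired: allow a fresh resolution attempt
            return False
        return True
```

With a 10-second duration, a query that failed only because the cache was
cold keeps returning the cached failure for the full 10 seconds, even
though an immediate retry would have succeeded.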

If we were able to selectively cache *only* those SERVFAILs that are
unlikely to recover soon, then five seconds might indeed be a good starting
point. But, with our relatively dumb cache, we found that one second did a
fairly good job reducing the processing burden from repeated queries, and
eliminated the user complaints about the resolver taking forever to recover
from short-lived problems. It's been working well enough that it hasn't
been a priority to develop a more complex failure cache.

In any case, even with the assumption that future implementations *will*
have better selectiveness, I'm leery of using 5 seconds as a hard minimum in
an RFC.  I think it's likely that some operators will find that excessive
and want the option to tune it to a lower value.

Also, if you *are* doing exponential backoff, then two failures in a row
will get your duration up to 4 seconds anyway, so the difference between
starting at 1 and starting at 5 isn't really all that significant.

> > * Note that the original text has this as SHOULD. I've heard reasons for
> > both SHOULD and MAY.
> 
> What are these reasons?

I suggested MAY because I think exponential backoff is a pretty specific
(and rather aggressive) approach to cache timing, and I'm not entirely
comfortable with it having the almost-mandatory force of a SHOULD.

The original text says a series of seven resolution failures would increase
the duration before a retry to five minutes: 5 seconds to 10 to 20 to 40 to
80 to 160 to 300. Lowering the starting value to one second means it would
take nine failures to reach 300.
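
The two sequences can be reproduced with a short sketch. The function name
and parameters are illustrative only; it assumes the duration doubles
after every failure and is clamped to the 300-second ceiling.

```python
def doubling_schedule(initial, cap=300):
    """List the cache durations (seconds) until the cap is first reached."""
    if initial <= 0:
        raise ValueError("initial duration must be positive")
    durations = []
    current = initial
    while current < cap:
        durations.append(current)
        current *= 2
    durations.append(cap)  # the first value at or above the cap is clamped
    return durations


print(doubling_schedule(5))  # [5, 10, 20, 40, 80, 160, 300]
print(doubling_schedule(1))  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 300]
```

Starting at 5 seconds, the ceiling is the seventh entry, reached after six
doublings. Starting at 1 second, it is the tenth entry, reached after nine
doublings.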

IMHO, keeping the recovery period flat, or increasing it linearly (5, 10,
15, etc), could also be operationally reasonable choices, so I'm not sure
why we need to be so emphatic about *this* particular backoff strategy in
the RFC.
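
For comparison, the three strategies mentioned here can be put side by
side. This is only a sketch: the function name and the strategy labels are
made up for illustration, and the 300-second cap is the ceiling discussed
in this thread.

```python
def cache_duration(strategy, initial, failures, cap=300):
    """Duration (seconds) after `failures` consecutive failures, failures >= 1."""
    if failures < 1:
        raise ValueError("need at least one failure")
    if strategy == "flat":
        duration = initial                        # 5, 5, 5, ...
    elif strategy == "linear":
        duration = initial * failures             # 5, 10, 15, ...
    elif strategy == "exponential":
        duration = initial * 2 ** (failures - 1)  # 5, 10, 20, ...
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return min(duration, cap)


for name in ("flat", "linear", "exponential"):
    print(name, [cache_duration(name, 5, n) for n in range(1, 8)])
```

With a 5-second start, the linear schedule is still at 35 seconds after
seven failures, while the doubling schedule has already hit the cap.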

I have no objection to mentioning it, but it felt like a MAY to me.  It's a
mild preference though, and if I'm the only one who feels that way, I won't
argue about it further.

-- 
Evan Hunt -- e...@isc.org
Internet Systems Consortium, Inc.



Re: [DNSOP] A question on values in draft-dnsop-caching-resolution-failures

2023-07-23 Thread Mukund Sivaraman
Hi Tim

On Sun, Jul 23, 2023 at 09:00:58PM -0700, Tim Wicinski wrote:
> There was some operational feedback that suggests 1 second is also
> a very reasonable value here.  With some discussion, here is some
> suggested text:
> 
> Resolvers MUST cache resolution failures for at least 1 second.

When seeing prescriptive text, implementors often want to know the
rationale behind it. If the value of 5 is changed to 1, please mention
and have the authors include in the document why the lower limit is
1s. Is it an arbitrary change? Is this change based on the default value
of BIND's servfail-ttl named.conf option?

Sometimes the reason for decisions is found in the mailing list
archives, but not always.

> The initial duration SHOULD be configurable by the operator.  A

[snip]

> * Note that the original text has this as SHOULD. I've heard reasons for
> both SHOULD and MAY.

What are these reasons?

Mukund




Re: [DNSOP] A question on values in draft-dnsop-caching-resolution-failures

2023-07-23 Thread Tim Wicinski
Duane

On Sun, Jul 23, 2023 at 9:20 PM Wessels, Duane wrote:

> Tim,
>
> You said you received some operational feedback.  I wonder if it would be
> appropriate to add this operational (or implementation?) feedback to the
> (currently empty) Implementation Status section that Peter van Dijk
> suggested we add, in his DNS directorate review?
>

This seems very reasonable.  I'll work with them on this.


> I’m not necessarily opposed to reducing the minimum caching time from 5 to
> 1, especially if we can document valid reasons for doing so.  However, I do
> think it is going a bit too far to weaken both the minimum caching time and
> the requirement level for exponential backoff.  So I would really argue to
> keep the SHOULD in the second paragraph.
>
> Alternatively, we might consider something like 5 seconds without an
> exponential backoff implementation OR an initial 1 second cache time with
> an exponential backoff.
>

My first opinion is to keep the SHOULD, and let folks speak up if they feel
this is wrong.

I liked the way you worded the timeout range in section 3.1.

I do think/feel the value should be configurable by the operator.

thanks
tim



> DW
>
>
>
> > On Jul 23, 2023, at 9:00 PM, Tim Wicinski  wrote:
> >
> >
> >
> > All,
> >
> > We had a discussion this morning during the hackathon about a value with
> > the document caching-resolution-failures.  The current text in 3.2 says:
> >
> > Resolvers MUST cache resolution failures for at least 5 seconds.  The
> > value of 5 seconds is chosen as a reasonable amount of time that an
> > end user could be expected to wait.
> >
> > Resolvers SHOULD employ an exponential backoff algorithm to increase
> > the amount of time for subsequent resolution failures.  For example,
> > the initial time for negatively caching a resolution failure is set
> > to 5 seconds.  The time is doubled after each retry that results in
> > another resolution failure.  Consistent with [RFC2308], resolution
> > failures MUST NOT be cached for longer than 5 minutes.
> >
> >
> > There was some operational feedback that suggests 1 second is also
> > a very reasonable value here.  With some discussion, here is some
> > suggested text:
> >
> > Resolvers MUST cache resolution failures for at least 1 second.
> > The initial duration SHOULD be configurable by the operator.  A
> > longer cache duration for resolution failures will reduce the
> > processing burden from repeated queries, but will also lengthen
> > the recovery period from transitory issues.
> >
> > Resolvers MAY* employ an exponential backoff algorithm to increase
> > the cache duration when resolution failures are persistent.  For
> > example, the initial time for negatively caching a resolution
> > failure could be set to 5 seconds, and doubled after each retry
> > that results in another resolution failure, up to a configurable
> > maximum.
> >
> > Consistent with [RFC2308], resolution failures MUST NOT be cached
> > for longer than 5 minutes.
> > ---
> >
> > * Note that the original text has this as SHOULD. I've heard reasons for
> > both SHOULD and MAY.
> >
> > We'd like to hear from the working group on this value, and what the
> > working group thinks of this change
> >
> > thanks
> > tim
> >


Re: [DNSOP] A question on values in draft-dnsop-caching-resolution-failures

2023-07-23 Thread Wessels, Duane
Tim,

You said you received some operational feedback.  I wonder if it would be 
appropriate to add this operational (or implementation?) feedback to the 
(currently empty) Implementation Status section that Peter van Dijk suggested 
we add, in his DNS directorate review?

I’m not necessarily opposed to reducing the minimum caching time from 5 to 1, 
especially if we can document valid reasons for doing so.  However, I do think 
it is going a bit too far to weaken both the minimum caching time and the 
requirement level for exponential backoff.  So I would really argue to keep the 
SHOULD in the second paragraph.

Alternatively, we might consider something like 5 seconds without an 
exponential backoff implementation OR an initial 1 second cache time with an 
exponential backoff.

DW


 
> On Jul 23, 2023, at 9:00 PM, Tim Wicinski  wrote:
> 
> 
> 
> All,
> 
> We had a discussion this morning during the hackathon about a value with 
> the document caching-resolution-failures.  The current text in 3.2 says:
> 
> Resolvers MUST cache resolution failures for at least 5 seconds.  The
> value of 5 seconds is chosen as a reasonable amount of time that an
> end user could be expected to wait.
> 
> Resolvers SHOULD employ an exponential backoff algorithm to increase
> the amount of time for subsequent resolution failures.  For example,
> the initial time for negatively caching a resolution failure is set
> to 5 seconds.  The time is doubled after each retry that results in
> another resolution failure.  Consistent with [RFC2308], resolution
> failures MUST NOT be cached for longer than 5 minutes.
> 
> 
> There was some operational feedback that suggests 1 second is also 
> a very reasonable value here.  With some discussion, here is some 
> suggested text:
> 
> Resolvers MUST cache resolution failures for at least 1 second.
> The initial duration SHOULD be configurable by the operator.  A
> longer cache duration for resolution failures will reduce the
> processing burden from repeated queries, but will also lengthen
> the recovery period from transitory issues.
> 
> Resolvers MAY* employ an exponential backoff algorithm to increase
> the cache duration when resolution failures are persistent.  For
> example, the initial time for negatively caching a resolution
> failure could be set to 5 seconds, and doubled after each retry
> that results in another resolution failure, up to a configurable
> maximum.
> 
> Consistent with [RFC2308], resolution failures MUST NOT be cached
> for longer than 5 minutes.
> ---
> 
> * Note that the original text has this as SHOULD. I've heard reasons for both 
> SHOULD and MAY. 
> 
> We'd like to hear from the working group on this value, and what the working 
> group thinks of this change
> 
> thanks
> tim
> 


[DNSOP] A question on values in draft-dnsop-caching-resolution-failures

2023-07-23 Thread Tim Wicinski
All,

We had a discussion this morning during the hackathon about a value with
the document caching-resolution-failures.  The current text in 3.2 says:

Resolvers MUST cache resolution failures for at least 5 seconds.  The
value of 5 seconds is chosen as a reasonable amount of time that an
end user could be expected to wait.

Resolvers SHOULD employ an exponential backoff algorithm to increase
the amount of time for subsequent resolution failures.  For example,
the initial time for negatively caching a resolution failure is set
to 5 seconds.  The time is doubled after each retry that results in
another resolution failure.  Consistent with [RFC2308], resolution
failures MUST NOT be cached for longer than 5 minutes.


There was some operational feedback that suggests 1 second is also
a very reasonable value here.  With some discussion, here is some
suggested text:

Resolvers MUST cache resolution failures for at least 1 second.
The initial duration SHOULD be configurable by the operator.  A
longer cache duration for resolution failures will reduce the
processing burden from repeated queries, but will also lengthen
the recovery period from transitory issues.

Resolvers MAY* employ an exponential backoff algorithm to increase
the cache duration when resolution failures are persistent.  For
example, the initial time for negatively caching a resolution
failure could be set to 5 seconds, and doubled after each retry
that results in another resolution failure, up to a configurable
maximum.

Consistent with [RFC2308], resolution failures MUST NOT be cached
for longer than 5 minutes.
---

* Note that the original text has this as SHOULD. I've heard reasons for
both SHOULD and MAY.

We'd like to hear from the working group on this value, and what the
working group thinks of this change

thanks
tim