Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
On Tue, Mar 26, 2019 at 2:19 PM Dave Lawrence wrote: > > On the other hand I have direct operational experience that says if a > problem is being caused not by a generalized DOS or other transient > network issue, then it can indeed take multiple days to resolve. > Start of a long weekend? Trying to reach the right people to fix it? > Surely you've experienced customers not responding quite as quickly to > fix their problems as you'd like. > > So I'm not so keen on one day, but could see dropping the > recommendation to 3. It is, after all, still just a recommendation > and one that should be configurable. > Yes, I remember that this has happened a few times at large scale during the last few years. I'm a bit worried if it would still cause more problems than it would solve. ___ DNSOP mailing list DNSOP@ietf.org https://www.ietf.org/mailman/listinfo/dnsop
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
On Tue, Mar 26, 2019 at 12:48 PM Tony Finch wrote:
>> I think the suggested max stale timer of 7 days is excessive. The aim is
>> to cope with an outage, so I think 1 day is much more reasonable (though I
>> have configured my servers with a 1 hour limit).

Olli Vanhoja writes:
> I agree. At least based on my own experience, all the network or other
> downtime issues I have experienced last only a few minutes.

Okay, I agree a little that 7 days is probably excessive as a recommendation, though not harmful. I also agree that in most instances where serve-stale has already proven itself to be useful, the events are fairly short-lived.

On the other hand I have direct operational experience that says if a problem is being caused not by a generalized DOS or other transient network issue, then it can indeed take multiple days to resolve. Start of a long weekend? Trying to reach the right people to fix it? Surely you've experienced customers not responding quite as quickly to fix their problems as you'd like.

So I'm not so keen on one day, but could see dropping the recommendation to 3 days. It is, after all, still just a recommendation and one that should be configurable.

> If there is a downtime longer than that and it's only affecting DNS,
> I would seriously consider changing my service providers and
> vendors, whatever is the issue.

Right! But in the meantime, until that change is done ...
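To make the knob under discussion concrete, here is a minimal sketch of a configurable max stale timer. It is an illustration only, not text from the draft or code from any resolver: the names CacheEntry and usable_as_stale are hypothetical, and the 3-day default simply mirrors the figure floated above.

```python
from dataclasses import dataclass

DAY = 86400  # seconds


@dataclass
class CacheEntry:
    stored_at: float  # time the answer was cached, in seconds
    ttl: int          # TTL supplied by the authoritative answer


def usable_as_stale(entry: CacheEntry, now: float, max_stale: int = 3 * DAY) -> bool:
    """Return True if the entry has expired but is still inside the
    max-stale window (hypothetical helper; the default is only a suggestion)."""
    expired_at = entry.stored_at + entry.ttl
    return expired_at <= now < expired_at + max_stale
```

An operator who prefers the 1 hour limit mentioned above would pass max_stale=3600; one who keeps the draft's 7 days would pass max_stale=7 * DAY.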
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
On Tue, Mar 26, 2019 at 12:48 PM Tony Finch wrote:
>
> Dave Lawrence wrote:
> > Ray Bellis writes:
> > > Serve stale must not become a vector whereby malware can keep its C&C
> > > systems artificially alive even if the parent has removed the C&C domain
> > > name.
> >
> > I wholeheartedly agree with this ideal, and am very open to
> > considering text improvements.
>
> I think the suggested max stale timer of 7 days is excessive. The aim is
> to cope with an outage, so I think 1 day is much more reasonable (though I
> have configured my servers with a 1 hour limit).
>

I agree. At least based on my own experience, all the network or other downtime issues I have experienced last only a few minutes. If there is a downtime longer than that and it's only affecting DNS, I would seriously consider changing my service providers and vendors, whatever is the issue.
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
Dave Lawrence wrote:
> Ray Bellis writes:
> > Serve stale must not become a vector whereby malware can keep its C&C
> > systems artificially alive even if the parent has removed the C&C domain
> > name.
>
> I wholeheartedly agree with this ideal, and am very open to
> considering text improvements.

I think the suggested max stale timer of 7 days is excessive. The aim is to cope with an outage, so I think 1 day is much more reasonable (though I have configured my servers with a 1 hour limit).

Tony.
--
f.anthony.n.finch http://dotat.at/
Shetland Isles: West or southwest 5 or 6, decreasing 3 or 4 for a time, occasionally 7 at first. Moderate in sheltered east, otherwise rough, occasionally very rough at first in west. Occasional rain. Moderate or good, occasionally poor.
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
Puneet Sood wrote on 2019-03-25 08:07:
> Hi Paul,
>
> On Sun, Mar 24, 2019 at 12:37 PM Paul Vixie wrote:
>> i object to serve-stale as proposed. my objection is fundamental and goes to
>> the semantics. no editorial change would resolve the problem.
>>
>> i would withdraw that objection if this draft incorporates section 2 of
>> https://tools.ietf.org/html/draft-vixie-dnsext-resimprove-00, to wit:
>
> I went back and read the discussion on this draft and I could not find
> consensus on adopting it at that time.

no one understood it. there was no reply, positive or negative, to my proposal that the WG adopt it. it dropped like a brick, in silence. i believe that the issues are, now, nine years later, better understood.

> ...
> I do not think adding it to the serve-stale draft will make the path to
> adoption easier for either the serve-stale draft or these recommendations.

understood. we disagree, but, i understand your position.

> https://tools.ietf.org/html/draft-ietf-dnsop-serve-stale-04 section 5
> (page 6, paragraph 2) talks about refreshing the delegation. Quote:
> +   When no authorities are able to be reached during a resolution
> +   attempt, the resolver SHOULD attempt to refresh the delegation and
> +   restart the iterative lookup process with the remaining time on the
> +   query resolution timer. This resumption should be done only once
> +   during one resolution effort.
>
> Maybe there are fewer, specific bits from your draft which would be
> appropriate in the serve-stale context? Would you be willing to discuss
> those?

yes, certainly. the text i quoted from the 2010 resimprove draft was the result of implementation experience, and touches on necessary details, which are at the moment missing from serve-stale. i'll be very happy to discuss these, though regrettably, i could only be in prague for sunday, and so i'll miss any side meetings or working group discussions this time. no disrespect is intended by my absence, i had other obligations.

--
P Vixie
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
Ray Bellis writes:
> Serve stale must not become a vector whereby malware can keep its C&C
> systems artificially alive even if the parent has removed the C&C domain
> name.

I wholeheartedly agree with this ideal, and am very open to considering text improvements.

The document already has guidance on this point, but it is admittedly in a considerations section and not in standards action, and is a weaker "SHOULD" versus "MUST" right now. Would the WG prefer that a line like this be put into the Standards Action section?

    When no authorities for a name are able to be reached, the resolver MUST attempt to refresh the delegation.

I like the basic idea but am a little stuck on the wording because of the endless loop it implies. This is probably why it appears as "SHOULD" already (but I honestly don't remember, so there's that).

It seems to me that the risk is very low, even as currently written in the draft. Not only do I have a lot of confidence in the implementers of the most widely used resolvers in the world, as they're right here participating too and have in the past shown good conscientiousness in this area, but the practical attack is still hard to make meaningful.

If "the parent has removed the C&C domain name" as you said, serve-stale shouldn't even kick in. NXDOMAIN, problem solved. Various other scenarios come to mind, even with obstinate parents that won't remove the delegation and the zone's NSs have gone dark, but those scenarios have other possible remedies. And fast flux using long TTL NS RRsets is an issue no matter whether serve-stale is in play or not.

So text regarding refreshing delegations could be given even more words to describe backoff intervals and such, but to what end? What's the scenario? And wouldn't it be handled better by reviving resimprove to talk about the generalized problem?

(To be clear, I'm quite okay with politely being shown that I'm wrong and there is a vector by which serve-stale becomes uniquely interesting, and would certainly endeavour to make sure it is addressed.)
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
bert hubert writes:
> I too object. This is partially due to the apparently unresolved IPR issue
> from Akamai, who are known not to be shy asserting their IPR.

This is definitely a problem. Even though Akamai had previously agreed to specify under what IETF-acceptable terms the IPR would be made available, it clearly hasn't yet specified them. I've contacted them to get a timeline on when the legal department can take care of this, and the first order response is that the DNS team is trying to get the ball rolling again with legal this week.

> My secondary objection is that the draft contains an example
> implementation that then however uses normative words. This leads to
> pain with operators demanding serve-stale compliance. Example
> algorithms should clearly be examples and not tell us what we SHOULD
> do.

As previously noted in this very thread, at least one of the authors, Puneet, agrees with you.

When I wrote the text that way it was because of the also not-unreasonable viewpoint that if someone were to be implementing the example then the text could be considered normative as to how to do that. It's even softened by having no MUSTs at all, just SHOULD. In addition, I'm dubious as to the claim that people would cause meaningful pain to demand compliance with an example, and not be adequately refuted when it is pointed out to them that it is clearly marked as an example.

That said, since I had waffled on it myself at time of composition and I don't actually have a very strong feeling about it, in the end I wouldn't fight over downcasing it. Yet I think it should be settled at a level above dnsop because ultimately it's an issue that should be consistent across IETF documentation. Unless there's already an explicitly stated IETF policy about this, and not just ad hoc past cases to point to, I think it is best to sort out with the RFC editor.
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
Paul Vixie writes:
> i would withdraw that objection if this draft incorporates section 2 of
> https://tools.ietf.org/html/draft-vixie-dnsext-resimprove-00, to wit:

I always liked resimprove. Warren and I were talking about it, and if you would like we'd be quite happy to pick it up and get it moving in dnsop.

This document already has text, however, for refreshing the delegation and I don't believe it really needs too much detail as to what that means. "Delegation Revalidation Upon NS RRset Expiry" is an issue orthogonal to serve-stale, and in fact most often applies when serve-stale isn't even being triggered; it's a regular occurrence under normal operations. Ballooning the standards action section here to nearly three times its existing size is unnecessary bloat.

Please let us know if you'd like us to take up the charge for resimprove.
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
On Mon, Mar 25, 2019 at 04:30:01PM +0100, Ray Bellis wrote:
>
>
> On 25/03/2019 16:08, Puneet Sood wrote:
>
> > you mean lots of changes or lots of agreement with the quoted text?
> > They mean very different things.
>
> I was agreeing with the quoted text - I believe that any serving of
> stale records must be predicated on the presence of a valid delegation
> from the parent zone.
>
> Serve stale must not become a vector whereby malware can keep its C&C
> systems artificially alive even if the parent has removed the C&C domain
> name.

+1

Fred
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
On 25/03/2019 16:08, Puneet Sood wrote:
> you mean lots of changes or lots of agreement with the quoted text?
> They mean very different things.

I was agreeing with the quoted text - I believe that any serving of stale records must be predicated on the presence of a valid delegation from the parent zone.

Serve stale must not become a vector whereby malware can keep its C&C systems artificially alive even if the parent has removed the C&C domain name.

Ray
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
On 24/03/2019 12:36, Paul Vixie wrote:
> in other words, we'd be negotiating for the right to re-interpret
> existing signaling (the authority's TTL no longer purely governs the
> data's lifetime) by insisting that the parent zone's delegating TTL be
> given absolute power for revocation.

+lots

Ray
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
On Sun, Mar 24, 2019 at 04:36:50AM -0700, Paul Vixie wrote:
> i object to serve-stale as proposed. my objection is fundamental and goes to
> the semantics. no editorial change would resolve the problem.

I too object. This is partially due to the apparently unresolved IPR issue from Akamai, who are known not to be shy asserting their IPR.

https://datatracker.ietf.org/ipr/3014/ notes an Akamai IPR claim and does not yet provide a license suitable for use on an open internet.

https://patents.google.com/patent/US8583801B2/en & https://en.wikipedia.org/wiki/Akamai_Techs.,_Inc._v._Limelight_Networks,_Inc. have some context.

The mechanics are that once something is an RFC, operators require adherence to it. This in turn is a boon for the IPR holder, and hurts everyone else.

My secondary objection is that the draft contains an example implementation that then however uses normative words. This leads to pain with operators demanding serve-stale compliance. Example algorithms should clearly be examples and not tell us what we SHOULD do.

Bert
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
i object to serve-stale as proposed. my objection is fundamental and goes to the semantics. no editorial change would resolve the problem.

i would withdraw that objection if this draft incorporates section 2 of https://tools.ietf.org/html/draft-vixie-dnsext-resimprove-00, to wit:

2. Delegation Revalidation Upon NS RRSet Expiry

2.1. Because the delegating NS RRset at the bottom of the parent zone and the apex NS RRset in the child zone are unsynchronized, the TTL of the parent's delegating NS RRset is meaningless. A child zone's apex NS RRset is authoritative and thus has a higher cache credibility than the parent's delegating NS RRset, so, the NS RRset "below the cut" immediately replaces the parent's delegating NS RRset in cache when an iterative caching DNS resolver crosses a zone cut.

2.2. The lowest TTL found in a parent zone's delegating NS RRset should be stored in the cache and used to trigger delegation revalidation as follows. Whenever a cached RRset is being considered for use in a response, the cache should be walked upward toward the root, looking for expired delegations. At the first expired delegation encountered while walking upward toward the root, revalidation should be triggered, putting the processing of dependent queries on hold until validation is complete.

2.3. To revalidate a delegation, the iterative caching DNS resolver will forward the query that triggered revalidation to the nameservers at the closest enclosing zone cut above the revalidation point. While searching for these nameservers, additional revalidations may occur, perhaps placing an entire chain of dependent queries on hold, unwinding in downward order as revalidations closer to the root must be complete before revalidations further from the root can begin.

2.4. If a delegation can be revalidated at the same node, then the old apex NS RRset should be deleted from cache and then the new delegating NS RRset should be stored in cache. The minimum TTL from the new delegating NS RRset should also be stored in cache to facilitate future revalidations. This order of operations ensures that the RRset credibility rules do not prevent the new delegating NS RRset from entering the cache. It is expected that the child's apex NS RRset will rapidly replace the parent's delegating NS RRset as soon as iteration restarts after the revalidation event.

2.5. If the new delegating NS RRset cannot be found (RCODE=NXDOMAIN) or if there is a new zone cut at some different level of the hierarchy (insertion or deletion of a delegation point above the revalidation point) or if the new RRset shares no nameserver names in common with the old one (indicating some kind of redelegation, which is rare) then the cache should be purged of all names and RRsets at or below the revalidation point. This facilitates redelegation or revocation of a zone by a parent zone administrator, and also conserves cache storage by deleting unreachable data.

2.6. To make the timing of a revalidation event unpredictable from the point of view of a potential cache-spoof attacker, the parent's delegating NS RRset TTL should be reduced by a random fraction of its value before being stored for use in revalidation activities.

in other words, we'd be negotiating for the right to re-interpret existing signaling (the authority's TTL no longer purely governs the data's lifetime) by insisting that the parent zone's delegating TTL be given absolute power for revocation.

vixie
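As a reading aid for sections 2.2 and 2.6 quoted above, the upward cache walk and the randomized TTL can be sketched roughly as follows. This is an illustration under stated assumptions, not code from the resimprove draft or from any resolver: the function names and the revalidate_at mapping (zone cut name to the time at which its delegating NS RRset TTL runs out) are hypothetical.

```python
import random


def parent_names(name):
    """Yield the name itself, then each ancestor, ending at the root '.'."""
    labels = name.rstrip(".").split(".") if name != "." else []
    for i in range(len(labels)):
        yield ".".join(labels[i:]) + "."
    yield "."


def jittered_ttl(ttl, rng=random):
    """Section 2.6: reduce the delegating TTL by a random fraction of its
    value so the revalidation time is hard for an attacker to predict."""
    return ttl - int(ttl * rng.random())


def first_expired_delegation(name, revalidate_at, now):
    """Section 2.2: walk upward toward the root and return the first zone
    cut whose stored revalidation deadline has passed, or None.
    revalidate_at is a hypothetical dict of zone cut -> deadline."""
    for zone in parent_names(name):
        deadline = revalidate_at.get(zone)
        if deadline is not None and deadline <= now:
            return zone
    return None
```

A caller would hold dependent queries while the returned zone cut is revalidated, and on success store now + jittered_ttl(new_ttl) as the next deadline. The purge rules of section 2.5 are not modelled here.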
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
Hi,

On 08/03/2019 21:29, Dave Lawrence wrote:
> Huh, my understanding from a hallway conversation with Benno was that
> the immediate response is only sent for names that would have been
> subject to pre-fetching, such that the immediate response in this case
> is sufficiently covered under the guidance of a recent attempt being
> made. If that is not the case, and you can get stale answers from
> Unbound even without a recent refresh attempt, then I personally think
> that is an error in Unbound and not this document.

The current implementation of serve stale in Unbound is closely tied to the pre-fetching process. It works well for most cases, that is, names that are frequently queried, where the pre-fetch ensures fresh and correct entries in the cache. For names with relatively short TTLs that are not frequently queried (i.e. less frequently than the TTL covers), the entry in the cache is stale, and only after serving the reply is a pre-fetch (resolve) initiated to refresh the entry in the cache.

We acknowledge this behavior is not optimal in some situations and will reimplement part of the refresh strategy for cache entries.

-- Benno

--
Benno J. Overeinder
NLnet Labs
https://www.nlnetlabs.nl/
Re: [DNSOP] comments on draft-ietf-dnsop-serve-stale-03
Thank you very much for the feedback, Jinmei. Combined with previous changes we made following the other messages on the draft we expect to republish it before the Monday IETF 104 submission deadline, after one last review by all of the co-authors.

Jinmei:
>> The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is
>> amended to read:
>>
>>    TTL [...] If the authority for the data is unavailable
>>    when attempting to refresh, the record MAY be used as though it is
>>    unexpired.
>
> On understanding that this is the only real normative description,
> I'd suggest making some more points explicit to prevent abusing of
> this leniency:
> - explicitly say "all authoritative servers" instead of just "the
>   authority"
> - also explicitly note that this MUST NOT be allowed if at least one
>   authoritative server is available

I can see how the phrasing of "if the authority" could imply to a novice in the DNS space that only one server would be tried, so I updated the wording to:

    If the data is unable to be authoritatively refreshed when the TTL expires, the record MAY be used as though it is unexpired.

I think this phrasing is sufficient without needing to explicitly say all servers and must not if at least one responds. Existing resolver implementations are extremely thorough in trying to get authoritative answers and it is extremely hard to imagine that anyone would take this draft to mean that they should be any less thorough.

> - clarify whether this means a 0-TTL record can be cached and reused
>   under this condition (I assume it must not, but it's not very clear
>   to me)

Added this to the caveats section:

    The continuing prohibition against using data with a 0 second TTL beyond the current transaction explicitly extends to it being unusable even for stale fallback, as it is not to be cached at all.
>> If it finds no relevant unexpired data and the Recursion Desired
>> flag is not set in the request, it SHOULD immediately return the
>> response without consulting the cache for expired records.
> It would be nice if it clarified *what* to return in this case (if
> it's intentionally left open, explicitly say so).

Added:

    Typically this response would be a referral to authoritative nameservers covering the zone, but the specifics are implementation dependent.

I was surprised to discover when testing against BIND 9.12 (without serve-stale in play) that dig +norec for an unknown example.com name gave a referral to com, even when it knew the NS for example.com either via the parent delegation or even from the apex.

>> Outside the period of the resolution recheck timer, the resolver
>> SHOULD start the query resolution timer and begin the iterative
>> resolution process.
>
> It's not clear to me how this timer is related to the 'serve-stale'
> behavior; [...]

I think its main utility in the example method is to emphasize that even if you send a stale answer to the client while a lengthy resolution attempt is still playing out, you've got to keep trying. Admittedly capping the work of that lengthy attempt is not specifically relevant, but as you noted this is an example. I can see your point about possibly simplifying by removing a few sentences related to it, but as I also think that capping work is an important aspect of resiliency I'm inclined to leave it in.

> this draft doesn't explain what happens when this timer
> expires, for example.

Based on "This timer bounds the work done by the resolver when contacting external authorities" I'd have thought it was implicitly clear, but I have added:

    If this timer expires on an attempted lookup that is still being processed, the resolution effort is abandoned.
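The interaction between the query resolution timer, stale fallback, and the 0-TTL caveat can be sketched as below. This is a simplified, synchronous illustration, not the draft's example method or any implementation's code: answer, ResolutionFailed, and the resolve callback are hypothetical names, the 10 second timeout is an arbitrary placeholder, and the max stale timer and the client response timer are left out for brevity.

```python
import time


class ResolutionFailed(Exception):
    """Raised by the hypothetical resolve callback when no authority answers
    before the deadline, i.e. the resolution effort is abandoned."""


def answer(name, cache, resolve, now=time.monotonic, resolution_timeout=10.0):
    """Return (rrset, 'fresh' or 'stale') for name.

    cache maps name -> (rrset, expires_at). resolve(name, deadline=...) is
    assumed to return (rrset, ttl) or raise ResolutionFailed.
    """
    entry = cache.get(name)
    started = now()
    if entry is not None and started < entry[1]:
        return entry[0], "fresh"  # unexpired data, no refresh needed
    try:
        # The query resolution timer bounds the work done on this lookup.
        rrset, ttl = resolve(name, deadline=started + resolution_timeout)
    except ResolutionFailed:
        if entry is not None:
            return entry[0], "stale"  # refresh failed, fall back to stale data
        raise
    if ttl > 0:
        cache[name] = (rrset, now() + ttl)
    else:
        # 0-TTL data is not cached at all, so it can never be served stale.
        cache.pop(name, None)
    return rrset, "fresh"
```

The design point the sketch tries to show is ordering: stale data is consulted only after a bounded refresh attempt has failed, which matches the "stale data is used only when refreshing has failed" caveat discussed in this thread.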
> Also, in my understanding unbound doesn't have this timer - it
> eventually gives up a resolution if all possible external queries
> fail with a per-query timeout, but it doesn't cap the overall
> resolution time.

Interesting. I know of an Unbound-derived server that definitely caps work, though that may have been local changes and not incorporated into mainline. Tarpitting was a significant issue for the people involved.

>> Stale data is used only when refreshing has failed, in order to
>> adhere to the original intent of the design of the DNS and the
>> behaviour expected by operators.
>
> I agree on this statement, but I wonder how widely this behavior is
> actually implemented. As noted in Section 7, unbound doesn't behave
> this way, and in my understanding it's intentional, mainly due to
> a concern about related IPR.

Huh, my understanding from a hallway conversation with Benno was that the immediate response is only sent for names that would have been subject to pre-fetching, such that the immediate response in this case is sufficiently covered under the guidance of a recent attempt being made. If that is not the case, and you can get stale answers from
[DNSOP] comments on draft-ietf-dnsop-serve-stale-03
I've read draft-ietf-dnsop-serve-stale-03. In addition to the high-level draft organization matter I mentioned in another thread, here are my other comments on this version:

- Section 4:

    The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is amended to read:

    TTL [...] If the authority for the data is unavailable when attempting to refresh, the record MAY be used as though it is unexpired.

  On understanding that this is the only real normative description, I'd suggest making some more points explicit to prevent abusing of this leniency:
  - explicitly say "all authoritative servers" instead of just "the authority"
  - also explicitly note that this MUST NOT be allowed if at least one authoritative server is available
  - clarify whether this means a 0-TTL record can be cached and reused under this condition (I assume it must not, but it's not very clear to me)

- Section 5

    If it finds no relevant unexpired data and the Recursion Desired flag is not set in the request, it SHOULD immediately return the response without consulting the cache for expired records.

  It would be nice if it clarified *what* to return in this case (if it's intentionally left open, explicitly say so).

- Section 5

    Outside the period of the resolution recheck timer, the resolver SHOULD start the query resolution timer and begin the iterative resolution process.

  It's not clear to me how this timer is related to the 'serve-stale' behavior; this draft doesn't explain what happens when this timer expires, for example. Also, in my understanding unbound doesn't have this timer - it eventually gives up a resolution if all possible external queries fail with a per-query timeout, but it doesn't cap the overall resolution time. That may not matter much as this section doesn't seem to be normative and it's just an implementation detail of a particular implementation, but if the role of this timer doesn't matter either, we might rather simplify the text by just omitting it.
- Section 6

    Stale data is used only when refreshing has failed, in order to adhere to the original intent of the design of the DNS and the behaviour expected by operators.

  I agree on this statement, but I wonder how widely this behavior is actually implemented. As noted in Section 7, unbound doesn't behave this way, and in my understanding it's intentional, mainly due to a concern about related IPR. If that's more common for other open source implementors (BIND 9 seems to work as described here, but I don't know about others), the description won't match the actual implementation behavior very well in reality. So I'm curious about implementation status about this point, and if many different implementations intentionally ignore this "caveat" for the same reason, I think we should adjust the text to match the reality.

- Section 7

    Unbound has a similar feature for serving stale answers, but will respond with stale data immediately if it has recently tried and failed to refresh the answer by pre-fetching.

  If I understand the implementation correctly, this is not 100% accurate: unbound always returns the stale data if it's found in the cache as long as the "serve-expired" option is enabled. So I suggest revising the text to:

    Unbound has a similar feature for serving stale answers, but will respond with stale data immediately whenever the feature is enabled.

--
JINMEI, Tatuya