Re: [Pdns-dev] PDNS Recursor functionality request re:SERVFAIL outages of today

John Todd Fri, 21 Oct 2016 22:13:03 -0700


On 21 Oct 2016, at 18:48, Greg Owen wrote:

On 2016-10-21 19:53, John Todd wrote:
I’d like to propose an extension to PowerDNS Recursor for mitigating
(partially) events like we had today where major authoritative
nameservers were put out of commission. This might be a particularly
foolish or error-prone method - it only took me a few minutes to think up. But I’d at least like to hear a discussion as to why this isn’t a good idea. The comment of “But this might end up giving out the wrong answer!” is true, but I view a wrong answer as better than no answer.
...
If that query fails due to a SERVFAIL, then the TTL timer on this
“old” record is set back to zero and the “old” record is provided as
a response. If an authoritative server is marked as “down” due to
repeated SERVFAIL responses (see packetcache-servfail-ttl) then the
“old” record is handed back immediately without a new query attempt,
and the TTL timer is set back to zero to keep the answer in a state
of perpetual validity as long as....
There are security concerns to doing this. Most simply, a wrong answer is worse than no answer if the "wrong" answer is a maliciously sourced record.
Consider the two following cases:
1) The attacker poisons the records for a zone - either indirectly, or via compromise of the actual authoritative servers - and then takes the actual servers down hard, causing SERVFAIL until the owners either fix the servers, weather the DDoS, or redirect the root NS records.
2) The attacker poisons the records directly via compromise of the actual authoritative servers, and the owner takes the servers down until they can be replaced with clean, secured versions.
In these two cases, the measure you're proposing would persist the malicious entries past their expiration and for the duration of the attack's effectiveness on the authoritative servers.
Even if your measure is triggered manually - in today's event, for example, one says "Gosh, I know records are offline because of a Dyn DDoS, so I know I can compensate by saving records, throw the switch!" - let's say that someone DDoSed *ALL* of Dyn after poisoning records for a single zone. You'd have no way of knowing - until the incident is over and forensic analysis has hopefully caught that nuance - that you were doing the attacker's work for them.
These attack vectors are not without precedent. So-called "Dark DDoS" attacks have been used to distract and mislead defenders, providing a smoke screen for other more direct attacks:
http://www.infosecurity-magazine.com/opinions/dark-ddos-growing-cyber-security/
So, in short, your proposal has the caveat that it may extend the damage from an attack in more pernicious ways than simple denial of service. (I'd rather not get to my bank than get to an impostor posing as my bank!)
...
I agree it's worth putting some thought into how to increase redundancy and flexibility to compensate for these infrastructure attacks. For example, perhaps taking your idea but only applying it to signed DNSSEC records which have slightly higher data integrity? It's definitely worth exploring, but let's be careful of known and reasonable ways attackers could take advantage of this compensation.
Thx,
gowen

--
    gowen - Greg Owen - go...@swynwyr.com
    CISSP, GCIA, GCFA, GWAPT

I agree with your caveats to a degree, but I can only imagine the results being “worse” in a few edge cases of not-as-clever attacks.

In both the first and second case you describe above, would it not be the case that a sophisticated attacker would give an unusually large TTL to the poisoned record in order to avoid repair attempts? A TTL can be (if I’m reading the RFC correctly) 68 years. I would expect poisoned entries to be at least 3600 seconds (which is also the default value in packetcache-ttl) if not significantly more, but I can’t say I’ve ever paid attention to that number when looking at forensic data online, and perhaps I overestimate the baseline sophistication of attackers - but I don’t think so.
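
As a rough check of the 68-year figure: RFC 2181 (section 8) caps a TTL at 2^31 - 1 seconds. The few lines of Python below only do that arithmetic; the function name is made up for illustration and the year length is an approximation.

```python
# Maximum DNS TTL per RFC 2181 section 8: 2**31 - 1 seconds.
MAX_TTL_SECONDS = 2**31 - 1

# Julian year (365.25 days); an approximation chosen for this sketch.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def ttl_in_years(ttl_seconds: int) -> float:
    """Convert a TTL in seconds to (approximate) years."""
    return ttl_seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"max TTL is about {ttl_in_years(MAX_TTL_SECONDS):.1f} years")
    print(f"3600 s is about {ttl_in_years(3600) * 365.25 * 24:.1f} hours")
```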

Of course, we’re trying via a number of other methods to eliminate cache poisoning, so that’s a first step on your case #1. DNSSEC is the best method I can see at the moment for this, so at a minimum it does seem that this extended TTL timer would work with reasonable expected “good” results on those records as you suggest, but I don’t think it should be limited to just DNSSEC-secured records. In case #2, I can’t imagine that a domain operator would have their servers offline intentionally for longer than the TTL of the poisoned record - are there instances where nameservers are down for several hours intentionally after a breach? In that case, there’s at least an hour of “bad” data infecting various recursive servers, and I imagine whatever damage that is to happen is significantly done after an hour, and one would hope that alternate methods (SSL, or DANE!) would provide an additional layer of security. I am not suggesting that no additional damage will be done during the TTL extension period, but that the cases where that occurs are few and the benefit of operational continuity during authoritative server outages outweighs the risk of longer-duration failure modes.

My assertion is: given an attacker with even the most modestly intelligent attack method, I would expect the long indefinite extension of TTL in the case of SERVFAIL will probably result in conditions not significantly worse for end users than if the SERVFAIL TTL extension method were not used, even in conditions where a poisoned record is inserted into the cache. No _new_ failure modes or results are being introduced by this method.

Implementing the timer, of course, could also be an optional method with a default of “0”, giving the recursor operator the flexibility for enabling/disabling given the requirements of their user community. I can also imagine an additional counter option on this method which limits the maximum number of times a TTL may be overridden on a record, or an ultimate maximum TTL. There may be other more complex ways of allowing a domain operator to signal behavior in SERVFAIL conditions for a particular zone or record, such as TXT tags or possibly SRV record types, but they imply lookup and caching before a SERVFAIL condition which has a slew of unsatisfactory traffic and conditional state-keeping issues and possible self-referential loop failures, and chances of widespread adoption in a reasonable timeframe are fairly low, though I’m sure it would make for an interesting IETF sub-track. A TTL on TTL? Ugh. Keeping this simple seems to be the best way forward.
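
To make the timer and counter idea concrete, here is a minimal Python sketch. It is not PowerDNS Recursor code: the class, the ServFail exception, the max_overrides knob and the resolve callback are all hypothetical names invented for illustration, and it only models the "restart the TTL on SERVFAIL, up to N times" behavior, with the default of 0 meaning the feature is off.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


class ServFail(Exception):
    """Hypothetical: raised when the upstream lookup ends in SERVFAIL."""


@dataclass
class Entry:
    value: str
    ttl: int            # seconds, as returned by the authoritative server
    stored_at: float    # clock reading when the TTL timer was last (re)started
    overrides: int = 0  # how many times the TTL has been restarted on SERVFAIL


class StaleOnServFailCache:
    """Cache that may keep serving an expired record while upstream SERVFAILs."""

    def __init__(self, resolve: Callable[[str], Tuple[str, int]],
                 max_overrides: int = 0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.resolve = resolve              # hypothetical upstream lookup
        self.max_overrides = max_overrides  # 0 disables the behavior entirely
        self.clock = clock
        self.entries: Dict[str, Entry] = {}

    def lookup(self, name: str) -> str:
        now = self.clock()
        entry = self.entries.get(name)
        if entry is not None and now - entry.stored_at < entry.ttl:
            return entry.value  # still within its TTL, no query needed
        try:
            value, ttl = self.resolve(name)
        except ServFail:
            if entry is not None and entry.overrides < self.max_overrides:
                # Restart the TTL timer and hand back the "old" record.
                entry.overrides += 1
                entry.stored_at = now
                return entry.value
            self.entries.pop(name, None)  # nothing usable left, give up
            raise
        # A successful answer replaces the entry and resets the counter.
        self.entries[name] = Entry(value, ttl, now)
        return value
```

An "ultimate maximum TTL" could be modeled the same way by comparing the current time against the moment the record was first stored instead of counting overrides. Restricting the override to DNSSEC-validated records, as suggested earlier in the thread, would be one extra condition in the except branch.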

I believe putting this timer override method in the resolver is the fastest way to give local resiliency to resolvers faced with authoritative server outages.

JT

_______________________________________________
Pdns-dev mailing list
Pdns-dev@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-dev
