Hi André,

I understand your concerns, but you can still set rrsig-refresh explicitly.

In our experience, DNS deployments are very diverse. So what is the right
default value?

Regards,
Daniel

On 8/31/22 13:53, André Keller wrote:
Hi Libor,

On 31.08.22 13:14, libor.peltan wrote:
The actual purpose of rrsig-refresh is to refresh RRSIGs early enough that
they don't expire due to delays in:

1) propagation among authoritative servers, that is, synchronization of
secondaries with primaries, including e.g. the lengthy process of signing
itself (in the case of a huge zone)
2) propagation to resolvers' caches

When I thought about this, I actually saw that (1) is exactly propagation-delay 
and (2) is exactly the RRSIG's TTL. Setting rrsig-refresh default to the sum of 
both values was a logical conclusion.
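
As a rough illustration of the default described above, here is a minimal
Python sketch. The function name and the one-hour RRSIG TTL are assumptions
made for the example; only the one-hour propagation-delay default is mentioned
in this thread.

```python
from datetime import timedelta


def default_rrsig_refresh(propagation_delay: timedelta, rrsig_ttl: timedelta) -> timedelta:
    """Default as described in the thread: propagation-delay plus the RRSIG's TTL."""
    return propagation_delay + rrsig_ttl


# One-hour propagation-delay default (from the thread); the TTL is a made-up value.
propagation_delay = timedelta(hours=1)
rrsig_ttl = timedelta(hours=1)

print(default_rrsig_refresh(propagation_delay, rrsig_ttl))  # 2:00:00
```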

I see, I guess from a knot standpoint this makes sense, but I feel this does 
not take into account any potential delays that are caused by operational 
issues.
To paint a very simplified picture of our own architecture:
* We use Puppet configuration management to create/maintain the knot
configuration on all involved servers
* We have a hidden primary, that holds the zonedata and does the signing
* We have public secondaries that sync these zones via the normal 
TSIG/AXFR/IXFR protocol
* Zonedata updates are pushed to the hidden primary via a CI pipeline

So for the actual "public" facing service, only the secondaries are relevant as long as we do not need to change zonedata. That also means the hidden primary has no redundancy built in. If it breaks, we simply redeploy it with Puppet, rerun the pipeline, and we are up and running again. However, this takes time, depending on when the outage occurs. So having signatures refreshed well before they expire gives us some headroom during which the secondaries can serve the current zonedata without depending on the primary.

Another issue I can think of is temporary network problems between the
primary and the secondaries.


I'd say that the setting of propagation-delay is still in your hands, as is setting a non-default rrsig-refresh. The only disadvantage of an rrsig-refresh that is too high is that zone signing takes place more often and creates larger change-sets to be propagated to secondaries. In other words, it uses more of all resources (CPU, memory, disk, network).


For our deployment this is not really a concern. We do not have huge zones, we 
just have many of them. Also, they are mostly static. So signing performance 
was never an issue until now.

I would probably prefer to set a higher rrsig-refresh rather than increase propagation-delay, as it seems clearer to me what it does. Propagation delay for me is the time it takes during normal operations for all primaries and secondaries to be in sync, plus some margin to account for caching on resolvers. On top of that I'd like to have some sort of safety margin against operational issues, so setting rrsig-refresh explicitly is probably how we will handle this in the future.
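
To make the safety-margin idea concrete, the following sketch estimates how
long the hidden primary could be unavailable before signatures risk expiring in
resolver caches. It assumes a simplified model in which signatures are
refreshed once their remaining validity drops below rrsig-refresh; the function
name and all values other than the one-hour propagation-delay are hypothetical.

```python
from datetime import timedelta


def outage_headroom(rrsig_refresh: timedelta,
                    propagation_delay: timedelta,
                    rrsig_ttl: timedelta) -> timedelta:
    """Rough worst-case time the primary may be down (simplified model).

    Worst case: the primary fails just before a refresh is due, so the
    signatures on the secondaries have about rrsig_refresh validity left.
    New signatures must still reach the secondaries (propagation_delay)
    and old ones must age out of resolver caches (rrsig_ttl) before expiry.
    """
    headroom = rrsig_refresh - propagation_delay - rrsig_ttl
    return max(headroom, timedelta(0))


propagation_delay = timedelta(hours=1)  # default mentioned in the thread
rrsig_ttl = timedelta(hours=1)          # assumed value

# With rrsig-refresh set to the sum of both, no margin remains for outages.
print(outage_headroom(propagation_delay + rrsig_ttl, propagation_delay, rrsig_ttl))  # 0:00:00

# With an explicit, larger rrsig-refresh (hypothetical 7 days):
print(outage_headroom(timedelta(days=7), propagation_delay, rrsig_ttl))  # 6 days, 22:00:00
```

Under these assumptions, whatever is added to rrsig-refresh beyond
propagation-delay plus TTL is roughly the time available to redeploy the
primary.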


This all makes me wonder whether the one-hour default of propagation-delay is
perhaps not optimal...?
Please let me know your ideas/opinions in more detail. Any real operational 
experience is very very valuable for us!


As already said, at least to me, propagation-delay is not something I would associate with operational issues. I would expect all my primaries and secondaries to be in sync during normal operation well within the default of one hour.

I guess choosing default values is always hard, and I do not have an issue with making our configuration more explicit to cover our specific use case. I just wish this change in behavior, which at least for us is quite significant, had been made a bit more obvious in the changelog. It caught us by surprise :)


Regards
André
