Geoff,

I'm happy to accept that the new wording is poor, but I'm pretty sure the old wording was also bad, and I think this discussion is important.

The old wording could easily be interpreted to suggest that once per day was the correct frequency for pulling from a repository. (That is, I believe the previous version was making a de facto recommendation for a default behavior of one pull every 24 hours ... there wasn't a RECOMMEND in the text, but we all know that examples tend to be normative in this type of document.)

1) So the first implicit question is: Should the working group be making a recommendation as to the frequency with which a relying party pulls from the repository?

Or equivalently: Is there a "wrong" frequency that people might use if we didn't give them any guidance?

It seems that retrieving updates "too frequently" (e.g., every 5 minutes) strains the repository system and that retrieving updates "too infrequently" (e.g., monthly) means that when I inject a new ROA into the system, it will take "unacceptably long" for this information to propagate to the relying parties that make use of this information. Therefore, we should have text in the document that articulates some middle ground that we believe is reasonable for the Internet. (I make no claims that the current text in the document achieves this goal.)
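
To make the trade-off concrete, here is a quick back-of-the-envelope sketch in Python, purely for illustration (the 30,000 relying-party figure is just the order-of-magnitude assumption I use later in this message):

# Rough trade-off: aggregate query rate at the repository system vs.
# worst-case propagation delay for a new ROA, as a function of fetch interval.
# The relying-party count is an assumption for illustration only.

RELYING_PARTIES = 30_000

def tradeoff(interval_seconds):
    # If fetches are spread evenly, this is the aggregate query rate.
    queries_per_second = RELYING_PARTIES / interval_seconds
    # Worst case, a ROA published just after a fetch is not seen
    # by that relying party until its next fetch.
    max_propagation_hours = interval_seconds / 3600.0
    return queries_per_second, max_propagation_hours

for label, interval in [("5 minutes", 5 * 60),
                        ("3 hours", 3 * 3600),
                        ("24 hours", 24 * 3600),
                        ("1 month", 30 * 24 * 3600)]:
    qps, delay = tradeoff(interval)
    print(f"{label:>9}: ~{qps:.2f} queries/sec aggregate, "
          f"up to {delay:.1f} hours before a new ROA is seen")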

2) The second question is: If we make a recommendation regarding the frequency with which relying parties should pull updates, what frequency should we recommend?

Here, I understand that "everyone hitting the repository system at once" is a bad outcome regardless of the frequency that we recommend. That is, regardless of whether we recommend "once per day", "once per month", or "eight times daily", we will likely see problems with too much server load at midnight. If anyone can recommend text to avoid this phenomenon (i.e., to encourage people to spread out their queries to the repository system), please send text.
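
As a strawman for the kind of behavior such text might encourage, randomized jitter on the fetch schedule would do the job; a minimal sketch (the interval and jitter fraction below are placeholders, not a proposal):

import random

def next_fetch_delay(base_interval_seconds, jitter_fraction=0.25):
    """Return a delay with random jitter so relying parties don't all
    hit the repository system at the same clock chime."""
    jitter = base_interval_seconds * jitter_fraction
    return base_interval_seconds + random.uniform(-jitter, jitter)

# e.g., a nominal 3-hour cycle actually fires somewhere in [2.25h, 3.75h]
delay = next_fetch_delay(3 * 3600)
print(f"sleeping {delay / 3600:.2f} hours before the next repository fetch")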

I agree that there are roughly 30,000 AS numbers visible in BGP, so it's reasonable to assume on the order of 30,000 relying parties who will be routinely querying the repository system. We might also assume that 30,000 is a reasonable order of magnitude for the number of CAs in the RPKI (we might easily average 2 CAs per AS, but surely not 10 CAs per AS).

However, one thing that wasn't clear from reading your analysis was how many CAs a given repository server would be hosting. If a server run by a large ISP or an RIR were providing a cache of all RPKI data, then clients would have longer connections to that server (as they could retrieve much of the data they need in one place), but such a server would be unlikely to receive requests from all 30,000 relying parties (e.g., an ISP might provide a complete cache for its customers, but for non-customers it would typically only serve data for which it is authoritative). Alternatively, if a server is only serving data for a small subset of the CAs in the RPKI, then it might receive requests from all relying parties, but those sessions would tend to be short (especially when nothing has changed).

In any case, I believe the way forward (with regards to server load) is to answer the question, "How many simultaneous connections are reasonable for a server that hosts publication points for X CAs?" and then work backwards from there to determine if a given interval of relying party requests is reasonable from the server standpoint. I admit that I haven't completely thought through re-key, but I'll try to dig up some rough connection-time numbers based on our relying party software, and do a few back-of-the-envelope computations.
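
To show the shape of the computation I have in mind, here is a rough sketch; every number below is a placeholder assumption until I have real connection-time measurements:

# Work backwards from "how many simultaneous connections are reasonable for a
# server hosting publication points for X CAs?". All inputs are guesses.

RELYING_PARTIES = 30_000          # order-of-magnitude guess from the AS count
SESSION_SECONDS_NO_CHANGE = 30    # rsync sweep when nothing has changed (guess)
SESSION_SECONDS_REKEY = 180       # full re-download after a re-key (guess)

def avg_concurrent_sessions(fetch_interval_seconds, session_seconds):
    # With fetches spread evenly, arrival rate times session length gives the
    # average number of simultaneous sessions (Little's law).
    arrival_rate = RELYING_PARTIES / fetch_interval_seconds
    return arrival_rate * session_seconds

def min_interval_for_budget(max_concurrent, session_seconds):
    # Inverted: the shortest fetch interval that stays within a given
    # concurrency budget on the server.
    return RELYING_PARTIES * session_seconds / max_concurrent

for hours in (3, 8, 24):
    interval = hours * 3600
    print(f"{hours:>2}h interval: "
          f"~{avg_concurrent_sessions(interval, SESSION_SECONDS_NO_CHANGE):.0f} "
          f"concurrent no-change sessions, "
          f"~{avg_concurrent_sessions(interval, SESSION_SECONDS_REKEY):.0f} during re-key")

# The other direction: a budget of, say, 50 simultaneous no-change sessions
# would allow a fetch interval of roughly this many hours or longer:
print(f"{min_interval_for_budget(50, SESSION_SECONDS_NO_CHANGE) / 3600:.1f}")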

With regards to client load, I'm not convinced that there's any problem with frequent queries to the repository system. If the relying party queries a publication point and rsync determines that nothing has changed, then no changes are required to the relying party's local cache and no cryptographic calculations are required. If something has changed, then the relying party has to perform validation (which includes cryptographic signature verification) on the manifest and any new objects that have been added. (Additionally, there may be resulting changes to the client's local cache ... e.g., if a new CRL revokes a previously valid certificate ... but such changes don't require new cryptographic computations, and so I believe the bottleneck is going to be the one or two signature verifications per object changed [1]). Now the point from the relying party side is that if 5,000 manifests change and 10,000 signed objects are added to the repository system on a given day, then the relying party needs to do roughly 30,000 signature verifications regardless of whether it learns of all these changes at once, or whether it learns of them in small batches throughout the course of the day. Therefore, I don't see how making frequent checks for new data has a significant impact on the relying party's processing load.
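
To illustrate with the numbers above (all of which are purely illustrative):

# The relying party's daily crypto load depends on how much changed, not on
# how often it polls. Figures below are the illustrative ones from the text.

CHANGED_MANIFESTS = 5_000
NEW_OBJECTS = 10_000
VERIFICATIONS_PER_OBJECT = 2   # "one or two" per object; take the high end

total_objects = CHANGED_MANIFESTS + NEW_OBJECTS
total_verifications = total_objects * VERIFICATIONS_PER_OBJECT

for fetches_per_day in (1, 8, 288):   # daily, every 3 hours, every 5 minutes
    per_fetch = total_verifications / fetches_per_day
    print(f"{fetches_per_day:>3} fetches/day: ~{per_fetch:.0f} verifications "
          f"per fetch, {total_verifications} per day either way")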

Finally, in addition to server and relying party processing loads, one must also look at the benefit of frequent repository fetches. Keep in mind that a relying party has no way of distinguishing the following two events: (A) a route advertisement is originated by an AS that is authorized to advertise the route, but the relying party hasn't fetched recently enough to obtain the new ROA; and (B) a route advertisement is originated by an unauthorized entity that is attempting to hijack address space. In this discussion, it is also important to note that manifests can guarantee that the relying party received all signed objects that existed at the moment that the manifest was published (i.e., a manifest can detect malicious deletion of data from a repository or corruption of data in transit), but the manifest says nothing about data that may have been added since the manifest was issued. This is why there is benefit in a relying party going back to the publication point periodically to see whether a new manifest has been issued.
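
In rough pseudocode terms, the periodic check I have in mind looks something like this (all names and structures below are made up for illustration, not any particular relying-party implementation):

from collections import namedtuple

# Minimal stand-ins for real RPKI objects; just enough to show the shape of
# the periodic relying-party check at one publication point.
Manifest = namedtuple("Manifest", ["number", "file_hashes"])

def check_publication_point(new_manifest, cached_manifest, cached_hashes,
                            signature_valid):
    """Decide what work a relying party has to do after re-fetching a
    publication point's manifest. All inputs are stand-ins, not a real API."""
    if not signature_valid:
        return "invalid manifest -- treat publication point as suspect"
    if cached_manifest is not None and new_manifest.number <= cached_manifest.number:
        # Nothing new has been published since the last fetch; the manifest
        # guarantees we already hold everything it lists.
        return "up to date"
    # A newer manifest exists: it may list objects added since the old one
    # was issued, which the old manifest could not have told us about.
    missing = [h for h in new_manifest.file_hashes if h not in cached_hashes]
    return f"fetch and validate {len(missing)} new or changed objects"

# Example: one new ROA was published since our cached manifest number 41.
old = Manifest(41, {"roa1.hash", "crl.hash"})
new = Manifest(42, {"roa1.hash", "roa2.hash", "crl.hash"})
print(check_publication_point(new, old, old.file_hashes, signature_valid=True))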

In any case, it's good to know that we'll have plenty to talk about in Hiroshima.

- Matt Lepinski


Geoff Huston wrote:

WG Co-Chair Hat OFF

Hi Matt,



entities who are actually using RPKI data for routing SHOULD be fetching fresh data from the repositories at least once every three hours.


3 hours?

At a first pass that seems very frequent.

From a server's perspective, if there are 30,000 AS's out there and each is running a local cache and each is a distinct relying party of the RPKI system, then the local hit rate at the server would be 3 per second, assuming that all the relying parties evenly spread their load (which is a pretty wild assumption - the worst case is that all 30,000 attempt to resync at the 3 hour clock chime point). Assuming that a repository sweep with no updates takes 30 seconds to complete, the server would have an average load of some 90 concurrent sync sessions. If there is a local rekey then the refresh would also imply a reload of all the signed products at this repository publication point. Assuming that this would then take 3 minutes to download, the rekey load per server would be of the order of 540 concurrent rsync sessions as an average load. These load numbers appear to me to be somewhat large.

From the relying party's perspective, if there are 30,000 distinct RPKI repository publication points and a serial form of local synchronisation using a top-down tree walk, then the same set of assumptions imply that the relying party needs to process the synchronisation with the remote caches (including, minimally, the manifest crypto calculation) at a rate of 3 per second. Assuming that there are 200,000 distinct ROAs out there that are re-validated at each fetch, then once more the numbers imply that a 3 hour refresh would mean that the relying party would need to validate 200,000 ROAs in 10,800 seconds. That probably needs some pretty quick hardware.

These numbers are pretty much a toss at a dart board, and the draft's authors may well be using a different scale model to justify this recommended time cycle. What numbers did you have in mind, Matt, that would make this "SHOULD" 3 hour refresh cycle feasible in a big-I Internet scenario of universal use?


Geoff

WG Co-Chair hat off








