Geoff,
I'm happy to accept that the new wording is poor, but I'm pretty sure
the old wording was also bad, and I think this discussion is important.
The old wording could easily be interpreted to suggest that once per day
was the correct frequency for pulling from a repository. (That is, I
believe the previous version was making a de facto recommendation for a
default behavior of one pull every 24 hours ... there wasn't a RECOMMEND
in the text, but we all know that examples tend to be normative in this
type of document.)
1) So the first implicit question is: Should the working group be making
a recommendation as to the frequency with which a relying party pulls
from the repository?
Or equivalently: Is there a "wrong" frequency that people might use if
we didn't give them any guidance?
It seems that retrieving updates "too frequently" (e.g., every 5
minutes) strains the repository system and that retrieving updates "too
infrequently" (e.g., monthly) means that when I inject a new ROA into
the system, it will take "unacceptably long" for this information to
propagate to the relying parties that make use of it.
Therefore, we should have text in the document that articulates some
middle ground that we believe is reasonable for the Internet. (I make no
claims that the current text in the document achieves this goal.)
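As a crude illustration of this tradeoff, here is a throwaway Python
sketch (the 30,000 relying-party figure is the one I discuss below, and
the candidate intervals are arbitrary assumptions):

    # Worst-case propagation delay for a new ROA is roughly one polling
    # interval; aggregate query load scales inversely with the interval.
    RELYING_PARTIES = 30000
    for interval_hours in (5.0 / 60, 3, 24, 24 * 30):  # 5 min, 3 h, daily, monthly
        interval_s = interval_hours * 3600.0
        print("poll every %7.2f h: worst-case ROA delay ~%7.2f h, ~%6.2f queries/sec system-wide"
              % (interval_hours, interval_hours, RELYING_PARTIES / interval_s))

The "middle ground" is whatever combination of delay and aggregate query
rate we collectively decide is tolerable.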
2) The second question is: If we make a recommendation regarding the
frequency with which relying parties should pull updates, what frequency
should we recommend?
Here, I understand that "everyone hitting the repository system at once"
is a bad outcome regardless of the frequency that we recommend. That is,
regardless of whether we recommend "once per day", "once per month", or
"eight times daily" we will likely see problems with too much server
load at midnight. If anyone can recommend text to avoid this phenomenon
(i.e., to encourage people to spread out their queries to the repository
system), please send text.
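For concreteness, the sort of thing I have in mind is randomized
scheduling on the relying party side, roughly like the following Python
sketch (the 3-hour interval and 25% jitter are placeholder assumptions,
and fetch_and_validate is a hypothetical stand-in for the actual rsync
sweep and validation pass):

    import random
    import time

    NOMINAL_INTERVAL = 3 * 3600   # assumed nominal refresh interval (seconds)
    JITTER_FRACTION = 0.25        # assumed: spread fetches over +/- 25% of the interval

    def fetch_and_validate():
        # Hypothetical placeholder for the rsync sweep and validation pass.
        pass

    def next_fetch_delay():
        # Randomize each relying party's wait so fetches don't all land on
        # the same clock chime at the repository servers.
        return NOMINAL_INTERVAL * (1.0 + random.uniform(-JITTER_FRACTION, JITTER_FRACTION))

    while True:
        time.sleep(next_fetch_delay())
        fetch_and_validate()

Even a sentence in the draft recommending something that informal might
be enough to keep the offered load reasonably flat.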
I agree that there are roughly 30,000 AS numbers visible in BGP, so it's
reasonable to assume on the order of 30,000 relying parties who will be
routinely querying the repository system. We might also assume that
30,000 is a reasonable order of magnitude for the number of CAs in the
RPKI (we might easily average 2 CAs per AS, but surely not 10 CAs per AS).
However, one thing that wasn't clear from reading your analysis was how
many CAs a given repository server would be hosting. If a server run by
a large ISP or an RIR was providing a cache of all RPKI data, then
clients would have longer connections to this server (as they could
retrieve much of the data they need in one place), but they would be
unlikely to receive requests from all 30,000 relying parties (e.g. an
ISP might provide a complete cache for their customers but for
non-customers they would typically only serve data for which they are
authoritative). Alternatively, if a server is only serving data for a
small subset of the CAs in the RPKI, then it might receive requests from
all relying parties, but those sessions would tend to be short
(especially when nothing has changed).
In any case, I believe the way forward (with regards to server load) is
to answer the question, "How many simultaneous connections are
reasonable for a server that hosts publication points for X CAs?" and
then work backwards from there to determine if a given interval of
relying party requests is reasonable from the server standpoint. I admit
that I haven't completely thought through re-key, but I'll try to dig up
some rough connection-time numbers based on our relying party software,
and do a few back-of-the-envelope computations.
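In the meantime, the trivial arithmetic I would start from looks like
this (a Python sketch; the session durations are borrowed from Geoff's
assumptions below and are placeholders until I have measured numbers):

    # Average concurrent rsync sessions at a repository server
    #   = (relying parties / refresh interval) * session duration.
    # All inputs are assumptions for illustration.
    RELYING_PARTIES   = 30000
    REFRESH_INTERVAL  = 3 * 3600.0   # seconds between fetches per relying party
    NO_CHANGE_SESSION = 30.0         # seconds for a sweep when nothing has changed
    REKEY_SESSION     = 3 * 60.0     # seconds to re-fetch all products after a re-key

    arrival_rate = RELYING_PARTIES / REFRESH_INTERVAL   # sessions per second
    print("arrival rate:      %.1f sessions/sec" % arrival_rate)                                # ~2.8
    print("steady-state load: %.0f concurrent sessions" % (arrival_rate * NO_CHANGE_SESSION))   # ~83
    print("post-re-key load:  %.0f concurrent sessions" % (arrival_rate * REKEY_SESSION))       # ~500

Those are roughly Geoff's 90 and 540 figures; the open question is what
session durations and what number of CAs per server are actually
realistic.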
With regards to client load, I'm not convinced that there's any problem
with frequent queries to the repository system. If the relying party
queries a publication point and rsync determines that nothing has
changed, then no changes are required to the relying party's local
cache and no cryptographic calculations are required. If something has
changed, then the relying party has to perform validation (which
includes cryptographic signature verification) on the manifest and any
new objects that have been added. (Additionally, there may be resulting
changes to the client's local cache ... e.g., if a new CRL revokes a
previously valid certificate ... but such changes don't require new
cryptographic computations, and so I believe the bottleneck is going to
be the one or two signature verifications per object changed [1]). Now
the point from the relying party side is that if 5,000 manifests change
and 10,000 signed objects are added to the repository system on a given
day, then the relying party needs to do roughly 30,000 signature
verifications regardless of whether it learns of all these changes at
once, or whether it learns of them in small batches throughout the
course of the day. Therefore, I don't see how making frequent checks for
new data has a significant impact on the relying party's processing load.
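To put crude numbers on that last point (again, just a sketch; the
two-verifications-per-object and 1 ms per verification figures are
assumptions, not measurements):

    # Relying party crypto load depends on how much changed, not on how
    # often we poll. Numbers match the example in the paragraph above.
    CHANGED_MANIFESTS        = 5000
    NEW_SIGNED_OBJECTS       = 10000
    VERIFICATIONS_PER_OBJECT = 2       # assumed: EE certificate + object signature
    SECONDS_PER_VERIFICATION = 0.001   # assumed: roughly 1 ms per RSA verification

    total = (CHANGED_MANIFESTS + NEW_SIGNED_OBJECTS) * VERIFICATIONS_PER_OBJECT
    print("signature verifications per day: %d" % total)                          # 30,000
    print("CPU time per day: %.0f seconds" % (total * SECONDS_PER_VERIFICATION))  # ~30

That total is the same whether the relying party learns of the changes
in one daily batch or in eight smaller ones.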
Finally, in addition to server and relying party processing loads, one
must also look at the benefit of frequent repository fetches. Keep in
mind, that a relying party has no way of distinguishing the following
two events: (A) a route advertisement is originated by an AS that is
authorized to advertise the route, but the relying party hasn't fetched
recently enough to obtain the new ROA; and (B) a route advertisement is
originated by an unauthorized entity that is attempting to hijack
address space. In this discussion, it is also important to note that
manifests can guarantee that the relying party received all signed
objects that existed at the moment that the manifest was published
(i.e., a manifest can detect malicious deletion of data from a
repository or corruption of data in transit) but the manifest says
nothing about data that may have been added since the manifest was
issued. This is why there is benefit in a relying party going back to
the publication point periodically to see whether a new manifest has
been issued.
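In code terms, the check a relying party performs on each revisit is
roughly the following (a hypothetical sketch; manifest_number and
this_update correspond to the manifestNumber and thisUpdate fields of
the manifest profile, and the Manifest class and timestamps are invented
for illustration):

    class Manifest(object):
        def __init__(self, manifest_number, this_update):
            self.manifest_number = manifest_number   # monotonically increasing counter
            self.this_update = this_update           # GeneralizedTime string, e.g. "20091102120000Z"

    def manifest_is_newer(cached, fetched):
        # Without this periodic check, a relying party can detect deletions
        # from the publication point (objects listed on its cached manifest
        # that have vanished) but never learns that newer data exists.
        if cached is None:
            return True
        if fetched.manifest_number != cached.manifest_number:
            return fetched.manifest_number > cached.manifest_number
        return fetched.this_update > cached.this_update

    cached  = Manifest(41, "20091101120000Z")
    fetched = Manifest(42, "20091102120000Z")
    assert manifest_is_newer(cached, fetched)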
In any case, it's good to know that we'll have plenty to talk about in
Hiroshima.
- Matt Lepinski
Geoff Huston wrote:
WG Co-Chair Hat OFF
Hi Matt,
entities who are actually using RPKI data for routing SHOULD be
fetching fresh data from the repositories at least once every three
hours.
3 hours?
At a first pass that seems very frequent.
From a server's perspective, if there are 30,000 ASes out there and
each is running a local cache and each is a distinct relying party of
the RPKI system, then the local hit rate at the server would be 3 per
second, assuming that all the relying parties evenly spread their
load (which is a pretty wild assumption - the worst case is that all
30,000 attempt to resync at the 3 hour clock chime point). Assuming
that a repository sweep with no updates takes 30 seconds to complete
then the server would have an average load of some 90 concurrent sync
sessions. If there is a local rekey then the refresh would also imply
a reload of all the signed products at this repository publication
point. Assuming that this would then take 3 minutes to download, then
the rekey load per server would be of the order of 540 concurrent
rsync sessions as an average load. These load numbers appear to me to
be somewhat large.
From the relying party's perspective, if there are 30,000 distinct
RPKI repository publication points, and a serial form of local
synchronisation using a top-down tree walk, then the same set of
assumptions imply that the relying party needs to process the
synchronisation with each remote cache (including, minimally, the
manifest crypto calculation) at a rate of 3 per second.
Assuming that there are 200,000 distinct ROAs out there that are re-
validated at each fetch then once more the numbers imply that a 3
hour refresh would mean that the relying party would need to
validate 200,000 ROAs in 10,800 seconds. That probably needs some
pretty quick hardware.
These numbers are pretty much a toss at a dart board, and the draft's
authors may well be using a different scale model to justify this
recommended time cycle. What numbers did you have in mind, Matt, that
would make this "SHOULD" 3 hour refresh cycle feasible in a big-I
Internet scenario of universal use?
Geoff
WG Co-Chair hat off