Hi all,

First of all, to be very clear: there was no 'APNIC outage', APNIC did nothing wrong. This was a 'validator outage', and locally outages like these can continue to be experienced at any future moment until fixed versions are released and deployed.

Note: network operators who run FORT or OpenBSD rpki-client side-by-side with routinator/octorpki will have seen a stable item count in the merged VRP set on their EBGP routers. In this situation RPKI validator software diversity helped the Internet remain more stable.

APNIC staff are commendable for having seen an opportunity to implement a workaround for this routinator 0.8.1 quirk, but APNIC is just one of the tens of thousands of Certificate Authorities in the RPKI ecosystem. In short: the observed state of December 1st, 2020 00:00 UTC is an expected and normal state in the RPKI ecosystem.

I appreciate George reaching out to the community to draw more attention to the situation, as it seems we can learn from exploring this situation in great detail. For many in the community RPKI is a new technology.

Also, it appears a similar issue exists in Cloudflare's OctoRPKI, so I notified their developers too about the problem & solution. Since there are implementations with a bug in the same equivalence class, this case is best handed over to the IETF.

While keeping in mind our human perception of the concept of time generally is somewhat incompatible with how time works in the X.509 / RPKI crypto world... here are my lengthy debug notes. :-)

TL;DR:

- the VRP drop is an implementation issue in some RPKI validators, and it can happen again
- solution: wait for a fixed version, or run multiple different RPKI validator implementations side by side
- there is a bit of time pressure: this bug potentially interacts negatively with Juniper PR1483097.

Every 20 minutes I copy all RPKI data from the Internet, run rpki-client [1], and store the original RPKI data files, the program's execution log, and the resulting VRP list as individual ZFS snapshots for post-mortem analysis.

A copy of my data can be downloaded: it is an exact snapshot of all input data from that moment, which can be used to replay the event in various implementations.

http://sobornost.net/~job/rpki-20201201-0001-adrian.sobornost.net.tar.gz

Looking at the process' log of the December 1st, 2020 run starting at midnight for the string 'apnic':

root@adrian:/tank/rpkirepositories/.zfs/snapshot/20201201-0001# fgrep apnic output/log
Dec 01 00:00:01 rpki-client: https://tal.apnic.net/apnic.cer: https schema ignored
Dec 01 00:00:01 rpki.apnic.net/repository: pulling from network
Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository: loaded from cache
Dec 01 00:00:03 rpki-client: rpki.apnic.net/member_repository: pulling from network
Dec 01 00:00:03 rpki-client: rpki.sub.apnic.net/repository: pulling from network
Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer: certificate has expired
Dec 01 00:00:03 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/9lv88f3YSSS6iXQmzBvPX6hvnQM.cer: certificate has expired
Dec 01 00:00:03 rpki-client: rpki.rand.apnic.net/repo: pulling from network
Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/pBp2e-TKxusbiXQjNgwrQ1OsH_s.cer: certificate has expired
Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/ZnMLuaQLNc_lmxGF9iLb0JAMbZA.cer: certificate has expired
Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/yZYCtJIcaINWT0smUVwdY-TPNkQ.cer: certificate has expired
Dec 01 00:00:04 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/WFBPIARWFTaBikTQvkFutQVej0g.cer: certificate has expired
Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/QmfPXQMASo_v3yE5XQ_oJFSLE8E.cer: certificate has expired
Dec 01 00:00:05 rpki-client: rpki.sub.apnic.net/repository: loaded from cache
Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/maB2Nu64AHCDMDGWpYxBvsxoj4A.cer: certificate has expired
Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2/d0JlIBzwsNjMdvAm-Ir2i1XpkO4.cer: certificate has expired
Dec 01 00:00:05 rpki-client: rpki.apnic.net/repository/B3A24F201D6611E28AC8837C72FD1FF2/0I2GgcK-TUfCopBV9m5olVhGF_c.cer: certificate has expired
Dec 01 00:00:06 rpki-client: rpki.rand.apnic.net/repo: loaded from cache
Dec 01 00:00:12 rpki-client: rpki.apnic.net/member_repository: loaded from cache

(At the end of the process's run it had observed 62,154 VRPs under the APNIC TAL. A CSV & JSON file of the validation process output with all VRPs from that moment is also included in the tar.gz file.)

In the above log we see that a number of certificates are expired. According to Tom's message [2] these certificates represent APNIC members whose membership has been closed (for example: companies going out of business, or merger & acquisition). It is expected for organizations issuing cryptographic products to tie business events to validity periods in certificates.

For the purpose of these notes I'll focus only on following the validation process towards 'ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer' in a manual fashion using command line utilities.

After having pulled the RPKI data from the web (which operationally speaking end-to-end is a multi-hour process to get the data from signer to validator), a number of process steps have to be performed in order to produce a list of Validated ROA Payloads (VRPs). None of these steps can be skipped, and the order is important too.

A single manifest file (https://tools.ietf.org/html/rfc6486) actually is a bundle of a few things: a start & end date of the file listing, a list of filenames and sha256 hashes, an EE certificate (which also has its own embedded start & end date!), a serial number, and references to other things such as which entity signed it.

The first step is to figure out whether a given manifest file is 'valid' (are the signatures right), 'current' (the timestamp on the validator's wall clock is between both the manifest's embedded start & end date AND the EE certificate validity dates), and the 'latest' (should the validator have to choose between two versions of the file, both valid and current, pick the one with the highest manifest number).

So at December 1st 00:00:03 UTC, the manifest's start & end date, and the EE certificate's start and end date were:

$ tar fxz rpki-20201201-0001-adrian.sobornost.net.tar.gz
$ cd 20201201-0001/data/rpki.apnic.net/repository/B527EF581D6611E2BB468F7C72FD1FF2
$ ls -lahtr DmWk9f02tb1o6zySNAiXjJB6p58.mft
-rw-r--r-- 1 job wheel 214K Nov 30 23:01 DmWk9f02tb1o6zySNAiXjJB6p58.mft

This file's mtime (the timestamp shown by ls) appears to be November 30th, 23:01.

# check manifest's econtent start & end date
$ strings DmWk9f02tb1o6zySNAiXjJB6p58.mft | head -2
20201130230107Z
20201202230107Z

December 1st 00:00:03 is between November 30th 23:01:07 and December 2nd 23:01:07: check!

# check the manifest's embedded EE certificate start & end date:
$ test-mft -vp DmWk9f02tb1o6zySNAiXjJB6p58.mft | openssl x509 -text | grep -A2 Validity
Validity
    Not Before: Nov 30 23:01:07 2020 GMT
    Not After : Dec 2 23:01:07 2020 GMT

December 1st 00:00:03 is between November 30th 23:01:07 and December 2nd 23:01:07: check!

With the dates and signatures of the manifest file checking out to be 'all lights green', the next step is to process the manifest's file listing. A manifest 'file listing' is checked through two steps:

- is the listed file present?
- is the sha256 hash (in base64 format) listed on the manifest the same as the sha256 hash computed by the validator using a copy of the listed file?

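The 'current' and 'latest' checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how rpki-client or any other validator implements it: the function names and the `number` dictionary key are hypothetical, signature verification is left out entirely, and the validity window is treated as inclusive on both ends. The timestamps are the ones shown in these notes.

```python
from datetime import datetime, timezone


def parse_generalized_time(text: str) -> datetime:
    """Parse a timestamp such as '20201130230107Z' into an aware UTC datetime."""
    return datetime.strptime(text, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)


def is_current(now: datetime, not_before: datetime, not_after: datetime) -> bool:
    """True when the wall clock falls inside the validity window (inclusive)."""
    return not_before <= now <= not_after


def manifest_is_current(now, listing_window, ee_window) -> bool:
    """Both the file listing and the embedded EE certificate must be current."""
    return is_current(now, *listing_window) and is_current(now, *ee_window)


def pick_latest(candidates):
    """Among valid and current manifests, prefer the highest manifest number.

    'candidates' is a hypothetical list of dicts with a 'number' key.
    """
    return max(candidates, key=lambda candidate: candidate["number"])


# Values taken from the notes; the validator's wall clock read 00:00:03 UTC.
wall_clock = datetime(2020, 12, 1, 0, 0, 3, tzinfo=timezone.utc)
listing_window = (
    parse_generalized_time("20201130230107Z"),
    parse_generalized_time("20201202230107Z"),
)
ee_window = listing_window  # the EE certificate carried the same dates here

print(manifest_is_current(wall_clock, listing_window, ee_window))  # True
```

Keeping the two windows as separate arguments mirrors the point that a manifest carries two independent pairs of dates, and both pairs have to hold.
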
# looking at manifest file listing:
$ test-mft -v DmWk9f02tb1o6zySNAiXjJB6p58.mft | grep -A1 ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
95: ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
hash YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=

# checking whether file is present:
$ ls -alhtr ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
-rw-r--r-- 1 job wheel 1.5K Nov 30 23:01 ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer

# compute sha256 hash of the file
$ sha256 -b ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer
SHA256 (ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer) = YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=

Indeed, the 'YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=' hash computed from the referenced certificate file is the same one as listed in the manifest file (which we inspected with test-mft)!

Note that at this stage of the validation process the 'ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer' file has not been processed in any way other than the equivalent of that 'sha256' OpenBSD utility.

These 'jumps' from certificate to manifest to certificate using hashes & signatures serve multiple purposes. First: by confirming a hash matches, the validator does not (yet) need to attempt any file content parsing (which would be a potentially sensitive computing operation on what is, at that point in time, an unknown and potentially dangerous file). Secondly: by checking the presence and hash of each file, the publication point's completeness and integrity is confirmed. Missing .roa files can result in network outages [3].

At this point the manifest file has been completely processed, and the next step in the validation process can commence. Each and every referenced file is opened by the validator, embedded certificates and signatures are verified, and then again file contents are processed (could be manifests, certificates, CRLs, or ROA files).

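For those without the OpenBSD 'sha256' utility at hand, the presence-and-hash check on a single manifest entry can be approximated with the Python standard library. A minimal sketch: the function names and the three result strings are my own illustration, not taken from any validator.

```python
import base64
import hashlib
from pathlib import Path


def sha256_base64(data: bytes) -> str:
    """SHA-256 digest of the raw bytes, base64 encoded (like 'sha256 -b')."""
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")


def check_listing_entry(directory: Path, filename: str, listed_hash: str) -> str:
    """Check one manifest entry without parsing the file's contents.

    Returns 'missing', 'hash mismatch', or 'ok' (hypothetical labels).
    """
    path = directory / filename
    if not path.is_file():
        return "missing"
    if sha256_base64(path.read_bytes()) != listed_hash:
        return "hash mismatch"
    return "ok"
```

Run from the B527EF581D6611E2BB468F7C72FD1FF2 directory of the unpacked tarball, `check_listing_entry(Path("."), "ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer", "YjVYYAHzd5UFgeKVJGa+2zLy6uQHH+j4EmiH43ypgZc=")` should agree with the 'sha256 -b' output shown above. The file is only read as opaque bytes, so nothing in it is interpreted at this stage.
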
Let's inspect ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer:

$ openssl x509 -in ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer -inform DER -text | grep -A2 Validity
Validity
    Not Before: Oct 23 10:14:32 2019 GMT
    Not After : Dec 1 00:00:00 2020 GMT

As the validator's wall clock was December 1st 00:00:03, we can see that ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer expired '3 seconds ago'. Note that before we observed that the timestamp on the manifest file which referenced this .cer file was November 30th; at that time this ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer certificate was valid, present, and current!

One could say that ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer is a child of DmWk9f02tb1o6zySNAiXjJB6p58.mft. The ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer file might not even be under control of the entity which generated DmWk9f02tb1o6zySNAiXjJB6p58.mft. A child's expiry does not result in the death of the parent.

If a validator considers all referenced files on a manifest to be invalid, solely because *upon further inspection* a file contained an expired EE certificate, I'd say it is an 'overreaction', a simple software defect. After all, there was a valid current manifest which listed a hash and that hash matched the file, so the file became eligible for X.509 certificate validation in the first place!

It appears that Routinator conflates two distinct steps in the validation process:

step 1) checking the validity of a RPKI manifest
step 2) checking the validity of a file referenced from the manifest validated in step 1

A valid manifest referencing a (now expired) certificate is a legitimate state of being. What is not valid is for the manifest listing itself to be expired, or the manifest's EE certificate to be expired, or its CRL to be expired, or its parent certificate to be expired, or for any files listed on the manifest to be missing, or for any sha256 hashes to be different than listed on the manifest.

Phew.... that's a mouthful of conditions! We're gonna have to work in the IETF to capture this in simpler English.

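To make the separation between step 1 and step 2 concrete, here is a rough Python sketch. It is a simplification under stated assumptions, not Routinator's, OctoRPKI's, or rpki-client's actual logic: signatures, CRLs, and the parent certificate are not modelled, all class and function names are hypothetical, and the second listed file is an invented sibling that does not exist in the real repository.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


def utc(*args) -> datetime:
    """Shorthand for building an aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


@dataclass
class Window:
    not_before: datetime
    not_after: datetime

    def contains(self, now: datetime) -> bool:
        return self.not_before <= now <= self.not_after


@dataclass
class ListedFile:
    name: str
    present: bool        # outcome of the 'is the file there?' check
    hash_matches: bool   # outcome of the sha256 comparison
    validity: Window     # validity dates found once the file is parsed


@dataclass
class Manifest:
    listing: Window
    ee_cert: Window
    files: List[ListedFile] = field(default_factory=list)


def manifest_ok(mft: Manifest, now: datetime) -> bool:
    """Step 1: judge the manifest itself (dates, presence, hashes)."""
    if not (mft.listing.contains(now) and mft.ee_cert.contains(now)):
        return False
    return all(f.present and f.hash_matches for f in mft.files)


def accepted_files(mft: Manifest, now: datetime) -> List[str]:
    """Step 2: judge each listed file on its own merits."""
    if not manifest_ok(mft, now):
        return []
    # An expired child is dropped individually; its siblings are unaffected.
    return [f.name for f in mft.files if f.validity.contains(now)]


now = utc(2020, 12, 1, 0, 0, 3)
mft_window = Window(utc(2020, 11, 30, 23, 1, 7), utc(2020, 12, 2, 23, 1, 7))
manifest = Manifest(
    listing=mft_window,
    ee_cert=mft_window,
    files=[
        ListedFile("ZwTFeTEC0uxi4JpTfGQbsyoqqhM.cer", True, True,
                   Window(utc(2019, 10, 23, 10, 14, 32), utc(2020, 12, 1, 0, 0, 0))),
        # Hypothetical sibling with invented dates, for illustration only.
        ListedFile("hypothetical-sibling.cer", True, True,
                   Window(utc(2020, 1, 1), utc(2021, 12, 31))),
    ],
)
print(accepted_files(manifest, now))  # ['hypothetical-sibling.cer']
```

The defect described above would correspond to returning an empty list from step 2 as soon as one child turns out to be expired, instead of skipping only that child.
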
Conclusion
==========

I'm not saying validators should accept expired data, they shouldn't! But it is *expected* that Certificate Authorities (like LIRs, NIRs, or even RIRs) set the expiration dates on cryptographic objects to be aligned with the reality of business contracts. This is a *critical* feature of the RPKI and makes RPKI superior to IRR data: finally there /are/ expiration dates on the equivalent of 'route:' objects.

A repeat of the 'December 1st' VRP drop situation can come into existence at any future moment under any Trust Anchor, under any Certificate Authority. Simply put: networks solely relying on current versions of octorpki or routinator are somewhat at risk when billing cycles end.

Also, I do not recommend downgrading to older versions because of https://www.nlnetlabs.nl/projects/rpki/security-advisories/ (which perversely is a bug that *is not* resolved with rpki software diversity).

I suspect it is OK for network operators to choose to sit this one out and just wait for a fixed version, provided it can be released in a matter of weeks. Because of Juniper PR1483097 (which probably still affects many currently deployed internet routers) the complete disappearance of VRPs can negatively impact internet traffic forwarding in the default-free zone, but as mentioned before, impact is avoided through multi-instance validator deployment combined with validator software diversity.

There is a silver lining in all this: the most likely next occurrence of this type of situation is January 1st, 2021, as then all kinds of LIR, NIR, or RIR business contracts are likely to start or stop. This gives nlnetlabs and cloudflare almost a full month to figure out a fix, release it, and for operators to deploy it in their networks during the holidays. The perfect excuse to escape any unwanted Christmas dinner.
;-)

I propose some of us continue the discussion at [email protected], where through wordsmithing in the draft-ietf-sidrops-6486bis effort we can help any future RPKI implementers avoid walking into the same problem.

Kind regards,

Job

[1]: https://pkgs.org/search/?q=rpki-client
[2]: https://lists.nlnetlabs.nl/pipermail/rpki/2020-December/000238.html
[3]: https://blog.apnic.net/2020/11/10/rpki-manifests-securely-declare-contents/

--
RPKI mailing list
[email protected]
https://lists.nlnetlabs.nl/mailman/listinfo/rpki
