Re: plea for comcast/sprint handoff debug help

2020-11-11 Thread Tony Tauber
Yes, to Tim and the NLnet Labs folks,

Thanks for responding to the community concerns and experiences.

Tony

On Wed, Nov 11, 2020 at 10:48 AM Christopher Morrow 
wrote:

> On Wed, Nov 11, 2020 at 9:06 AM Tim Bruijnzeels  wrote:
> >
> > Hi Chris, list,
> >
> > > On 10 Nov 2020, at 05:22, Christopher Morrow 
> wrote:
> > >
> > > sure... it's just made one set of decisions. I was hoping with some
> > > discussion we'd get to:
> > > Welp, sure we can fallback and try rsync if we don't see success in
>  time.
> >
> > We will implement fallback in the next release of routinator.
>
> cool thanks!
>
> > We still believe that there are concerns why one may not want to fall
> back, but we also believe that it will be more constructive to have the
> technical discussion on this as part of the ongoing deprecate rsync effort
> in the sidrops working group in the IETF.
> >
>
> I look forward to chatting about this :)
> I think, yes with the coming (so soon!) deprecation of rsync having a
> smooth transition of power from rsync -> rrdp would be great.
>
> thanks for reconsidering!
> -chris
>


Re: plea for comcast/sprint handoff debug help

2020-11-11 Thread Christopher Morrow
On Wed, Nov 11, 2020 at 9:06 AM Tim Bruijnzeels  wrote:
>
> Hi Chris, list,
>
> > On 10 Nov 2020, at 05:22, Christopher Morrow  
> > wrote:
> >
> > sure... it's just made one set of decisions. I was hoping with some
> > discussion we'd get to:
> > Welp, sure we can fallback and try rsync if we don't see success in  
> > time.
>
> We will implement fallback in the next release of routinator.

cool thanks!

> We still believe that there are concerns why one may not want to fall back, 
> but we also believe that it will be more constructive to have the technical 
> discussion on this as part of the ongoing deprecate rsync effort in the 
> sidrops working group in the IETF.
>

I look forward to chatting about this :)
I think, yes with the coming (so soon!) deprecation of rsync having a
smooth transition of power from rsync -> rrdp would be great.

thanks for reconsidering!
-chris


Re: plea for comcast/sprint handoff debug help

2020-11-11 Thread Tim Bruijnzeels
Hi Chris, list,

> On 10 Nov 2020, at 05:22, Christopher Morrow  wrote:
> 
> sure... it's just made one set of decisions. I was hoping with some
> discussion we'd get to:
> Welp, sure we can fallback and try rsync if we don't see success in  
> time.

We will implement fallback in the next release of routinator.

We still believe that there are concerns why one may not want to fall back, but 
we also believe that it will be more constructive to have the technical 
discussion on this as part of the ongoing deprecate rsync effort in the sidrops 
working group in the IETF.

Regards,

Tim



Re: plea for comcast/sprint handoff debug help

2020-11-09 Thread Christopher Morrow
On Fri, Nov 6, 2020 at 3:09 PM Randy Bush  wrote:
>
> >> really?  could you be exact, please?  turning an optional protocol off
> >> is not a 'failure mode'.
> > I suppose it depends on how you think you are serving the data.
> > If you thought you were serving it on both protocols, but 'suddenly'
> > the RRDP location was empty that would be a failure.
>
> not necessarily.  it could merely be a decision to stop serving rrdp.
> perhaps a security choice; perhaps a software change; perhaps a phase
> of the moon.

right this is all in the same set of: "failure modes not caught"
(I think, I don't care so much WHY you stopped serving RRDP, just that
after a few failures
the caller should try my other number (rsync))

>
> as i do not see rrdp as a critical service, after all it is not mti,
> but i am quite aware of whether it is running or not.  the problem is
> that routinator seems not to be.

sure... it's just made one set of decisions. I was hoping with some
discussion we'd get to:
Welp, sure we can fallback and try rsync if we don't see success in  time.


Re: plea for comcast/sprint handoff debug help

2020-11-06 Thread Randy Bush
i may understand one place you could get confused.  unlike a root CA
which publishes a TAL which describes transports, a non-root CA does not
publish a TAL describing what transports it supports.  of course, rsync
is mandatory to provide; but anything else is "if it works, enjoy it.
otherwise use rsync."

randy


Re: plea for comcast/sprint handoff debug help

2020-11-06 Thread Randy Bush
>> really?  could you be exact, please?  turning an optional protocol off
>> is not a 'failure mode'.
> I suppose it depends on how you think you are serving the data.
> If you thought you were serving it on both protocols, but 'suddenly'
> the RRDP location was empty that would be a failure.

not necessarily.  it could merely be a decision to stop serving rrdp.
perhaps a security choice; perhaps a software change; perhaps a phase
of the moon.

> One of my points was that it appeared that the software called 'bad
> tls cert' (among other things I'm sure) a failure, but not 'empty
> directory' (or no diff file). It's possible that ALSO 'no diff' is
> considered a failure

what the broken client software called what is not my problem.  every
http[s] server in the universe is not necessarily an rrdp server.  if
the client has some belief, for whatever reason, that it should be one,
that is a brokenness.

> I don't think alex is wrong in stating that 'ideally the operator
> monitors/alerts on health of their service'

i do.  i run clients.

> My suggestion is that checking the alternate transport is helpful.

as i do not see rrdp as a critical service, after all it is not mti,
but i am quite aware of whether it is running or not.  the problem is
that routinator seems not to be.

randy


Re: plea for comcast/sprint handoff debug help

2020-11-06 Thread Christopher Morrow
On Fri, Nov 6, 2020 at 5:47 AM Randy Bush  wrote:
>
> > Admittedly someone (randy) injected a pretty pathological failure
> > mode into the system
>
> really?  could you be exact, please?  turning an optional protocol off
> is not a 'failure mode'.

I suppose it depends on how you think you are serving the data.
If you thought you were serving it on both protocols, but 'suddenly' the RRDP
location was empty that would be a failure.

Same if your RRDP location's tls certificate dies...
One of my points was that it appeared that the software called 'bad
tls cert' (among other things I'm sure)
a failure, but not 'empty directory' (or no diff file). It's possible
that ALSO 'no diff' is considered a failure
but that swapping to an alternate transport after a few failures was not
implemented. (I don't know, I have not looked
at that part of the code, and I don't think alex/tim said either way).

I don't think alex is wrong in stating that 'ideally the operator
monitors/alerts on the health of their service', I
think it's shockingly often that this isn't actually done though. (and
isn't germane in the case of the test / research in question)

My suggestion is that checking the alternate transport is helpful.

-chris


Re: plea for comcast/sprint handoff debug help

2020-11-06 Thread Tony Tauber
On Fri, Nov 6, 2020 at 1:28 AM Christopher Morrow 
wrote:


> I think a way forward here is to offer a suggestion for the software
> folk to cogitate on and improve?
>    "What if (for either rrdp or rsync) there is no successful
> update[0] in X of Y attempts,
>    attempt the other protocol to sync down to bring the remote PP back
> to life in your local view."
>
>
100%  Please do this.
I also agree with Job's pleas to consider this work as part of the path
outlined in the RSYNC->RRDP transition draft mentioned below.

Tony


> This both allows the RP software to pick their primary path (and stick
> to that path as long as things work) AND
> helps the PP folk recover a bit quicker if their deployment runs into
> troubles.
>


> >
> > > This is a tradeoff. I think that protecting against replay should be
> > > considered more important here, given the numbers and time to fix
> > > HTTPS issue.
> >
> > The 'replay' issue you perceive is also present in RRDP. The RPKI is a
> > *deployed* system on the Internet and it is important for Routinator to
> > remain interoperable with other non-nlnetlabs implementations.
> >
> > Routinator not falling back to rsync does *not* offer a security
> > advantage, but does negatively impact our industry's ability to migrate
> > to RRDP. We are in 'phase 0' as described in Section 3 of
> > https://tools.ietf.org/html/draft-sidrops-bruijnzeels-deprecate-rsync
> >
> > Regards,
> >
> > Job
>


Re: plea for comcast/sprint handoff debug help

2020-11-06 Thread Randy Bush
> Admittedly someone (randy) injected a pretty pathological failure
> mode into the system

really?  could you be exact, please?  turning an optional protocol off
is not a 'failure mode'.

randy


Re: plea for comcast/sprint handoff debug help

2020-11-05 Thread Christopher Morrow
I hate to jump in late. but... :)

After reading this a few times it seems like what's going on is:
  o a set of assumptions were built into the software stack
 this seems fine, hard to build with some assumptions :)

  o the assumptions seem to include: "if rrdp fails  feel free
to jump back/to rsync"
I think SOME of the problem is the 'how' there.
Admittedly someone (randy) injected a pretty pathological failure
mode into the system
and didn't react when his 'monitoring' said: "things are broke yo!"

  o absent a 'failure' the software kept on getting along as it had before.
Afterall, maybe the operator here intentionally put their
repository into this whacky state?
How is an RP software stack supposed to know what the PP's
management is meaning to do?

  o lots of debate about how we got to where we are, I don't know that
much of it is really helpful.

I think a way forward here is to offer a suggestion for the software
folk to cogitate on and improve?
   "What if (for either rrdp or rsync) there is no successful
update[0] in X of Y attempts,
   attempt the other protocol to sync down to bring the remote PP back
to life in your local view."

This both allows the RP software to pick their primary path (and stick
to that path as long as things work) AND
helps the PP folk recover a bit quicker if their deployment runs into troubles.

0: I think 'failure' here is clear (to me):
1) the protocol is broken (rsync no connect, no http connect)
2) the connection succeeds but there is no sync-file (rrdp) nor
valid MFT/CRL

The 6486-bis rework effort seems to be getting to: "No MFT? no CRL?
you r busted!"
so I think if you don't get MFT/CRL in X of Y attempts it's safe to
say the PP over that protocol is busted,
and attempting the other proto is acceptable.
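
A minimal sketch of such an "X failures in the last Y attempts" policy, purely
as an illustration (the thresholds, names, and the notion of "success" below
are hypothetical, not taken from any RP implementation):

# Illustrative "X of Y" transport fallback policy, per the suggestion above.
# "Success" here would mean a fetch that yields a current, valid MFT/CRL.
from collections import deque

class TransportSelector:
    def __init__(self, max_failures=3, window=5):
        self.window = window                    # Y: attempts remembered
        self.max_failures = max_failures        # X: failures that trigger fallback
        self.history = {"rrdp": deque(maxlen=window),
                        "rsync": deque(maxlen=window)}
        self.preferred = "rrdp"                 # stick with RRDP while it works

    def record(self, transport, success):
        self.history[transport].append(success)

    def too_many_failures(self, transport):
        h = self.history[transport]
        return len(h) == self.window and list(h).count(False) >= self.max_failures

    def choose(self):
        # Stay on the preferred transport until X of the last Y attempts
        # failed, then try the other protocol for this publication point.
        if self.too_many_failures(self.preferred):
            return "rsync" if self.preferred == "rrdp" else "rrdp"
        return self.preferred

# Example: after three failed RRDP runs out of the last five, fall back.
sel = TransportSelector()
for ok in (True, False, False, False, True):
    sel.record("rrdp", ok)
print(sel.choose())   # -> "rsync"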

thanks!
-chris

On Mon, Nov 2, 2020 at 4:37 AM Job Snijders  wrote:
>
> On Mon, Nov 02, 2020 at 09:13:16AM +0100, Tim Bruijnzeels wrote:
> > On the other hand, the fallback exposes a Malicious-in-the-Middle
> > replay attack surface for 100% of the prefixes published using RRDP,
> > 100% of the time. This allows attackers to prevent changes in ROAs to
> > be seen.
>
> This is a mischaracterization of what is going on. The implication of
> what you say here is that RPKI cannot work reliably over RSYNC, which is
> factually incorrect and an injustice to all existing RSYNC based
> deployment. Your view on the security model seems to ignore the
> existence of RPKI manifests and the use of CRLs, which exist exactly to
> mitigate replays.
>
> > Up until 2 weeks ago Routinator indeed was not correctly validating RPKI
> data, fortunately this has now been fixed:
> https://mailman.nanog.org/pipermail/nanog/2020-October/210318.html
>
> > Also via the RRDP protocol old data can be replayed, because just
> > like RSYNC, the RRDP protocol does not have authentication. When RPKI
> > data is transported from Publication Point (PP) to Relying Party (RP), the RP
> > cannot assume there was an unbroken 'chain of custody' and therefore has
> to validate all the RPKI signatures.
>
> For example, if a CDN is used to distribute RRDP data, the CDN is the
> MITM (that is literally what CDNs are: reverse proxies, in the middle).
> The CDN could accidentally serve up old (cached) content or misserve
> current content (swap 2 filenames with each other).
>
> > This is a tradeoff. I think that protecting against replay should be
> > considered more important here, given the numbers and time to fix
> > HTTPS issue.
>
> The 'replay' issue you perceive is also present in RRDP. The RPKI is a
> *deployed* system on the Internet and it is important for Routinator to
> > remain interoperable with other non-nlnetlabs implementations.
>
> Routinator not falling back to rsync does *not* offer a security
> advantage, but does negatively impact our industry's ability to migrate
> to RRDP. We are in 'phase 0' as described in Section 3 of
> https://tools.ietf.org/html/draft-sidrops-bruijnzeels-deprecate-rsync
>
> Regards,
>
> Job


Re: plea for comcast/sprint handoff debug help

2020-11-02 Thread Job Snijders
On Mon, Nov 02, 2020 at 09:13:16AM +0100, Tim Bruijnzeels wrote:
> On the other hand, the fallback exposes a Malicious-in-the-Middle
> replay attack surface for 100% of the prefixes published using RRDP,
> 100% of the time. This allows attackers to prevent changes in ROAs to
> be seen.

This is a mischaracterization of what is going on. The implication of
what you say here is that RPKI cannot work reliably over RSYNC, which is
factually incorrect and an injustice to all existing RSYNC based
deployment. Your view on the security model seems to ignore the
existence of RPKI manifests and the use of CRLs, which exist exactly to
mitigate replays.

Up until 2 weeks ago Routinator indeed was not correctly validating RPKI
data, fortunately this has now been fixed:
https://mailman.nanog.org/pipermail/nanog/2020-October/210318.html

Also via the RRDP protocol old data can be replayed, because just
like RSYNC, the RRDP protocol does not have authentication. When RPKI
data is transported from Publication Point (PP) to Relying Party (RP), the RP
cannot assume there was an unbroken 'chain of custody' and therefore has
to validate all the RPKI signatures.

For example, if a CDN is used to distribute RRDP data, the CDN is the
MITM (that is literally what CDNs are: reverse proxies, in the middle).
The CDN could accidentally serve up old (cached) content or misserve
current content (swap 2 filenames with each other).

> This is a tradeoff. I think that protecting against replay should be
> considered more important here, given the numbers and time to fix
> HTTPS issue.

The 'replay' issue you perceive is also present in RRDP. The RPKI is a
*deployed* system on the Internet and it is important for Routinator to
remain interoperable with other non-nlnetlabs implementations.

Routinator not falling back to rsync does *not* offer a security
advantage, but does negatively impact our industry's ability to migrate
to RRDP. We are in 'phase 0' as described in Section 3 of
https://tools.ietf.org/html/draft-sidrops-bruijnzeels-deprecate-rsync

Regards,

Job


Re: plea for comcast/sprint handoff debug help

2020-11-02 Thread Tim Bruijnzeels
Hi Randy, all,

> On 31 Oct 2020, at 04:55, Randy Bush  wrote:
> 
>> If there is a covering less specific ROA issued by a parent, this will
>> then result in RPKI invalid routes.
> 
> i.e. the upstream kills the customer.  not a wise business model.

I did not say it was. But this is the problematic case.

For the vast majority of ROAs the sustained loss of the repository would lead 
to invalid ROA *objects*, which will not be used in Route Origin Validation 
anymore leading to the state 'Not Found' for the associated announcements.

This is not the case if there are other ROAs for the same prefixes published by
others (most likely the parent). A quick back-of-the-envelope analysis: this
affects about 0.05% of ROA prefixes.

>> The fall-back may help in cases where there is an accidental outage of
>> the RRDP server (for as long as the rsync servers can deal with the
>> load)
> 
> folk try different software, try different configurations, realize that
> having their CA gooey exposed because they wanted to serve rrdp and
> block, ...

We are talking here about the HTTPS server being unavailable, while rsync *is*.

So this means, your HTTPS server is down, unreachable, or has an issue with its 
HTTPS certificate. Your repository could use a CDN if they don't want to do all 
this themselves. They could monitor, and fix things.. there is time.

The thing is, even if HTTPS becomes unavailable this still leaves hours (8 by
default for the Krill CA, configurable) to fix things. Routinator (and the RIPE
NCC Validator, and others) will use cached data if they cannot retrieve new
data. It's only when manifests and CRLs start to expire that the objects would
become invalid.
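
As a back-of-the-envelope illustration of that window (the timestamps and the
8-hour re-issuance interval below are assumptions for the example, not
measurements):

# Rough illustration of the grace window described above: cached objects stay
# usable until the manifests/CRLs covering them reach their nextUpdate time,
# so an RRDP outage only starts invalidating data after that point.
# All timestamps below are invented for the example.
from datetime import datetime, timedelta, timezone

now = datetime(2020, 11, 2, 9, 0, tzinfo=timezone.utc)

# Hypothetical nextUpdate times of the manifests a relying party has cached,
# assuming a CA (e.g. Krill with default settings) that re-issues every 8 hours:
manifest_next_update = [now + timedelta(hours=8), now + timedelta(hours=20)]

# The first manifest to expire bounds how long the cache can paper over an
# unreachable publication point.
print("time left to repair the RRDP service:", min(manifest_next_update) - now)
# -> 8:00:00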

So the fallback helps in case of incidents with HTTPS that were not fixed 
within 8 hours for 0.05% of prefixes.

On the other hand, the fallback exposes a Malicious-in-the-Middle replay attack 
surface for 100% of the prefixes published using RRDP, 100% of the time. This 
allows attackers to prevent changes in ROAs to be seen.

This is a tradeoff. I think that protecting against replay should be considered
more important here, given the numbers and the time available to fix an HTTPS issue.


> randy, finding the fort rp to be pretty solid!

Unrelated, but sure I like Fort too.

Tim

Re: plea for comcast/sprint handoff debug help

2020-10-31 Thread Randy Bush
> cc.rg.net was unavailable over rsync for several days this week as
> well.

sorry.  it was cb and cc.  it seems some broken RPs did not have the
ROA needed to get to our westin pop.  cf this whole thread.

luckily such things never happen in real operations. :)

randy


Re: plea for comcast/sprint handoff debug help

2020-10-31 Thread Alex Band
Hi Tony,

I realise there are quite a few moving parts, so I'll try to summarise our design
choices and reasoning as clearly as possible.

Rsync was the original transport for RPKI and is still mandatory to implement. 
RRDP (which uses HTTPS) was introduced to overcome some of the shortcomings of 
rsync. Right now, all five RIRs make their Trust Anchors available over HTTPS, 
all but two RPKI repositories support RRDP and all but one relying party 
software packages support RRDP. There is currently an IETF draft to deprecate 
the use of rsync.

As a result, the bulk of RPKI traffic is currently transported over RRDP and 
only a small amount relies on rsync. For example, our RPKI repository is 
configured accordingly: rrdp.rpki.nlnetlabs.nl is served by a CDN and 
rsync.rpki.nlnetlabs.nl runs rsyncd on a simple, small VM to deal with the 
remaining traffic. When operators deploying our Krill Delegated RPKI software 
ask us what to expect and how to provision their services, this is how we 
explain the current state of affairs.

With this is mind, Routinator currently has this fetching strategy:

1. It starts by connecting to the Trust Anchors of the RIRs over HTTPS, if
possible, and otherwise uses rsync.
2. It follows the certificate tree, following several pointers to publication 
servers along the way. These pointers can be rsync only or there can be two 
pointers, one to rsync and one to RRDP.
3. If an RRDP pointer is found, Routinator will try to connect to the service,
verify that there is a valid TLS certificate and that data can be successfully
fetched. If it can, the server is marked as usable and Routinator will prefer it.
If the initial check fails, Routinator will use rsync, but verify whether RRDP
works on the next validation run.
4. If RRDP worked before but is unavailable for any reason, Routinator will
use cached data and try again on the next run instead of immediately falling
back to rsync.
5. If the RPKI publication server operator takes away the pointer to RRDP to 
indicate they no longer offer this communication protocol, Routinator will use 
rsync.
6. If Routinator's cache is cleared, the process will start fresh.
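
A compressed sketch of the per-repository decision in steps 3-6 above, purely
as an illustration (invented names; this is not Routinator source code):

# Illustration of the transport choice described in steps 3-6 above.
def fetch_repository(pointers, state, try_rrdp, try_rsync):
    """One validation run for one publication point.
    pointers: the SIA URIs found on the CA certificate ({"rrdp", "rsync"}).
    state:    per-repository memory kept between runs (cleared with the cache).
    try_*:    callables returning True on a successful, verified fetch."""
    if "rrdp" not in pointers:
        return try_rsync()                    # step 5: RRDP pointer withdrawn
    if state.get("rrdp_usable"):
        # step 4: RRDP worked before; on failure keep cached data and retry
        # RRDP next run rather than falling back to rsync immediately.
        try_rrdp()
        return True
    # step 3: probe RRDP (TLS check + fetch); prefer it from now on if it
    # works, otherwise use rsync this run and probe RRDP again next run.
    if try_rrdp():
        state["rrdp_usable"] = True
        return True
    return try_rsync()

# Example: once RRDP has succeeded it stays preferred, even across an outage.
state = {}                                    # step 6: clearing the cache resets this
fetch_repository({"rrdp", "rsync"}, state, lambda: True, lambda: True)
fetch_repository({"rrdp", "rsync"}, state, lambda: False, lambda: True)  # cached data, no rsync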

This strategy was implemented with repository server provisioning in mind. We 
are assuming that if you actively indicate that you offer RRDP, you actually 
provide a monitored service there. As such, an outage would be assumed to be 
transient in nature. Routinator could fall back immediately, of course. But our
thinking was that if the RRDP service had a small hiccup, the 1000+ Routinator
instances currently deployed would be hammering a possibly underprovisioned rsync
server, perhaps causing even more problems for the operator.

"Transient" is currently the focus. In Randy's experiment, he is actively 
advertising he offers RRDP, but doesn't offer a service there for weeks at a 
time. As I write this, ca.rg.net, cb.rg.net and cc.rg.net have been returning a
404 on their RRDP endpoint for several weeks and counting. cc.rg.net was
unavailable over rsync for several days this week as well.

I would assume this is not how operators would run their RPKI publication 
server normally. Not having an RRDP service for weeks when you advertise you do 
is fine for an experiment but constitutes pretty bad operational practice for a 
production network. If a service becomes unavailable, the operator would 
swiftly be contacted and the issue would be resolved, like Randy and I have 
done in happier times:

https://twitter.com/alexander_band/status/1209365918624755712
https://twitter.com/enoclue/status/1209933106720829440

On a personal note, I realise the situation has a dumpster fire feel to it. I
contacted Randy about his outages months ago, not knowing they were a
research project. I never got a reply. Instead of discussing his research and
the observed effects, it feels like a 'gotcha' to present the findings in this 
way. It could even be considered irresponsible, if the fallout is as bad as he 
claims. The notion that using our software is quote, "a disaster waiting to 
happen", is disingenuous at best:

https://www.ripe.net/ripe/mail/archives/members-discuss/2020-September/004239.html

Routinator's design was to deal with outages in a responsible manner for
all actors involved. Again, of course we can change our strategy as a result of
this discussion, which I'm happy we're now actually having. In that case I
would advise operators who offer an RPKI publication server to ensure that they
provision their rsyncd service so that it is capable of handling all of the
traffic that their RRDP service normally handles, in case RRDP has a glitch.
And even if people do scale their rsync service accordingly, they will only
ever find out whether it actually copes in a time of crisis.

Kind regards,

-Alex

> On 31 Oct 2020, at 07:17, Tony Tauber  wrote:
> 
> As I've pointed out to Randy and others and I'll share here.
> We planned, but hadn't yet upgraded our Routinator RP (Relying Party) 
> 

Re: plea for comcast/sprint handoff debug help

2020-10-31 Thread Randy Bush
> r0.sea#sh ip bgp rpki table | i 3130
> 147.28.0.0/20    20  3130   0   147.28.0.84/323
> 147.28.0.0/19    19  3130   0   147.28.0.84/323
> 147.28.64.0/19   19  3130   0   147.28.0.84/323
> 147.28.96.0/19   19  3130   0   147.28.0.84/323
> 147.28.128.0/19  19  3130   0   147.28.0.84/323
> 147.28.160.0/19  19  3130   0   147.28.0.84/323
> 147.28.192.0/19  19  3130   0   147.28.0.84/323
> 192.83.230.0/24  24  3130   0   147.28.0.84/323
> 198.180.151.0/25 25  3130   0   147.28.0.84/323  <<<===
> 198.180.151.0/24 24  3130   0   147.28.0.84/323
> 198.180.153.0/24 24  3130   0   147.28.0.84/323

note rov ops: if you do not see that /25 in your router(s), the RP
software you are running can be damaging to your customers and to
others.

randy
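
To make that concrete, a small origin-validation check against the VRPs in the
table above (the evaluation follows the RFC 6811 procedure; the code itself is
only an illustration, not anyone's production validator):

# Why the missing 198.180.151.0/25 VRP matters: with only the covering /24
# ROA (maxlen 24) loaded, the /25 announcement from AS3130 flips from Valid
# to Invalid.
from ipaddress import ip_network

def rov_state(prefix, origin_as, vrps):
    route = ip_network(prefix)
    covering = [(maxlen, asn) for p, maxlen, asn in vrps
                if route.subnet_of(ip_network(p))]
    if not covering:
        return "NotFound"
    if any(asn == origin_as and route.prefixlen <= maxlen
           for maxlen, asn in covering):
        return "Valid"
    return "Invalid"

vrps_full  = [("198.180.151.0/25", 25, 3130), ("198.180.151.0/24", 24, 3130)]
vrps_stale = [("198.180.151.0/24", 24, 3130)]   # an RP that missed the /25 ROA

print(rov_state("198.180.151.0/25", 3130, vrps_full))   # Valid
print(rov_state("198.180.151.0/25", 3130, vrps_stale))  # Invalid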


Re: plea for comcast/sprint handoff debug help

2020-10-31 Thread Tony Tauber
As I've pointed out to Randy and others and I'll share here.
We planned, but hadn't yet upgraded our Routinator RP (Relying Party)
software to the latest v0.8 which I knew had some improvements.
I assumed the problems we were seeing would be fixed by the upgrade.
Indeed, when I pulled down the new SW to a test machine, loaded and ran it,
I could get both Randy's ROAs.
I figured I was good to go.
Then we upgraded the prod machine to the new version and the problem
persisted.
An hour or two of analysis made me realize that the "stickiness" of a
particular PP (Publication Point) is encoded in the cache filesystem.
Routinator seems to build entries in its cache directory under either
rsync, rrdp, or http; the rg.net PPs weren’t showing under rsync, but
moving the cache directory aside and forcing it to rebuild fixed the issue.
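
For anyone wanting to check the same thing on their own instance, a hedged
sketch along those lines (the rrdp/rsync/http layout is taken from the
observation above; exact directory naming may differ per Routinator version,
and the cache path is an assumption):

# Sketch: which transport subdirectories of a Routinator cache mention a
# given publication host. Treat this as a starting point, not a reference.
from pathlib import Path

def transports_seen(cache_dir, host):
    found = []
    for sub in ("rrdp", "rsync", "http"):
        d = Path(cache_dir) / sub
        if d.is_dir() and any(host in entry.name for entry in d.iterdir()):
            found.append(sub)
    return found

# Example (the cache path is an assumption; use your own repository directory):
print(transports_seen("/var/lib/routinator/rpki-cache", "rg.net"))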

A couple of points seem to follow:

   - Randy says: "finding the fort rp to be pretty solid!"  I'll say that
   if you loaded a fresh Fort and fresh Routinator install, they would both
   have your ROAs.
   - The sense of "stickiness" is local only; hence to my mind the
   protection against "downgrade" attack is somewhat illusory. A fresh install
   knows nothing of history.

Tony

On Fri, Oct 30, 2020 at 11:57 PM Randy Bush  wrote:

> > If there is a covering less specific ROA issued by a parent, this will
> > then result in RPKI invalid routes.
>
> i.e. the upstream kills the customer.  not a wise business model.
>
> > The fall-back may help in cases where there is an accidental outage of
> > the RRDP server (for as long as the rsync servers can deal with the
> > load)
>
> folk try different software, try different configurations, realize that
> having their CA gooey exposed because they wanted to serve rrdp and
> block, ...
>
> randy, finding the fort rp to be pretty solid!
>


Re: plea for comcast/sprint handoff debug help

2020-10-30 Thread Randy Bush
> If there is a covering less specific ROA issued by a parent, this will
> then result in RPKI invalid routes.

i.e. the upstream kills the customer.  not a wise business model.

> The fall-back may help in cases where there is an accidental outage of
> the RRDP server (for as long as the rsync servers can deal with the
> load)

folk try different software, try different configurations, realize that
having their CA gooey exposed because they wanted to serve rrdp and
block, ...

randy, finding the fort rp to be pretty solid!


RPKI over RSYNC vs RRDP (Was: plea for comcast/sprint handoff debug help)

2020-10-30 Thread Job Snijders
On Fri, Oct 30, 2020 at 12:47:44PM +0100, Alex Band wrote:
> > On 30 Oct 2020, at 01:10, Randy Bush  wrote:
> > i'll see your blog post and raise you a peer reviewed academic paper
> > and two rfcs :)
> 
> For the readers wondering what is going on here: there is a reason
> there is only a vague mention to two RFCs instead of the specific
> paragraph where it says that Relying Party software must fall back to
> rsync immediately if RRDP is temporarily unavailable. That is because
> this section doesn’t exist.

*skeptical face* Alex, you got it backwards: the section that does not
exist, is to *not* fall back to rsync. But on the other hand, there are
ample RFC sections which outline rsync is the mandatory-to-implement
protocol. Starts at RFC 6481 Section 3: "The publication repository
MUST be available using rsync".

Even the RRDP RFC itself (RFC 8182) describes that RSYNC and RRDP
*co-exist*. I think this co-existence was factored into both the design
of RPKIoverRSYNC and subsequently RPKIoverRRDP. An rsync publication
point does not become invalid because of the demise of a
once-upon-a-time valid RRDP publication point.

Only a few weeks ago a large NIR (IDNIC) disabled their RRDP service
because somehow the RSYNC and RRDP repositories were out-of-sync with
each other. The RRDP service remained disabled for a number of days
until they repaired their RPKI Certificate Authority service.

I suppose that during this time, Routinator was unable to receive any
updates related to the IDNIC CA (pinned to RRDP -> because of a
successful fetch prior to the partial IDNIC RPKI outage). This in turn
deprived the IDNIC subordinate Resource Holders the ability to update
their Route Origin Authorization attestations (from Routinator's
perspective).

Given that RRDP is an *optional* protocol in the RPKI stack, it doesn't
make sense to me to strictly pin fetching operations to RRDP: Over time
(months, years), a CA could enable / disable / enable / disable RRDP
service, while listing the RRDP URI as a valid SIA, amongst other valid
SIAs.

An analogy to DNS: A website operator may add AAAA records to indicate
IPv6 reachability, but over time may also remove the AAAA record if
there (temporarily) is some kind of issue with the IPv6 service. The
Internet operations community of course encourages everyone to add AAAA
records, and Happy Eyeballs was a mechanism that for a long time even
*favored* IPv6 over IPv4 to help improve IPv6 adoption, but a dual-stack
browser will always try to take advantage of the redundancy that exists
through the two address families.

RSYNC and RRDP should be viewed in a similar context as v4 vs v6, but
unlike with IPv4 and IPv6, I am convinced that RSYNC can be deprecated
in the span of 3 or 4 years; the draft-sidrops-bruijnzeels-deprecate-rsync
document is helping towards that goal!

> Be that as it may, operators can rest assured that if consensus goes
> against our logic, we will change our design.

Please change the implementation a little bit (0.8.1). I think it is too
soon for the Internet-wide 'rsync to RRDP' migration project to be
declared complete and successful, and this actually hampers the
transition to RRDP.

Pinning to RRDP *forever* violates the principle-of-least-astonishment
in a world where draft-sidrops-bruijnzeels-deprecate-rsync-00 was
published only as recent as November 2019. That draft now is a working
group document, and it will probably take another 1 or 2 years before it
is published as RFC.

Section 5 of 'draft-deprecate-rsync' says RRDP *SHOULD* be used when it
is available. Thus it logically follows, when it is not available, the
lowest common denominator is to be used: rsync. After all, the Issuing
CA put an RSYNC URI in the 'Subject Information Access' (SIA). Who knows
better than the CA?

The ability to publish routing intentions, and for others to honor the
intentions of the CA is what RPKI is all about. When the CA says
delegated RPKI data is available at both an RSYNC URI and an RRDP URI,
both are valid network entrypoints to the publication point. The
resource holder's X.509 signature even is on those 'reference to there'
directions (URIs)! :-)

If I can make a small suggestion: make 0.8.1 fall back to rsync after
waiting an hour or so (meanwhile polling to see if the RRDP service
restores). This way the network operator takes advantage of both
transport protocols, whichever is available, with a clear preference to
try RRDP first, then eventually rsync.
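
A compact sketch of that time-based variant (durations, names and the fetch
callables are illustrative assumptions, not a proposed implementation):

# Keep trying RRDP, but once it has been failing for roughly an hour, fetch
# via rsync as well so the publication point's updates still come through.
import time

RRDP_GRACE_SECONDS = 3600

def run_once(state, rrdp_fetch, rsync_fetch, now=None):
    now = time.time() if now is None else now
    if rrdp_fetch():
        state["rrdp_last_ok"] = now              # RRDP restored: stay on it
        return "rrdp"
    if now - state.get("rrdp_last_ok", now) > RRDP_GRACE_SECONDS:
        rsync_fetch()                             # outage exceeded the grace period
        return "rsync"
    return "cache"                                # short hiccup: serve cached data

# Example: an RRDP outage that has lasted longer than the one-hour grace period.
state = {"rrdp_last_ok": 0}
print(run_once(state, lambda: False, lambda: True, now=7200))   # -> rsync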

RPKI was designed in such a way that it can be transported even over
printed paper, usb stick, bluetooth, vinyl, rsync, and also https (as
rrdp). Because RPKI data is signed using the X.509 framework, the
transportation method really is irrelevant. IP holders can publish RPKI
data via horse + cart, and still make productive use of it!

Routinator's behavior is not RFC compliant, and has tangible effects in
the default-free zone.

Regards,

Job


Re: plea for comcast/sprint handoff debug help

2020-10-30 Thread Tim Bruijnzeels
Hi Job, all,

> On 30 Oct 2020, at 11:06, Job Snijders  wrote:
> 
> On Thu, Oct 29, 2020 at 09:14:16PM +0100, Alex Band wrote:
>> In fact, we argue that it's actually a bad idea to do so:
>> 
>> https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/
>> 
>> We're interested to hear views on this from both an operational and
>> security perspective.
> 
> I don't see a compelling reason to not use rsync when RRDP is
> unavailable.
> 
> Quoting from the blog post:
> 
>"While this isn’t threatening the integrity of the RPKI – all data
>is cryptographically signed making it really difficult to forge data
>– it is possible to withhold information or replay old data."
> 
> RRDP does not solve the issue of withholding data or replaying old data.
> The RRDP protocol /also/ is unauthenticated, just like rsync. The RRDP
> protocol basically is rsync wrapped in XML over HTTPS.
> 
> Withholding of information is detected through verification of RPKI
> manifests (something Routinator didn't verify up until last week!),
> and replaying of old data is addressed by checking validity dates and
> CRLs (something Routinator also didn't do until last week!).
> 
> Of course I see advantages to this industry mainly using RRDP, but those
> are not security advantages. The big migration towards RRDP can happen
> somewhere in the next few years.


Routinator does TLS verification when it encounters an RRDP repository. If the
repository cannot be reached, or its HTTPS certificate is somehow invalid, it
will use rsync instead. It's only after it has found a *valid* HTTPS connection
that it refuses to fall back.

There is a security angle here.

Malicious-in-the-middle attacks can lead an RP to a bogus HTTPS server and 
force the software to downgrade to rsync, which has no channel security. The 
software can then be given old data (new ROAs can be withheld), or the attacker 
can simply withhold a single object. With the stricter publication point 
completeness validation introduced by RFC6486-bis this will lead to the
rejection of all ROAs published there.

The result is the exact same problem that Randy et al.'s research pointed at. 
If there is a covering less specific ROA issued by a parent, this will then 
result in RPKI invalid routes.

The fall-back may help in cases where there is an accidental outage of the RRDP 
server (for as long as the rsync servers can deal with the load), but it 
increases the attack surface for repositories that keep their RRDP server 
available.

Regards,
Tim



> 
> The arguments brought forward in the blog post don't make sense to me.
> The '150,000' number in the blog post seems a number pulled from thin
> air.
> 
> Regards,
> 
> Job



RE: [SPAM] Re: plea for comcast/sprint handoff debug help

2020-10-30 Thread p.fazio
please remove me from list


 Original Message 
Subject: [SPAM] Re: plea for comcast/sprint handoff debug help
From: Alex Band <a...@nlnetlabs.nl>
Date: Thu, October 29, 2020 2:14 pm
To: Randy Bush <ra...@psg.com>
Cc: North American Network Operators' Group <nanog@nanog.org>


> On 28 Oct 2020, at 16:58, Randy Bush <ra...@psg.com> wrote:
> 
>> tl;dr:
>> 
>> comcast: does your 50.242.151.5 westin router receive the announcement
>> of 147.28.0.0/20 from sprint's westin router 144.232.9.61?
> 
> tl;dr: diagnosed by comcast.  see our short paper to be presented at imc
>   tomorrow https://archive.psg.com/200927.imc-rp.pdf
> 
> lesson: route origin relying party software may cause as much damage as
> 	it ameliorates
> 
> randy

To clarify this for the readers here: there is an ongoing research experiment where connectivity to the RRDP and rsync endpoints of several RPKI publication servers is being purposely enabled and disabled for prolonged periods of time. This is perfectly fine of course.

While the resulting paper presented at IMC is certainly interesting, having relying party software fall back to rsync when RRDP is unavailable is not a requirement specified in any RFC, as the paper seems to suggest. In fact, we argue that it's actually a bad idea to do so:

https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/

We're interested to hear views on this from both an operational and security perspective.

-Alex




Re: plea for comcast/sprint handoff debug help

2020-10-30 Thread Tom Beecher
Alex:

When I follow the RFC rabbit hole :

RFC6481: A Profile for Resource Certificate Repository Structure

> The publication repository MUST be available using rsync
> [RFC5781] [RSYNC].  Support of additional retrieval mechanisms
> is the choice of the repository operator.  The supported
> retrieval mechanisms MUST be consistent with the accessMethod
> element value(s) specified in the SIA of the associated CA or
> EE certificate.
Then:

RFC8182: The RPKI Repository Delta Protocol (RRDP)

> This document allows the use of RRDP as an additional repository
> distribution mechanism for RPKI.  In time, RRDP may replace rsync
> [RSYNC] as the only mandatory-to-implement repository distribution
> mechanism.  However, this transition is outside of the scope of this
> document.
Is it not the case, then, that rsync is currently still mandatory, even if
RRDP is in place?  Or is there a more current RFC that has defined the
transition that I did not locate?

On Fri, Oct 30, 2020 at 7:49 AM Alex Band  wrote:

>
> > On 30 Oct 2020, at 01:10, Randy Bush  wrote:
> >
> > i'll see your blog post and raise you a peer reviewed academic paper and
> > two rfcs :)
>
> For the readers wondering what is going on here: there is a reason there
> is only a vague mention to two RFCs instead of the specific paragraph where
> it says that Relying Party software must fall back to rsync immediately if
> RRDP is temporarily unavailable. That is because this section doesn’t
> exist. The point is that there is no bug and in fact, Routinator has a
> carefully thought out strategy to deal with transient outages. Moreover, we
> argue that our strategy is the better choice, both operationally and from a
> security standpoint.
>
> The paper shows that Routinator is the most used RPKI relying party
> software, and we know many of you here rely on it for route origin
> validation in a production environment. We take this responsibility and
> therefore this matter very seriously, and would not want you to think we
> have been careless in our software design. Quite the opposite.
>
> We have made several attempts within the IETF to have a discussion on
> technical merit, where aspects such as overwhelming an rsync server with
> traffic, or using aggressive fallback to rsync as an entry point to a
> downgrade attack have been brought forward. Our hope was that our arguments
> would be considered on technical merit, but that did not happen yet. Be
> that as it may, operators can rest assured that if consensus goes against
> our logic, we will change our design.
>
> > perhaps go over to your unbound siblings and discuss this analog.
>
> The mention of Unbound DNS resolver in this context is interesting,
> because we have in fact discussed our strategy with the developers on this
> team as there is a lot to be learned from other standards and operational
> experiences.
>
> We feel very strongly about this matter because the claim that using our
> software negatively affects Internet routing robustness strikes at the core
> of NLnet Labs’ existence: our reputation and our mission to work for the
> good of the Internet. They are the core values that make it possible for a
> not-for-profit foundation like ours to make free, liberally licensed open
> source software.
>
> We’re proud of what we’ve been able to achieve and look forward to a
> continued open discussion with the community.
>
> Respectfully,
>
> Alex
>


Re: plea for comcast/sprint handoff debug help

2020-10-30 Thread Alex Band


> On 30 Oct 2020, at 01:10, Randy Bush  wrote:
> 
> i'll see your blog post and raise you a peer reviewed academic paper and
> two rfcs :)

For the readers wondering what is going on here: there is a reason there is
only a vague mention of two RFCs instead of the specific paragraph where it
says that Relying Party software must fall back to rsync immediately if RRDP is 
temporarily unavailable. That is because this section doesn’t exist. The point 
is that there is no bug and in fact, Routinator has a carefully thought out 
strategy to deal with transient outages. Moreover, we argue that our strategy 
is the better choice, both operationally and from a security standpoint.

The paper shows that Routinator is the most used RPKI relying party software, 
and we know many of you here rely on it for route origin validation in a 
production environment. We take this responsibility and therefore this matter 
very seriously, and would not want you to think we have been careless in our 
software design. Quite the opposite.

We have made several attempts within the IETF to have a discussion on technical 
merit, where aspects such as overwhelming an rsync server with traffic, or 
using aggressive fallback to rsync as an entry point to a downgrade attack have 
been brought forward. Our hope was that our arguments would be considered on 
technical merit, but that did not happen yet. Be that as it may, operators can 
rest assured that if consensus goes against our logic, we will change our 
design.

> perhaps go over to your unbound siblings and discuss this analog.

The mention of Unbound DNS resolver in this context is interesting, because we 
have in fact discussed our strategy with the developers on this team as there 
is a lot to be learned from other standards and operational experiences. 

We feel very strongly about this matter because the claim that using our 
software negatively affects Internet routing robustness strikes at the core of 
NLnet Labs’ existence: our reputation and our mission to work for the good of 
the Internet. They are the core values that make it possible for a 
not-for-profit foundation like ours to make free, liberally licensed open 
source software. 

We’re proud of what we’ve been able to achieve and look forward to a continued 
open discussion with the community.

Respectfully,

Alex


Re: plea for comcast/sprint handoff debug help

2020-10-30 Thread Job Snijders
On Thu, Oct 29, 2020 at 09:14:16PM +0100, Alex Band wrote:
> In fact, we argue that it's actually a bad idea to do so:
> 
> https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/
>
> We're interested to hear views on this from both an operational and
> security perspective.

I don't see a compelling reason to not use rsync when RRDP is
unavailable.

Quoting from the blog post:

"While this isn’t threatening the integrity of the RPKI – all data
is cryptographically signed making it really difficult to forge data
– it is possible to withhold information or replay old data."

RRDP does not solve the issue of withholding data or replaying old data.
The RRDP protocol /also/ is unauthenticated, just like rsync. The RRDP
protocol basically is rsync wrapped in XML over HTTPS.

Withholding of information is detected through verification of RPKI
manifests (something Routinator didn't verify up until last week!),
and replaying of old data is addressed by checking validity dates and
CRLs (something Routinator also didn't do until last week!).

Of course I see advantages to this industry mainly using RRDP, but those
are not security advantages. The big migration towards RRDP can happen
somewhere in the next few years.

The arguments brought forward in the blog post don't make sense to me.
The '150,000' number in the blog post seems a number pulled from thin
air.

Regards,

Job


Re: plea for comcast/sprint handoff debug help

2020-10-29 Thread Randy Bush
i'll see your blog post and raise you a peer reviewed academic paper and
two rfcs :)

in dnssec, we want to move from the old mandatory to implement (mti) rsa
signatures to the more modern ecdsa.

how would the world work out if i fielded a validating dns cache server
which *implemented* rsa, because it is mti, but chose not to actually
*use* it for validation on odd numbered wednesdays because of my
religious belief that ecdsa is superior?

perhaps go over to your unbound siblings and discuss this analog.

but thanks for your help in getting jtk's imc paper accepted. :)

randy


Re: plea for comcast/sprint handoff debug help

2020-10-29 Thread Randy Bush
>>> tl;dr:
>>> 
>>> comcast: does your 50.242.151.5 westin router receive the announcement
>>> of 147.28.0.0/20 from sprint's westin router 144.232.9.61?
>> 
>> tl;dr: diagnosed by comcast.  see our short paper to be presented at imc
>>   tomorrow https://archive.psg.com/200927.imc-rp.pdf
>> 
>> lesson: route origin relying party software may cause as much damage as
>>  it ameliorates
>> 
>> randy
> 
> To clarify this for the readers here: there is an ongoing research
> experiment where connectivity to the RRDP and rsync endpoints of
> several RPKI publication servers is being purposely enabled and
> disabled for prolonged periods of time. This is perfectly fine of
> course.
> 
> While the resulting paper presented at IMC is certainly interesting,
> having relying party software fall back to rsync when RRDP is
> unavailable is not a requirement specified in any RFC, as the paper
> seems to suggest. In fact, we argue that it's actually a bad idea to
> do so:
> 
> https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/
> 
> We're interested to hear views on this from both an operational and
> security perspective.

in fact,  has found your bug.  if you find an http
server, but it is not serving the new and not-required rrdp protocol, it
does not then use the mandatory to implement rsync.

randy


Re: plea for comcast/sprint handoff debug help

2020-10-29 Thread Alex Band


> On 28 Oct 2020, at 16:58, Randy Bush  wrote:
> 
>> tl;dr:
>> 
>> comcast: does your 50.242.151.5 westin router receive the announcement
>> of 147.28.0.0/20 from sprint's westin router 144.232.9.61?
> 
> tl;dr: diagnosed by comcast.  see our short paper to be presented at imc
>   tomorrow https://archive.psg.com/200927.imc-rp.pdf
> 
> lesson: route origin relying party software may cause as much damage as
>   it ameliorates
> 
> randy

To clarify this for the readers here: there is an ongoing research experiment 
where connectivity to the RRDP and rsync endpoints of several RPKI publication 
servers is being purposely enabled and disabled for prolonged periods of time. 
This is perfectly fine of course.

While the resulting paper presented at IMC is certainly interesting, having 
relying party software fall back to rsync when RRDP is unavailable is not a 
requirement specified in any RFC, as the paper seems to suggest. In fact, we 
argue that it's actually a bad idea to do so:

https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/

We're interested to hear views on this from both an operational and security 
perspective.

-Alex

Re: plea for comcast/sprint handoff debug help

2020-10-28 Thread Lukas Tribus
Hello,

On Wed, 28 Oct 2020 at 16:58, Randy Bush  wrote:
> tl;dr: diagnosed by comcast.  see our short paper to be presented at imc
>tomorrow https://archive.psg.com/200927.imc-rp.pdf
>
> lesson: route origin relying party software may cause as much damage as
> it ameliorates

There is a myth that ROV is inherently fail-safe (it isn't if your
production routers have stale VRPs), which leads to the assumption
that proper monitoring can be neglected.

I'm working on a shell script using rtrdump to detect stale RTR
servers (based on serial changes and the actual data). Of course this
would never detect partial failures that affect only some child-CAs,
but it does detect a hung RTR server (or a standalone RTR server where
the validator validates no more).


lukas
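
For readers who want to monitor the same thing, a hedged sketch of that kind of
check (it assumes successive VRP dumps written to JSON files, e.g. by a tool
such as rtrdump; the dump format, file names, and alerting threshold are
assumptions to adapt):

# Compare two successive VRP dumps and warn if nothing changed. In practice
# one would also track the RTR serial and only alarm after several unchanged
# runs, as described above.
import json

def load_vrps(path):
    # Assumed dump format: a JSON list of {"prefix": ..., "maxLength": ..., "asn": ...}
    with open(path) as f:
        return {(v["prefix"], v["maxLength"], v["asn"]) for v in json.load(f)}

def check(previous_dump, current_dump):
    prev, cur = load_vrps(previous_dump), load_vrps(current_dump)
    if prev == cur:
        print("WARNING: VRP set unchanged since the last run; "
              "the RTR server (or the validator behind it) may be stale")
    else:
        print(f"OK: {len(cur - prev)} VRPs added, {len(prev - cur)} withdrawn")

if __name__ == "__main__":
    # e.g. run from cron after each dump
    check("vrps-previous.json", "vrps-current.json")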


Re: plea for comcast/sprint handoff debug help

2020-10-28 Thread Randy Bush
> tl;dr:
> 
> comcast: does your 50.242.151.5 westin router receive the announcement
> of 147.28.0.0/20 from sprint's westin router 144.232.9.61?

tl;dr: diagnosed by comcast.  see our short paper to be presented at imc
   tomorrow https://archive.psg.com/200927.imc-rp.pdf

lesson: route origin relying party software may cause as much damage as
it ameliorates

randy


plea for comcast/sprint handoff debug help

2020-10-28 Thread Randy Bush
tl;dr:

comcast: does your 50.242.151.5 westin router receive the announcement
of 147.28.0.0/20 from sprint's westin router 144.232.9.61?

details:

3130 in the westin announces
  147.28.0.0/19 and
  147.28.0.0/20
to sprint, ntt, and the six
and we want to remove the /19

when we stop announcing the /19, a traceroute to comcast through sprint
dies at the handoff from sprint to comcast.

r0.sea#traceroute 73.47.196.134 source 147.28.7.1
Type escape sequence to abort.
Tracing the route to c-73-47-196-134.hsd1.ma.comcast.net (73.47.196.134)
VRF info: (vrf in name/id, vrf out name/id)
  1 r1.sea.rg.net (147.28.0.5) 0 msec 1 msec 0 msec
  2 sl-mpe50-sea-ge-0-0-3-0.sprintlink.net (144.232.9.61) [AS 1239] 1 msec 1 
msec 0 msec
  3  *  *  * 
  4  *  *  * 
  5  *  *  * 
  6  *  *  *

this would 'normally' (i.e. when the /19 is announced) be

r0.sea#traceroute 73.47.196.134 source 147.28.7.1
Type escape sequence to abort.
Tracing the route to c-73-47-196-134.hsd1.ma.comcast.net (73.47.196.134)
VRF info: (vrf in name/id, vrf out name/id)
  1 r1.sea.rg.net (147.28.0.5) 0 msec 1 msec 0 msec
  2 sl-mpe50-sea-ge-0-0-3-0.sprintlink.net (144.232.9.61) [AS 1239] 1 msec 0 
msec 1 msec
  3 be-207-pe02.seattle.wa.ibone.comcast.net (50.242.151.5) [AS 7922] 1 msec 0 
msec 0 msec
  4 be-10847-cr01.seattle.wa.ibone.comcast.net (68.86.86.225) [AS 7922] 1 msec 
1 msec 2 msec
  etc
  
specifically, when 147.28.0.0/19 is announced, traceroute from
147.28.7.2 through sprint works to comcast.  withdraw 147.28.0.0/19,
leaving only 147.28.0.0/20, and the traceroute enters sprint but fails
at the handoff to comcast.  Bad next-hop?  not propagated?  covid?
magic?

which is why we wonder what comcast (50.242.151.5) hears from sprint at
that handoff

note that, at the minute, both the /19 and the /20 are being announced,
as we want things to work.  so you will not be able to reproduce.

so, comcast, are you receiving the announcement of the /20 from sprint?
with a good next-hop?

randy