Hi Robert,
If this mechanism existed, I believe I would use it. In general, I would use anything that forces the device to apply back-pressure internally and hide the churn behind state compression. Others might prefer the full real-time feed, if the device supports it.

I don't think we will add anything this low-level to this document, but we might consider adding it to the BMP YANG model if we have an implementation we could use as an example. For instance, we included the initial timer from Cisco devices that delays BMP while waiting for convergence.

Thanks,
Camilo

From: Robert Raszuk <[email protected]>
Date: Monday, 24 March 2025 at 15:30
To: Camilo Cardona <[email protected]>
Cc: Jeffrey Haas <[email protected]>, <[email protected]>, <[email protected]>
Subject: Re: [GROW] Re: bmp path marking churn

Hi Camilo,

I was curious: if we provided only a discrete churn color/marker, would that not address your overall need?

Example: let's consider 5 default markings (it could be more if needed; this is just to describe the idea). The defaults are for illustration purposes only and could be overridden by local configuration:

green - fewer than 10 transitions per minute
yellow - 10-100 transitions per minute
orange - 100-500 transitions per minute
red - 500-1000 transitions per minute
black - more than 1000 transitions per minute

Moreover, BMP may be configured to advertise only some of the above. Then, while advertising the colors, we could also introduce an optional counter of transitions between colors, so that we don't need to advertise each transition in a separate message.

Would that address the overall need without generating excessive BMP churn :) ?

Cheers,
Robert

On Mon, Mar 24, 2025 at 8:29 PM Camilo Cardona <[email protected]> wrote:

Hello Jeff,

I'm sad you couldn't sleep on your way back, but happy you rewarded us with this analysis. Thank you.
First, it seems to me that the extra churn caused by enabling this TLV depends on (1) the BMP feed's RIB source, and (2) whether reason codes are enabled. Do we agree that with reason codes off, churn would be lower? Also, that sourcing from adj-rib-in-pre, for instance, would result in less churn than sourcing from loc-rib?

My questions aim to stress that the proposed marking mechanism creates scenario-dependent churn, worse in some cases than others. We could add text describing the churn to the document, but the end goal of the document is still to standardize the TLV (where to mark), rather than to analyze every situation that stems from it.

Regarding your question and the scenario you describe, I personally care more about the final state than about transient churn. It would therefore be nice to use as many tricks as we can to hide the churn behind state compression, but I accept that some churn is unavoidable and that we will have to handle it on the receiving end.

You suggest delayed marking to reduce (not avoid) churn. Is this something we could propose be configurable (something like a timer?), or would it be too implementation-dependent to generalize?

Thanks,
Camilo

On 23/3/25, 10:09, "Jeffrey Haas" <[email protected]> wrote:

Camilo,

On Fri, Mar 21, 2025 at 08:19:57PM -0500, Camilo Cardona wrote:
> Yes, and not only for the backup paths; we also have options for marking
> non-selected paths, and their churn might be even worse.

I should have been more precise. When I typed "backup", it would have been better to say "non-active paths".

> We know that this might be complicated for the devices. Section 3 explains
> that the reason code should be optional, and devices should provide options
> to enable or disable the reason code. Do you think there are other
> implementation guidelines we could consider that would facilitate this on
> the devices?
In addition to looking at the draft again as part of inbox cleanup, I spent a few moments of under-slept time while traveling home from IETF-122 pondering what my employer's implementation would need to do to support the draft. I suspect many of my conclusions would apply to other implementations.

Consider the case where the system is in its initial route-learning state. For simplicity, we have N (N >= 2) feeds of the Internet. During route learning, it's possible for us to learn a route and cause the prior active route to lose best-path status each time, resulting in N-1 changes. If we had a fast enough loc-rib feed, it would be possible to see this churn in many circumstances, depending on the route properties. In practice, this churn doesn't make it all the way to a monitoring station due to state compression. Our implementation prioritizes advertisement of loc-rib after rib-in, which further helps suppress the churn.

With path marking on rib-in for the scenario above, we not only have to report the newly learned route, but also eventually enqueue the route that just lost best-path status to carry the path-marking TLV. We are lucky that in most circumstances, once a route has lost best-path status, the reason why is likely to be fairly consistent. This means that for the above worst case, we're having to advertise O(2*N) rib-in messages rather than O(N) during learning.

There's also the matter that if we were engaging in active path marking during route learning, and actively passing the reason for the churn to the feed, we might delay end-of-rib status for the learned routes. That's perhaps problematic.

So, what if the marking happened somewhat later, to avoid churning the system? What could this look like? One answer would be that the inactive paths have their rib-in entries re-queued for BMP advertisement with the path-marking TLV. If N-1 paths are re-sent for the entire rib-in, that's still substantial traffic.
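As a rough sketch of the message counting above, here is a toy Python model (not any implementation's actual queueing behavior) that assumes every newly learned path wins best-path selection and demotes the previous best:

```python
# Toy model: rib-in monitoring messages while learning one prefix from
# n_feeds peers, assuming each new path becomes best and the prior best
# must be re-advertised with a path-marking TLV.

def rib_in_messages(n_feeds: int, path_marking: bool) -> int:
    messages = 0
    for i in range(n_feeds):
        messages += 1  # monitoring message for the newly learned route
        if path_marking and i > 0:
            # previous best path re-sent carrying the path-marking TLV
            messages += 1
    return messages

# Without marking: N messages. With marking: 2*N - 1.
```

For N = 8 feeds this gives 8 messages without marking and 15 with it, which is the O(2*N) versus O(N) behavior described above.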
An interesting related question is what timestamp you should use in the per-peer header. RFC 7854 suggests it's the route-learning time. As we discussed during GROW at IETF 122, one's "faith" in timestamp accuracy may be low, but it is perhaps mostly good enough for rib-in in most implementations. Very likely we'd want to use the original learning timestamp in such a "path-marking status only update". That could permit receiving stations to avoid treating this metadata update as actual path churn.

For the hypothetical situation above, the active path may churn much less than N-1 times. So perhaps this doesn't appear as problematic as it could?

A second scenario for consideration is a change of policy, or even of IGP cost due to IGP churn. For a change in policy affecting the rib-in (import policy), not only must the rib-in-post view for the peer whose policy changed be updated, but also the other rib-in views, to reflect a potentially new reason the route is now inactive. IGP cost isn't previously reflected in the rib-in-post view; however, such a change may now directly manifest in the status changes.

-----

Overall, for a few simple situations, it seems like the overall churn in BMP is significantly higher. I'm not going to press the point that it is impossible to reflect these things in the protocol. What I'm curious about is whether users of BMP find this churn problematic or not. Do the authors have implementations of this feature whose observations they can share about these potential extension consequences and how they've mitigated them?

-- Jeff

_______________________________________________
GROW mailing list -- [email protected]
To unsubscribe send an email to [email protected]
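As an illustration of the threshold-based markers Robert proposes earlier in the thread, here is a minimal Python sketch. The thresholds are his example defaults (presumed locally configurable), and names such as `MarkerState` are invented for illustration, not a real BMP API:

```python
# Sketch of the example churn markers: map observed transition rates to a
# marker, and count marker transitions so they could be advertised as a
# batched counter instead of one BMP message per transition.

def marker(rate: int) -> str:
    """Map observed path transitions per minute to a churn marker."""
    if rate < 10:
        return "green"
    if rate <= 100:
        return "yellow"
    if rate <= 500:
        return "orange"
    if rate <= 1000:
        return "red"
    return "black"

class MarkerState:
    """Tracks the current marker and counts marker transitions
    (hypothetical helper, not part of any implementation)."""

    def __init__(self) -> None:
        self.current = "green"
        self.transitions = 0

    def observe(self, rate: int) -> bool:
        """Return True when the marker changed (a candidate for advertisement)."""
        new = marker(rate)
        if new == self.current:
            return False
        self.current = new
        self.transitions += 1
        return True
```

A monitoring sender could then report only `current` and `transitions` at intervals, rather than emitting a message for every marker change.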
