On 8/18/23 22:40, Matthew Petach wrote:
Hi Robert,
Without naming any names, I will note that at some point in the
not-too-distant past, I was part of a new-years-eve-holiday-escalation
to $BACKBONE_ROUTER_PROVIDER when the global network I was involved
with started seeing excessive convergence times (greater than one hour
from BGP update message received to FIB being updated).
After tracking down a development engineer from $RTR_PROVIDER on the New
Year's Eve holiday, it was determined that the problem lay in
assumptions made about how communities were stored in memory. Think
hashed buckets, with linked lists within each bucket. If the
communities all happened to hash to the same bucket, the linked list
in that bucket became extremely long; and if every prefix coming in,
say from multiple sessions with a major transit provider, happened to
be adding one more community to the very long linked list in that one
hash bucket, well, it ended up slowing down the processing to the
point where updates to the FIB were still trickling in an hour after
the BGP neighbor had finished sending updates across.
A new hash function was developed on New Year's day, and a new version
of code was built for us to deploy under relatively painful
circumstances.
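
To illustrate the failure mode described above, here is a toy Python sketch. It is not the vendor's code: the table layout, the names (ChainedTable, weak_hash, better_hash) and the hash functions are all assumptions made up for the example. It only shows how a chained hash table degrades when every community lands in the same bucket, and how a hash that spreads the keys avoids the long walk.

```python
# Toy model only: the table, hash functions and names are hypothetical,
# not taken from any router implementation.

class ChainedTable:
    """Hash table with separate chaining that counts key comparisons."""

    def __init__(self, num_buckets, hash_fn):
        self.buckets = [[] for _ in range(num_buckets)]
        self.hash_fn = hash_fn
        self.comparisons = 0

    def intern(self, community):
        """Return the stored copy of a community, adding it if absent."""
        chain = self.buckets[self.hash_fn(community) % len(self.buckets)]
        for existing in chain:  # linear walk of the bucket's chain
            self.comparisons += 1
            if existing == community:
                return existing
        chain.append(community)
        return community


def weak_hash(community):
    # Assumed flaw: only the ASN half is hashed, so every community
    # from the same ASN collides into one bucket.
    asn, _value = community
    return asn


def better_hash(community):
    # Mixes both halves so consecutive values spread across buckets.
    asn, value = community
    return asn * 31 + value


def count_comparisons(hash_fn, communities, num_buckets=1024):
    table = ChainedTable(num_buckets, hash_fn)
    for community in communities:
        table.intern(community)
    return table.comparisons


# 2000 distinct communities, all tagged by the same (made-up) ASN.
communities = [(65000, value) for value in range(2000)]

weak_total = count_comparisons(weak_hash, communities)
better_total = count_comparisons(better_hash, communities)

print("comparisons with weak hash:  ", weak_total)
print("comparisons with better hash:", better_total)
```

With the weak hash, each insert walks the whole chain built so far, so the work grows quadratically with the number of communities; with the better hash, the chains stay at one or two entries.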
It's easy to say "Considering that we are talking about control
plane memory I think the cost/space associated with storing
communities is less than negligible these days."
The reality is very different, because it's not just about efficiently
*storing* communities, it's really about efficiently *parsing and
updating* communities--and the choices made there absolutely *DO*
"contribute to longer protocol convergences in any measurable way."
Matt
(the names have been obscured to increase my chances of being hireable
in the industry again at some future date. ;)
To be fair, you are talking about an arbitrary value of years back, on
boxes you don't name running code you won't mention.
This is really not saying much :-).
Corner cases, while valid, do not speak to the majority. If this were a
major issue, there would have been more noise about it by now.
There has been quite some noise about lengthy AS_PATH updates that bring
some routers down, which has usually been fixed with improved BGP code.
But even those are not too common, if one considers a 365-day period.
Mark.