Hi Toerless,

with respect to routing in general, there is a trade-off between the
timeliness of routing information and rate limiting. Assume you use
a TCP-like congestion control for routing protocols: if your sender
is throttled heavily, routing information probably propagates too
slowly and lags seriously behind, so it is already outdated when
received.

I think one has to distinguish where the bottleneck is:
1) link bandwidth (link congestion)
2) routing message processing (CPU congestion)

In case 1), routing messages will be dropped. This may lead to
retransmissions if the routing protocol needs reliable message
delivery, and to a reduced sending rate if congestion control
is in place.
In case 2), queues build up inside the router, which also causes
serious delay and the potential problem of obsolete routing
information. So the whole routing system could become
unstable in extreme cases.

A number of typical pitfalls are summarized in this presentation:
https://datatracker.ietf.org/meeting/102/materials/slides-102-rtgarea-those-who-do-not-learn-history-are-doomed-to-repeat-it-00

This presentation also mentions the widely known practice
to prioritize routing control messages over data plane traffic
so that the routing control traffic is not adversely affected
by congestion in the data plane. Moreover, it also mentions
the oscillation effects that happened with delay-based routing
as well as the OSPF Flooding issue in denser topologies.

I do not have the operational experience that others
may have, but my guess is that in practice CPU congestion
occurs more often than link congestion (solely caused by
control plane packets). While I believe that TCP congestion control
may help to fix short-term congestion situations,
it is not a solution for persistent link congestion; I think that
such a system may not be able to function correctly.

So there are typically dampening mechanisms in place to aggregate
routing information or to wait before announcing certain
route updates.
When using TCP, the CPU congestion problem would cause
flow control to kick in and automatically throttle
the sender to the receiver's processing speed.
However, if the generation rate of routing messages is
permanently too high, the system will not be stable.
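
To illustrate the last point, here is a minimal Python sketch of a
sender whose messages are held back by a receive window. The model
(discrete ticks, fixed rates) and all names such as simulate,
gen_rate, proc_rate and window are assumptions for illustration only,
not part of TCP or of any routing protocol. It shows that flow control
keeps the receiver's queue bounded, but the backlog at the sender
grows without limit once the generation rate permanently exceeds the
processing rate.

```python
def simulate(gen_rate, proc_rate, window, ticks):
    """Toy model of a flow-controlled routing session (hypothetical).

    gen_rate  : messages the sender generates per tick
    proc_rate : messages the receiver can process per tick
    window    : max messages the receiver is willing to queue
    Returns a list of (sender_backlog, receiver_queue) per tick.
    """
    sender_backlog = 0   # generated but not yet sent
    receiver_queue = 0   # sent but not yet processed
    history = []
    for _ in range(ticks):
        sender_backlog += gen_rate
        # Flow control: send only what fits into the receiver's window.
        sendable = min(sender_backlog, window - receiver_queue)
        sender_backlog -= sendable
        receiver_queue += sendable
        # The receiver processes at its own speed.
        receiver_queue -= min(receiver_queue, proc_rate)
        history.append((sender_backlog, receiver_queue))
    return history


# Generation rate below processing rate: nothing piles up.
stable = simulate(gen_rate=5, proc_rate=10, window=20, ticks=100)

# Generation rate permanently above processing rate: the receiver stays
# within its window, but the sender's backlog keeps growing.
overloaded = simulate(gen_rate=15, proc_rate=10, window=20, ticks=100)

print("stable, last tick:    ", stable[-1])
print("overloaded, last tick:", overloaded[-1])
```

In the overloaded run the information waiting at the sender gets older
with every tick, which is the outdated-routing-information problem
described above.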

Regards,
 Roland

On 02.12.22 at 18:03 Toerless Eckert wrote:
Dear routing-discussion / TSV folks
(sorry for escalating this, but it really bugs me - Cc'ing PIM/BIER)

What are the expectations these days for, let's say, a full Internet Standard
routing protocol in terms of congestion-safe behavior ? And
what are the congestion control expectations for new routing protocol RFCs even if
just proposed standard ?

I am asking, because i think that our core IP multicast routing protocol
fails miserably on this end, and quite frankly i do not understand how
PIM-SM (RFC7761) could have become a full Internet standard given how it
has zilch discussion about congestion or loss handling.

[ Especially when compared with a protocol like RFC7450, where TSV did raise
   concerns about multicast data plane congestion awareness, and it was held
   up for years, and GregS as the WG-chair for the WG responsible for RFC7450
   even had to help co-author RFC8085 to cut through the congestion control
   concern-cord. But likely all for the better!].

To quickly summarize the issue with PIM-SM to those who do not know it:

                  /- R2 -------- R6 -\
      Rcvrs ... R1                    R7 ... Senders
                  \- R3 -- R4 -- R5 -/

         CE ... PE .. P    P     P    PE  CE ...

R1 has let's say 100,000 multicast/PIM (S,G) states with sources behind R7, so
it has to maintain 100,000 so-called PIM (S,G) joins across the path R2, R6, R7.
Let's say an (S,G) join for IPv6 is roughly 38 bytes, maybe 35 (S,G)
per 1500 byte packet, so 2857 packets of 1500 bytes to carry all 100,000 (S,G).
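
As a rough check of this arithmetic, the following Python sketch redoes
the calculation. The constant names are made up for illustration; the
values are the ones given above. Note that 2857 is the number of
completely filled packets; strictly, one more partly filled packet is
needed for the remaining entries.

```python
import math

SG_COUNT = 100_000      # (S,G) states on R1, from the text above
SG_ENTRY_BYTES = 38     # rough size of one IPv6 (S,G) join, from the text
SG_PER_PACKET = 35      # (S,G) entries assumed per 1500 byte packet
PACKET_BYTES = 1500

# Upper bound on entries per packet if the packet held nothing but
# (S,G) entries; the text assumes 35 to leave room for headers.
max_entries = PACKET_BYTES // SG_ENTRY_BYTES

full_packets = SG_COUNT // SG_PER_PACKET            # completely filled
leftover_entries = SG_COUNT % SG_PER_PACKET         # entries in a last packet
packets_needed = math.ceil(SG_COUNT / SG_PER_PACKET)

print("max entries per packet:", max_entries)
print("full packets:", full_packets)
print("leftover entries:", leftover_entries)
print("packets needed:", packets_needed)
```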

Assume link R6/R7 fails, IGP reconverges, R1 recognizes that it needs to
change path, so it sends 2857 PIM-SM packets with prunes to R2 and 2857 PIM-SM
packets with joins to R3.

Assume R1 is a PE, R2 and R3 are P routers in an SP, and actually R2/R3 connect
to let's say 100 routers like R1. Now R2 and R3 get 100 x 2857 1500 byte packets.

And there is nothing in the PIM-SM spec that talks about how to throttle this
heap of PIM-SM packets. Typically, routers would just send them back-to-back.
And those packets repeat every 60 seconds given how PIM-SM is datagram /
periodic soft-state.  In fact, if you try to scale this in production
networks, you will most likely fail a lot more than IP multicast in those
routers, because PIM not only will badly compete on control-plane CPU time,
but even more so on control-plane to hardware-forwarding time when updating
the 100,000 (S,G) hardware forwarding entries.
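
To put the numbers above into perspective, this small Python sketch
computes the load arriving at one P router such as R3. It assumes the
figures from the text (100 PE routers, 2857 packets of 1500 bytes each,
repeated every 60 seconds); the variable names are illustrative only.
The average rate hides the real problem, because the packets are
typically sent back-to-back as a burst rather than spread evenly over
the interval.

```python
PE_ROUTERS = 100          # routers like R1 attached to one P router
PACKETS_PER_PE = 2857     # PIM-SM packets per PE, from the text
PACKET_BYTES = 1500
REFRESH_SECONDS = 60      # periodic soft-state refresh interval

total_packets = PE_ROUTERS * PACKETS_PER_PE
total_bytes = total_packets * PACKET_BYTES

# Average rate if the refresh traffic were spread evenly over the
# interval (it usually is not).
avg_mbit_per_s = total_bytes * 8 / REFRESH_SECONDS / 1e6

print("packets per refresh:", total_packets)
print("bytes per refresh:", total_bytes)
print("average Mbit/s:", round(avg_mbit_per_s, 2))
```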

Correct me if i am wrong, but did the same type of issues in ISIS/OSPF in
DC because of so many parallel paths and hence duplication of LSA recently
lead to the creation of multiple IETF working groups in RTG to solve these
issues ?

In IP multicast, we were well aware of these issues and they were a core
reason to not build a PIM-based MPLS multicast protocol, but use the TCP based
LDP to specify mLDP (RFC6388). Same thing when various BGP multicast work was
done as an alternative to PIM for SPs (BGP also being TCP based).

We did even fix this problem in PIM by specifying RFC6559 (PIM over TCP),
but instead of making that mechanism mandatory and the only option
for PIM when moving PIM up the IETF standards ladder to RFC7761, that
RFC had seemingly fallen into ignorance in the IP Multicast community,
because most IP multicast deployments are small enough that these issues
do not occur.

So, why do i escalate this issue now ?

We have a great new multicast architecture called BIER that eliminates
all these PIM multicast state issues from the P routers of such large
service provider networks by being stateless. But it still leaves the
need for overlay signaling, such as with PIM, to operate between the
PEs, such as in the above picture the hundreds if not thousands
of receiver PEs R1' and sender PEs R7'. In which case you would have
PIM directly between those R1'/R7' across multihop paths, leading
to even more congestion considerations. And in support of such BIER networks,
there is a draft, draft-hb-pim-light, proposed to PIM-WG to optimize PIM
explicitly for this type of deployment. And when i said in PIM@IETF115 that
such a draft IMHO should only be allowed to proceed when it is written to say
it MUST be based on PIM over TCP (RFC6559), all other people responding
on the thread said at best it could be a MAY. Aka: Congestion control
optional.

Am i a congestion control extremist ? I really only want to have
scalable, reliable multicast RFCs, especially when they aspire and
go to full IETF standard and are meant to support our next-gen IP Multicast
architectures (BIER). I do fully understand how there is a lot
of cost pressure on vendor development, and having procrastinated
to implement, proliferate and deploy PIM over TCP so far (almost a decade!)
does make this a less attractive choice short term. And the whole purpose
of the PIM light draft of course is to reduce the amount of development needed
by making PIM more "light" (which is a good thing). But when it
carries forward the problems of PIM to another generation of networks
(using BIER) that was especially built to scale better, then one
should IMHO really become worried. At least i do. But i also struggled to
implement datagram PIM processing for 100,000 states in a prior life
and then pushed for PIM over TCP...

Thanks!
     Toerless

_______________________________________________
routing-discussion mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/routing-discussion
