Hi Reshad,

Happy new year!

Thanks a lot for your review comments. Please see some replies inline with [Jie]:

From: Reshad Rahman <[email protected]>
Sent: Saturday, January 3, 2026 2:51 AM
To: RTGWG <[email protected]>
Subject: [rtgwg] draft-dong-fantel-problem-statement

Hi,

I took a look at the doc and here are some comments/questions. FYI I haven't caught up to all discussions/threads...

Some high level comments first:

- I support this work!

[Jie] Thanks for your support!

- I assume congestion avoidance is not part of this effort? i.e. is upstream handling of the fast notifications (e.g. PFC, DCQCN etc) outside of the scope of this document?

[Jie] Yes. Although there is one sub-section describing the possible actions in response to fast notifications, the detailed action mechanisms are considered out of scope.

Section 3
---------

    What is needed is a lightweight signaling method that can provide
    real-time alerts (e.g., at the level of sub-milliseconds or
    milliseconds) on failures, congestion, or threshold breaches,
    enabling immediate actions (e.g., in ms to 10s ms ranges) in the
    network layer.

I think "10s" could be misinterpreted as ten seconds (instead of tens of ms), so spell it out, e.g. "in milliseconds to tens of milliseconds".

[Jie] Thanks for catching this, we will rephrase the text to avoid confusion.

Section 4
---------

    Therefore, this draft focuses on mechanisms capable of operating
    within these millisecond/sub-millisecond ranges, rather than
    mechanisms whose latency spans tens or hundreds of milliseconds,
    which are insufficient for preventing transient overload under
    rapid traffic transitions.

Here it says that tens of milliseconds is not good enough, so that contradicts what is in section 3?

[Jie] Basically, what we want to say is that the notification needs to be on the order of sub-milliseconds or milliseconds, so that the action can be finished in milliseconds to tens of milliseconds. We will update the text in sections 3 and 4 to make this clearer.

Section 4.1
-----------

I believe Fig 1 is not the best representation of the problem space since the local failure should be detected quickly? Consider adding a node upstream of the failure location; that new node would need fast notification of the failure.

[Jie] Thanks for your suggestion. I agree this figure can be modified to better reflect the problem space.

Section 4.1.1
-------------

    * BFD [RFC5880]: Provides fast forwarding path failure detection.
      It can be used for both link and path failure detection, while
      it cannot be used to detect link or path congestion, nor can it
      notify the failure or congestion to other nodes in the network.

For "other nodes", clarify that it's nodes other than the BFD endpoints?

[Jie] Yes, we will clarify this in the next revision.

    BFD is preconfigured with periodic message exchange, while fast
    notifications needs to be event-driven. When the transmit interval
    is set to a small value (e.g., at the level of ms), frequent BFD
    message exchange may become a burden to some systems.

Some platforms can't do BFD at "low ms" interval but I'm assuming that the platforms of interest here would have no issue supporting that?

[Jie] We plan to remove the text about the possible burden of running BFD at ms intervals, as this only affects the failure detection time, which is not the focus of this document.

    * FRR [RFC4090][RFC5714]/Route convergence: Without fast
      notification, the failure detection can take tens of
      milliseconds, followed by either local repair (FRR) or route
      convergence. The former lacks of global network situation thus
      may cause congestion on the backup paths, while the latter may
      breach strict synchronization deadlines.

Local repair (FRR) is local in that the fast reroute occurs locally (at the point of failure detection). But planning for the backup paths is not necessarily just a local matter, i.e. the local node has the global view?

[Jie] The point of local repair may have a view of the topology, but it may not have the congestion/failure information for other paths.

Section 4.1.2
-------------

    * Action-Oriented Response: Upon receiving the notification,
      routing and load balancing mechanisms could instantly shift
      traffic to backup paths or alternative DC interconnects.

That could also cause more congestion elsewhere in the network? If multiple nodes get the fast notification and they all decide (at around the same time) to use e.g. some alternate paths with highest cost, those paths may also be congested? Does that mean that rerouting needs to be triggered from a centralized entity (but that means extra delay in reacting to the event)? Is that also outside the scope of this document?

[Jie] What you described is a problem to be considered when a notification is sent to multiple recipients. We plan to add some text to capture this in an upcoming revision. The specific solution to this problem will be specified in a solution document. Coordination by a centralized entity is one possible approach, although, as you mentioned, it would cause extra delay.

Best regards,
Jie

Regards,
Reshad.
