Hi,
I've been reading draft-dong-fantel-problem-statement. I think there is a
lot of interesting stuff in Fantel, and I'd like to see this problem
statement consolidated and pick up consensus.
My review threw up a few significant points, and a raft of editorials. I'd
be happy to discuss them more or see a new revision.
Best,
Adrian
===
Section 4.1 is presented as an example, so the text that follows in 4.1.1
should be limited to the tools that apply to the example, and not a general
description of the tools (that material is found elsewhere in the document).
So...
BFD
This paragraph is all true, but it should be better focussed on the example
in Figure 1. Congestion notification is not really part of that example
(although it is true that BFD doesn't help with it). So stick
to:
- Speed of BFD propagation
- Requirement to be running BFD with a very short cycle
- Load issues this may create
I am a little sceptical of the load concern because if the link is
ultra-high bandwidth (as you'd want in this example) and the number of such
links is small (as is likely in this example), running BFD on a very short
cycle is unlikely to cause a load issue.
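A rough back-of-envelope sketch of that point follows. Every figure in it is an assumption for illustration (a 1 ms transmit interval, a 400 Gb/s link, an unauthenticated BFD control packet over UDP/IPv4/Ethernet) and none is taken from the draft. It covers only the bandwidth consumed on the link, not the per-packet processing cost at the endpoints.

```python
# Back-of-envelope only: all numbers are assumptions, not values from the draft.
BFD_CONTROL_BYTES = 24      # mandatory section of a BFD control packet, no authentication
ENCAP_OVERHEAD_BYTES = 66   # assumed: UDP 8 + IPv4 20 + Ethernet 14 + FCS 4 + preamble/gap 20


def bfd_link_share(tx_interval_us, link_bps):
    """Fraction of one direction of a link consumed by a single BFD session."""
    bits_per_packet = (BFD_CONTROL_BYTES + ENCAP_OVERHEAD_BYTES) * 8
    packets_per_second = 1_000_000 / tx_interval_us
    return bits_per_packet * packets_per_second / link_bps


# Hypothetical example: one session at a 1 ms interval on a 400 Gb/s link.
share = bfd_link_share(tx_interval_us=1000, link_bps=400e9)
print(f"{share:.1e}")
```

Under these assumptions the session uses on the order of a millionth of the link, so any real load concern would have to come from the number of sessions or from packet processing rather than from bandwidth.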
ECN
This paragraph doesn't apply to the example at all.
---
There is a feeling of repetition. When I got to 4.1.1 and 4.1.2 it felt that
I had already seen this message in section 3. I think you need to separate
things out and re-order the document.
Addressing the previous point will help with this.
"Why Fast Network Notification is Needed" should come first, but it should
talk in technology-independent terms about the requirements. Then "The
Problem with Existing Notification Mechanisms" can explain why new
mechanisms are needed by showing the challenges with existing tools, and
include the example. And finally "Fast Network Notifications Detailed
Problem Statement" can present the details.
---
I think section 5.4 (with some interaction with 5.2) is the place to discuss
multiple recipients of the same notification. This is a classic problem even
in old alarm-based systems, and mechanisms are either designed into the
error reporting (such as APS) or are coordinated by the error processing
system (such as through an alarm management system).
However, I think you are targeting a different environment. That is, you are
not limited to the specific and simple topologies that are consistent with
APS, you are not looking only to achieve end-to-end protection switching,
and you envisage propagating the notifications to quite a number of
recipients.
The challenge becomes, what happens if multiple nodes all react to the
notification, making changes to traffic flows, and interacting with the
network in different ways?
I think this either needs coordination ("central control" at the level of
SDN) or careful pre-planning.
Probably, this section does not need to fully resolve this question, but it
should be raised as an issue so that the solutions work will take it into
account.
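A toy illustration of the concern, using hypothetical numbers and function names (nothing here comes from the draft): two nodes each move 60 units of traffic onto a backup link of capacity 100 after receiving the same notification.

```python
def react_independently(flows, backup_capacity):
    """Each node checks the backup link against the same stale snapshot
    (an empty link) and moves its whole flow if that flow alone would fit."""
    return sum(flow for flow in flows if flow <= backup_capacity)


def react_coordinated(flows, backup_capacity):
    """Flows are admitted one after another against the remaining capacity,
    as a central controller or a pre-planned allocation might do."""
    placed = 0
    for flow in flows:
        placed += min(flow, backup_capacity - placed)
    return placed


flows = [60, 60]        # traffic each node wants to move (assumed units)
backup_capacity = 100   # capacity of the shared backup link (assumed)

uncoordinated_load = react_independently(flows, backup_capacity)
coordinated_load = react_coordinated(flows, backup_capacity)
print(uncoordinated_load, coordinated_load)
```

In this sketch the independent reactions offer 120 units to a link that can carry 100, while the coordinated case stops at 100 and leaves the remainder to be handled some other way.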
---
Section 8 is good as far as it goes. I think you can add to it by thinking
in a paranoid way!
- Could the notifications reveal information about the network that is
intended to be private but now made visible to external snooping?
- Possibly by inspecting notifications
- Possibly by registering as a consumer of the notifications
- Could an attacker (or a misconfiguration) cause the reporting system
to become overwhelmed (perhaps by making it look like notifications
should be sent everywhere, or by flapping a resource) to the extent
that important notifications are lost, or the ability to control the
system is broken?
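The second question can be pictured with a small sketch. It assumes a hypothetical reporting node with a bounded, tail-drop notification queue; the queue model, the capacity, and the event names are all invented for illustration and are not taken from the draft.

```python
def enqueue_all(events, capacity):
    """Tail-drop queue: once the queue is full, later events are discarded."""
    queue, dropped = [], []
    for event in events:
        if len(queue) < capacity:
            queue.append(event)
        else:
            dropped.append(event)
    return queue, dropped


# A flapping resource generates a burst of up/down events...
flap = [f"link-A {'down' if i % 2 == 0 else 'up'}" for i in range(8)]
# ...followed by one notification that actually matters.
events = flap + ["link-B down"]

queue, dropped = enqueue_all(events, capacity=8)
print(dropped)
```

Here the flapping link fills the queue and the only notification lost is the important one, which is the kind of outcome the security considerations could call out.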
== Editorial (obviously, non-blocking) ==
Abstract
Nothing wrong here, but I'd swap some things around to make it clear why AI
training and real-time services need these features.
Modern networks require adaptive traffic manipulation including
Traffic Engineering (TE), load balancing, flow control, and
protection, to support high-throughput, low-latency, and lossless
applications such as AI training and real-time services.
A good and timely understanding of network operational status, such
as congestion and failures, can help to improve network utilization,
enable the selection of paths with reduced latency, and enable faster
response to critical events. This document describes the existing
problems and why the IETF may need a new set of fast network
notification solutions.
---
1.
OLD
This document summarizes the limitations of existing mechanisms that
prevent rapid notification and action to critical network events,
including link or node failures and congestion.
NEW
This document summarizes the limitations of existing mechanisms that
prevent them being used for rapid notification of critical network
events, including link or node failures and congestion.
END
---
1.
s/In the context of this draft/In the context of this document/
---
1.
This document describes why the IETF may need a new set of fast
network notification related solutions to support these use cases.
I don't think the IETF has "needs" like that. Possibly vendors need protocol
tools to deliver function for the operators?
I'd also say that "may need" is too weak. s/may need/needs/
---
1.1 Needs to be updated to the correct boilerplate.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
You'll need to add a reference for RFC 8174.
---
I'd add the references into Section 2. I'd also put the terms in
alphabetical order.
BFD: Bidirectional Forwarding Detection [RFC5880]
ECN: Explicit Congestion Notification [RFC3168]
FRR: Fast Re-Route [RFC4090][RFC5714]
IOAM: In-situ Operations, Administration, and Maintenance [RFC9197]
---
3.
s/has deficiencies/have deficiencies/
s/is proposed as a/is a/
---
3.
There is
a demonstrable need for a standardized framework in IETF to define
these fast network notification mechanisms, requirements and
integration strategies.
Again, don't tell the IETF what it needs. Tell us about what implementers
and operators need. You might use this text in a BoF proposal, but not in a
document that will be published as an RFC.
---
3.
s/The following describes a summary/There follows a summary/
---
3.
* Slow Reaction:
I wonder about this term. I think it is "Slow Dissemination:" because the
problem is not that the node detecting the fault is slow to react.
OLD
What is needed is a lightweight signaling method
that can provide real-time alerts (e.g., at the level of sub-
milliseconds or milliseconds) on failures, congestion, or
threshold breaches, enabling immediate actions (e.g., in ms to 10s
ms ranges) in the network layer.
NEW
What is needed is a lightweight signaling method
that can provide real-time alerts (e.g., at the sub-millisecond
level or in the order of a few milliseconds) on failures,
congestion, or threshold breaches, enabling prompt actions (e.g.,
in the range of a millisecond to 10s of milliseconds) in the
network layer.
END
---
3.
s/capacity , or/capacity, or/
s/reports but/reports, but/
s/load-balancing, flow-control/load-balancing, flow-control,/
s/and can lead/and leading/
---
3.
The local view of network status prevents precise and globally
optimized decisions and adjustments. It would be helpful to send
fast network notifications to upstream nodes which can perform
action based on the view of regional or global network conditions.
The second sentence helps to explain, but I think that the term "global
optimization" has a wider concept and can only be achieved using centralized
computation and a complete view of all network conditions and traffic flows.
I think you are trying to do something in between.
That is, you are trying to have a node make decisions about how to steer
traffic that it is responsible for, with awareness of a set of network
conditions that are relevant to the paths it might choose. So I would
suggest
NEW
This local view of network status prevents precise and optimized
decisions and adjustments. It would be helpful to send fast
network notifications to upstream nodes so that they can perform
action based on a wider view of network conditions.
END
---
3.
s/(e.g. routing/(e.g., routing/
s/(e.g. AI workloads/(e.g., AI workloads/
---
4.
s/In particular, failure-detection/Failure-detection/
s/(fine-grained vs. coarse-grained)/(fine-grained vs. coarse-grained)/
s/Therefore, this draft/Therefore, this document/
---
4.1
I think the terms AI, ML, and GPU need expansion on first use.
---
4.1.1 BFD
OLD
The
former lacks of global network situation thus may cause congestion
on the backup paths, while the latter may breach strict
synchronization deadlines.
NEW
The
former lacks visibility of the global network situation and thus
may cause congestion on the backup paths, while the latter may
breach strict synchronization requirements of the AI/ML
application.
END
---
4.1.2
s/can be affected/might be affected/
But I wonder how the nodes adjacent to the failure know which nodes to
notify. Obviously, all adjacent nodes (except any connected only by the
failed link). But the whole point seems to be to propagate the notification
further.
---
5.1
s/timely actioned/actioned in a timely manner/
---
5.2
s/recipients:/recipient:/
s/functional consumers:/functional consumers./
s/in the figure above/in Figure 2/
---
5.2
Tables 1 and 2 have a column "Example Benefit." It is unclear "benefit of
what, and to whom." I think you can handle this with a little more
introductory text, like...
The tables have three columns. The first column lists the type of
node or type of application/function. The second shows the role that
the node or application/function is responsible for within the
network that could benefit from fast network notifications. The
third column indicates examples of how fast notification could
benefit the node/application/function in filling its role.
---
Table 2 has...
| Traffic Engineering | Centralized | Pre-compute new paths |
| Element (PCE) | optimization | before congestion |
| | | propagates |
It is true that this is one role of the PCE, and also true that a PCE is a
component of a "traffic engineering element." But I think that it is not the
primary role. Perhaps, in the paragraph I suggested above, it should say that
the second column shows an example of the role.
---
It looks like there is an implication in Figure 2 that notifications flow
from data plane to control plane to management plane to application plane.
I hope that isn't your intention, because I don't think that is how things
work.
Maybe the figure is just a list of four categories of notification recipient
without the arrows?
---
5.2
"near-instantaneous"
Same concern as before: everything is relative, but "near instantaneous"
is probably going to attract the wrong response from people. Maybe, "very
quick," or even, "very, very quick."
---
5.2
s/something needs/something that needs/
---
5.3
s/recipient noded/recipient node/
---
5.4
OLD
The possible actions to the notification can be but not
limit to one or multiple of the following:
NEW
The possible actions in response to the notification can be, but are not
limited to, one or more of the following:
END
---
I'm not sure Section 6 adds to the draft.