This document seems related to the anomaly detection and incident management
work currently happening in the NMOP WG.
Please see my comments inline, prefixed with "PA:".
Abstract
This document defines an information model and a corresponding YANG
data model for packet discard reporting. The information model
provides an implementation-independent framework for classifying
packet loss to enable automated network mitigation of unintended
packet loss. The YANG data model specifies an implementation of this
framework for network elements.
PA: Is the model not applicable to intentional packet loss, i.e., when traffic
is dropped due to policy?
1. Introduction
Existing metrics for reporting packet loss, such as ifInDiscards,
ifOutDiscards, ifInErrors, and ifOutErrors defined in MIB-II
[RFC2863] and the YANG Data Model for Interface Management [RFC8343],
are insufficient for automating network operations. First, they lack
precision; for instance, ifInDiscards aggregates all discarded
inbound packets without specifying the cause, making it challenging
to distinguish between intended and unintended discards. Second,
these definitions are ambiguous, leading to inconsistent vendor
implementations. For example, in some implementations ifInErrors
accounts only for errored packets that are dropped, while in others,
it includes all errored packets, whether they are dropped or not.
Many implementations support more discard metrics than these,
however, they have been inconsistently implemented due to the lack of
a standardised classification scheme and clear semantics for packet
loss reporting. For example, [RFC7270] provides support for
reporting discards per flow in IPFIX using forwardingStatus, however,
the defined drop reason codes also lack sufficient clarity to
facilitate automated root cause analysis and impact mitigation, e.g.,
the "For us" reason code.
PA: So do you propose to update or supersede those documents?
3. Problem Statement
The fundamental problem for network operators is how to automatically
detect when unintended packet loss is occurring and determine the
appropriate action to mitigate it. For any network, there are a
small set of potential actions that can be taken to mitigate customer
impact when unintended packet loss is detected:
1. Take a problematic device, link, or set of devices and/or links
out of service.
2. Return a device, link, or set of devices and/or links back into
service.
3. Move traffic to other links or devices to alleviate congestion or
avoid problematic paths.
4. Roll back a recent change to a device that might have caused the
problem.
5. Escalate to a network operator as a last resort when automated
mitigation is not possible.
PA: Is this intended to be an exhaustive list of all the mitigation actions? I
could think of others, e.g., updating a device's configuration or firmware,
whether to fix a bug or to correctly classify the traffic so it's no longer
dropped.
The ability to select the appropriate mitigation action depends on
four key features of packet loss:
FEATURE-DISCARD-LOCATION: Determines which devices, interfaces and/
or flows are impacted.
PA: "flows" isn't a location. It describes what is being dropped rather than
where it's being dropped.
FEATURE-DISCARD-RATE: The rate and/or magnitude of the discards,
indicating the severity and urgency of the problem.
PA: How is this measured, e.g., in pps or as a % of bandwidth? Is it meaningful
to compare rates on different interfaces or devices? For example, two
interfaces may each discard traffic at a rate of 40pps, but it's a more serious
problem on the interface that's transmitting 50pps than on the interface that's
transmitting 500pps.
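PA: To illustrate the point, here is a minimal sketch that normalises the
discard rate by the offered load. It assumes that offered load is the sum of
transmitted and discarded packets; the function name and that definition are
assumptions of this sketch, not something the draft specifies.

```python
def discard_ratio(discarded_pps: float, transmitted_pps: float) -> float:
    """Return the fraction of offered packets that were discarded.

    Assumption of this sketch: offered load = transmitted + discarded.
    """
    offered = discarded_pps + transmitted_pps
    if offered == 0:
        return 0.0  # no traffic offered, so nothing was lost
    return discarded_pps / offered


# Same absolute discard rate (40pps), very different relative impact.
low_volume = discard_ratio(40, 50)    # roughly 0.44
high_volume = discard_ratio(40, 500)  # roughly 0.07
```

A relative measure such as this would make rates comparable across interfaces,
whereas a raw pps figure would not.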
FEATURE-DISCARD-DURATION: The duration of the discards which helps
to distinguish transient from persistent issues.
FEATURE-DISCARD-CLASS: The type or class of discards, which is
crucial for selecting the appropriate mitigation - for example:
error discards may require taking faulty components out of
service; no-buffer discards may require traffic redistribution;
policy discards typically require no automated action
PA: "policy discards" are intentional, whereas the Abstract, Introduction, and
Problem Statement focus on unintended packet loss.
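PA: For illustration, the class-to-action examples quoted above could be
sketched as a lookup gated on rate and duration. The action strings paraphrase
the draft's "may require" examples; the threshold parameters and their default
values are hypothetical and appear nowhere in the draft.

```python
from enum import Enum
from typing import Optional


class DiscardClass(Enum):
    ERROR = "error"
    NO_BUFFER = "no-buffer"
    POLICY = "policy"


# Paraphrased from the FEATURE-DISCARD-CLASS examples; None means that no
# automated action is typically required.
SUGGESTED_ACTION = {
    DiscardClass.ERROR: "take faulty component out of service",
    DiscardClass.NO_BUFFER: "redistribute traffic",
    DiscardClass.POLICY: None,
}


def suggest_action(
    discard_class: DiscardClass,
    rate_pps: float,
    duration_s: float,
    min_rate_pps: float = 1.0,     # hypothetical threshold
    min_duration_s: float = 60.0,  # hypothetical threshold
) -> Optional[str]:
    """Return a candidate mitigation, or None if no action is suggested."""
    # Low-rate or short-lived discards are treated as transient here.
    if rate_pps < min_rate_pps or duration_s < min_duration_s:
        return None
    return SUGGESTED_ACTION[discard_class]
```

Note that the policy entry maps to no action, which is where the intended
versus unintended distinction shows up in practice.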
4.2. Sub-type Definitions
PA: Nit: please consistently terminate each of the following sub-sections with
a period.
discards/policy/: These are intended discards, meaning packets
dropped by a device due to a configured policy, including: ACLs,
traffic policers, Reverse Path Forwarding (RPF) checks, DDoS
protection rules, and explicit null routes
discards/error/: These are unintended discards due to errors in
processing packets or frames. There are multiple sub-classes.
discards/error/l2/rx/: These are frames discarded due to errors in
the received Layer 2 frame, including: CRC errors, invalid MAC
addresses, invalid VLAN tags, frame size violations and other
malformed frame conditions
Evans, et al. Expires 1 January 2026 [Page 9]
Internet-Draft IM and DM for Packet Discard Reporting June 2025
discards/error/l3/rx/: These are discards which occur due to errors
in the received packet, indicating an upstream problem rather than
an issue with the device dropping the errored packets, including:
header checksum errors, MTU exceeded, invalid packet errors, i.e.,
incorrect version, incorrect header length, invalid options and
other malformed packet conditions
discards/error/l3/rx/ttl-expired: These are discards due to TTL (or
Hop limit) expiry, which can occur for the following reasons:
normal trace-route operations, end-system TTL/Hop limit set too
low, routing loops in the network.
discards/error/l3/no-route/: These are discards which occur due to a
packet not matching any route in the routing table, e.g., which
may be due to routing configuration errors or may be transient
discards during convergence.
discards/error/internal/: These are discards due to internal device
issues, including: parity errors in device memory or other
internal hardware errors. Any errored discards not explicitly
assigned to other classes are also accounted for here.
discards/no-buffer/: These are discards due to buffer exhaustion,
i.e. congestion related discards. These can be tail-drop discards
or due to an active queue management algorithm, such as RED
[RED93] or CODEL [RFC8289].
An example of possible signal-to-mitigation action mapping is
provided in Appendix B.
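PA: For illustration, the class paths above form a tree, so a consumer can
roll a leaf class up to its parents by splitting the path. This is a minimal
sketch; the function names are hypothetical, and only the path strings come
from the draft.

```python
from typing import List


def ancestors(path: str) -> List[str]:
    """Return every class on the way from the root to `path`, inclusive."""
    parts = [p for p in path.split("/") if p]
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def is_policy_discard(path: str) -> bool:
    """True if the path falls under discards/policy/ (intended discards)."""
    parts = [p for p in path.split("/") if p]
    return len(parts) >= 2 and parts[0] == "discards" and parts[1] == "policy"


chain = ancestors("discards/error/l3/rx/ttl-expired")
# chain[0] is "discards" and chain[-1] is the leaf class itself.
```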
4.3. "ietf-packet-discard-reporting-sx" YANG Module
The "ietf-packet-discard-reporting-sx" module uses the "sx" structure
defined in [RFC8791].
<CODE BEGINS>
PA: FYI, "rfcstrip" seems to be inserting an extra blank line at each page
break which I haven't noticed with other drafts/RFCs.
5.2. Implementation Requirements
The following requirements apply to the implementation of the data
model and are intended to ensure consistent implementation across
different vendors and platforms while allowing for platform-specific
optimisations where needed. While the model defines a comprehensive
set of counters and statistics, implementations MAY support a subset
of the defined features based on device capabilities and operational
requirements. However, implementations MUST clearly document which
features are supported and how they map to the model.
Requirements 1-11 relate to packets forwarded or discarded by the
device, while requirement 12 relates to packets destined for or
originating from the device:
1. All instances of Layer 2 frame or Layer 3 packet receipt,
transmission, and discards MUST be accounted for.
PA: What if the problem is that the device has insufficient resource for this
accounting?
2. All instances of Layer 2 frame or Layer 3 packet receipt,
transmission, and discards SHOULD be attributed to the physical
or logical interface of the device where they occur. Where they
cannot be attributed to the interface, they MUST be attributed
to the device.
3. An individual frame MUST only be accounted for by either the
Layer 2 traffic class or the Layer 2 discard classes within a
single direction or context, i.e., ingress or egress or device.
This is to avoid double counting.
4. An individual packet MUST only be accounted for by either the
Layer 3 traffic class or the Layer 3 discard classes within a
single direction or context, i.e., ingress or egress or device.
This is to avoid double counting.
5. A frame accounted for at Layer 2 SHOULD NOT be accounted for at
Layer 3 and vice versa. An implementation MUST indicate which
layers traffic and discards are counted against. This is to
avoid double counting.
PA: By what means MUST this be indicated?
6. The aggregate Layer 2 and Layer 3 traffic and discard classes
SHOULD account for all underlying frames or packets received,
transmitted, and discarded across all other classes.
7. The aggregate QoS traffic and no-buffer discard classes MUST
account for all underlying packets received, transmitted, and
discarded across all other classes.
8. In addition to the Layer 2 and Layer 3 aggregate classes, an
individual discarded packet MUST only account against a single
error, policy, or no-buffer discard subclass.
9. When there are multiple reasons for discarding a packet, the
ordering of discard class reporting MUST be defined.
PA: How and where should it be defined?
10. If Diffserv [RFC2475] is not used, no-buffer discards SHOULD be
reported as class[id="0"], which represents the default class.
11. When traffic is mirrored, the discard metrics MUST account for
the original traffic rather than the reflected traffic.
12. Traffic to the device control plane has its own class. However,
traffic from the device control plane MUST be accounted for in
the same way as other egress traffic.
PA: Wouldn't it be useful to account locally generated issues separately?
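PA: For illustration, requirements 8 and 9 could be sketched as follows: each
discarded packet is counted against exactly one subclass, chosen by a fixed
precedence, and the aggregate is derived from the subclass counters so the two
cannot disagree. The precedence order shown is hypothetical; the draft only
says that an ordering MUST be defined, not what it is.

```python
from collections import Counter
from typing import Iterable

# Hypothetical precedence, highest priority first. The draft does not
# specify this order.
PRECEDENCE = ["error", "policy", "no-buffer"]


class DiscardCounters:
    def __init__(self) -> None:
        self.by_subclass: Counter = Counter()

    def record_discard(self, reasons: Iterable[str]) -> str:
        """Count one discarded packet against a single subclass."""
        reasons = list(reasons)
        if not reasons:
            raise ValueError("a discard needs at least one reason")
        # Pick the reason that appears earliest in PRECEDENCE; an unknown
        # reason raises ValueError from list.index().
        chosen = min(reasons, key=PRECEDENCE.index)
        self.by_subclass[chosen] += 1
        return chosen

    @property
    def aggregate(self) -> int:
        """Aggregate discards, derived as the sum of all subclasses."""
        return sum(self.by_subclass.values())


counters = DiscardCounters()
counters.record_discard(["no-buffer", "policy"])  # counted once, as "policy"
counters.record_discard(["no-buffer"])
```

Whatever ordering an implementation chooses, documenting it in a form like
PRECEDENCE would answer the question of how and where it is defined.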
6.2. Data Model
This section is modeled after the template described in Section 3.7
of [I-D.ietf-netmod-rfc8407bis].
The YANG module specified in Section 5.4 defines a data model that is
designed to be accessed via YANG-based management protocols, such as
NETCONF [RFC6241] and RESTCONF [RFC8040]. These YANG-based
management protocols (1) have to use a secure transport layer (e.g.,
SSH [RFC4252], TLS [RFC8446], and QUIC [RFC9000]) and (2) have to use
mutual authentication.
PA: Is "have to use" a BCP 14 "MUST"?
PA: Nit: something odd happens to the indentation below:
control-plane, interfaces, and devices: Access to these data nodes
would reveal information about the attacks to which an element is
subject, misconfigurations, etc.
Also, an attacker who can inject packets can infer the efficiency
of its attack by monitoring (the increase of) some discard
counters (e.g., policy) and adjust its attack strategy
accordingly.
Appendix C. Implementation Experience
This appendix captures practical insights gained from implementing
this information model across multiple vendors' platforms, as
guidance for future implementers.
1. The number and granularity of discard classes defined in the
information model represent a compromise. It aims to provide
sufficient detail to enable appropriate automated actions while
avoiding excessive detail, which may hinder quick problem
identification. Additionally, it helps to limit the quantity of
data produced per interface, constraining the data volume and
device CPU impacts. While further granularity is possible, the
defined schema has generally proven to be sufficient for the
task of mitigating unintended packet loss.
2. There are many possible ways to define the discard
classification tree. For example, we could have used a multi-
rooted tree, rooted in each protocol. Instead, we opted to
define a tree where protocol discards and causal discard classes
are accounted for orthogonally. This decision reduces the
number of combinations of classes and has proven sufficient for
determining mitigation actions.
3. NoBuffer discards can be realized differently with different
memory architectures. Whether a NoBuffer discard is attributed
to ingress or egress can differ accordingly. For successful
auto-mitigation, discards due to egress interface congestion
should be reported on egress, while discards due to device-level
congestion (e.g. due to exceeding the device forwarding rate)
should be reported on ingress.
PA: Are those BCP 14 "SHOULD"s?
9. In cases where the reporting device is the source or destination
of a tunnel, the ingress protocol for a packet may differ from
the egress protocol (e.g., if IPv4 is tunneled over IPv6). Some
implementations may attribute egress discards to the ingress
protocol.
PA: Attributing in this way could mislead diagnosis, so whether and when to do
this should be discussed in more detail in the body of the document.
10. While the classification tree is seven layers deep, a minimal
implementation may only implement the top six layers.
PA: Is that a BCP 14 "MAY"?