Hi Paul,

Many thanks for your feedback. Please see JE: inline.

Cheers
John

From: "Aitken, Paul" <[email protected]>
Date: Tuesday 15 July 2025 at 21:39
To: "<[email protected]>" <[email protected]>
Cc: "[email protected]" <[email protected]>, opsawg <[email protected]>
Subject: [EXTERNAL] [OPSAWG]draft-ietf-opsawg-discardmodel-08 review

This document seems related to the anomaly detection and incident management work currently happening in the NMOP WG. Please see PA: inline.

Abstract

This document defines an information model and a corresponding YANG data model for packet discard reporting. The information model provides an implementation-independent framework for classifying packet loss to enable automated network mitigation of unintended packet loss. The YANG data model specifies an implementation of this framework for network elements.

PA: Is the model not applicable to intentional packet loss, ie when traffic is dropped due to policy?

JE: The model separately classifies intended and unintended discards; the nuance was that this is so that we can mitigate unintended discards. Will rephrase to make clearer.

1. Introduction

Existing metrics for reporting packet loss, such as ifInDiscards, ifOutDiscards, ifInErrors, and ifOutErrors defined in MIB-II [RFC2863] and the YANG Data Model for Interface Management [RFC8343], are insufficient for automating network operations. First, they lack precision; for instance, ifInDiscards aggregates all discarded inbound packets without specifying the cause, making it challenging to distinguish between intended and unintended discards. Second, these definitions are ambiguous, leading to inconsistent vendor implementations. For example, in some implementations ifInErrors accounts only for errored packets that are dropped, while in others, it includes all errored packets, whether they are dropped or not.
Many implementations support more discard metrics than these, however, they have been inconsistently implemented due to the lack of a standardised classification scheme and clear semantics for packet loss reporting. For example, [RFC7270] provides support for reporting discards per flow in IPFIX using forwardingStatus, however, the defined drop reason codes also lack sufficient clarity to facilitate automated root cause analysis and impact mitigation, e.g., the "For us" reason code.

PA: So do you propose to update or supersede those documents?

JE: Practically, I think the model should supersede the definitions of ifInDiscards, ifOutDiscards, ifInErrors, and ifOutErrors, however, I don’t understand the IETF processes in this case.

3. Problem Statement

The fundamental problem for network operators is how to automatically detect when unintended packet loss is occurring and determine the appropriate action to mitigate it.

For any network, there are a small set of potential actions that can be taken to mitigate customer impact when unintended packet loss is detected:

1. Take a problematic device, link, or set of devices and/or links out of service.
2. Return a device, link, or set of devices and/or links back into service.
3. Move traffic to other links or devices to alleviate congestion or avoid problematic paths.
4. Roll back a recent change to a device that might have caused the problem.
5. Escalate to a network operator as a last resort when automated mitigation is not possible.

PA: Is this intended to be an exhaustive list of all the reasons? I could think of others, eg update a device configuration or firmware, whether to fix a bug or to correctly classify the traffic so it's no longer dropped.

JE: Not meant to be exhaustive – will clarify.
Typically, however, updating a device configuration or firmware would be a remediation action (to fix the underlying problem), rather than a mitigation action (to stop the current packet loss), but I will update the text to be more permissive.

The ability to select the appropriate mitigation action depends on four key features of packet loss:

FEATURE-DISCARD-LOCATION: Determines which devices, interfaces and/or flows are impacted.

PA: "flows" isn't a location. It describes what is being dropped rather than where it's being dropped.

JE: Yes – propose to change this to FEATURE-DISCARD-SCOPE instead

FEATURE-DISCARD-RATE: The rate and/or magnitude of the discards, indicating the severity and urgency of the problem.

PA: How is this measured, eg pps or % of bandwidth? Is it meaningful to compare rates on different interfaces or devices? eg, two interfaces may discard traffic at a rate of 40pps, but it's a more serious problem on the interface that's transmitting 50pps than on the interface that's transmitting 500pps.

JE: It can be either depending on the class of discard. For example, for errored loss (which is not typically randomly distributed) the absolute rate is more meaningful, where for congestive loss (which is randomly distributed) the relative rate is more meaningful. Both are available or can be derived from the model. I will update the text to include both.

FEATURE-DISCARD-DURATION: The duration of the discards which helps to distinguish transient from persistent issues.

FEATURE-DISCARD-CLASS: The type or class of discards, which is crucial for selecting the appropriate mitigation - for example: error discards may require taking faulty components out of service; no-buffer discards may require traffic redistribution; policy discards typically require no automated action

PA: "policy discards" are intentional, whereas the Abstract, Introduction, and Problem Statement focus on unintended packet loss.

JE: Will update the text to clarify

4.2. Sub-type Definitions

PA: Nit: please consistently terminate each of the following sub-sections with a period.

JE: ack

discards/policy/: These are intended discards, meaning packets dropped by a device due to a configured policy, including: ACLs, traffic policers, Reverse Path Forwarding (RPF) checks, DDoS protection rules, and explicit null routes

discards/error/: These are unintended discards due to errors in processing packets or frames. There are multiple sub-classes.

discards/error/l2/rx/: These are frames discarded due to errors in the received Layer 2 frame, including: CRC errors, invalid MAC addresses, invalid VLAN tags, frame size violations and other malformed frame conditions

Evans, et al. Expires 1 January 2026 [Page 9]
Internet-Draft IM and DM for Packet Discard Reporting June 2025

discards/error/l3/rx/: These are discards which occur due to errors in the received packet, indicating an upstream problem rather than an issue with the device dropping the errored packets, including: header checksum errors, MTU exceeded, invalid packet errors, i.e., incorrect version, incorrect header length, invalid options and other malformed packet conditions

discards/error/l3/rx/ttl-expired: These are discards due to TTL (or Hop limit) expiry, which can occur for the following reasons: normal trace-route operations, end-system TTL/Hop limit set too low, routing loops in the network.

discards/error/l3/no-route/: These are discards which occur due to a packet not matching any route in the routing table, e.g., which may be due to routing configuration errors or may be transient discards during convergence.

discards/error/internal/: These are discards due to internal device issues, including: parity errors in device memory or other internal hardware errors. Any errored discards not explicitly assigned to other classes are also accounted for here.

discards/no-buffer/: These are discards due to buffer exhaustion, i.e. congestion related discards.
These can be tail-drop discards or due to an active queue management algorithm, such as RED [RED93] or CODEL [RFC8289].

An example of possible signal-to-mitigation action mapping is provided in Appendix B.

4.3. "ietf-packet-discard-reporting-sx" YANG Module

The "ietf-packet-discard-reporting-sx" module uses the "sx" structure defined in [RFC8791].

<CODE BEGINS>

PA: FYI, "rfcstrip" seems to be inserting an extra blank line at each page break which I haven't noticed with other drafts/RFCs.

5.2. Implementation Requirements

The following requirements apply to the implementation of the data model and are intended to ensure consistent implementation across different vendors and platforms while allowing for platform-specific optimisations where needed. While the model defines a comprehensive set of counters and statistics, implementations MAY support a subset of the defined features based on device capabilities and operational requirements. However, implementations MUST clearly document which features are supported and how they map to the model.

Requirements 1-11 relate to packets forwarded or discarded by the device, while requirement 12 relates to packets destined for or originating from the device:

1. All instances of Layer 2 frame or Layer 3 packet receipt, transmission, and discards MUST be accounted for.

PA: What if the problem is that the device has insufficient resource for this accounting?

JE: Then it would not be compliant with the draft. As an operator, a device that silently drops packets and does not report them (i.e. a grey failure) due to insufficient resources (or any other reason) is an issue we would work to address. Hence this draft.

2. All instances of Layer 2 frame or Layer 3 packet receipt, transmission, and discards SHOULD be attributed to the physical or logical interface of the device where they occur. Where they cannot be attributed to the interface, they MUST be attributed to the device.

3. An individual frame MUST only be accounted for by either the Layer 2 traffic class or the Layer 2 discard classes within a single direction or context, i.e., ingress or egress or device. This is to avoid double counting.

4. An individual packet MUST only be accounted for by either the Layer 3 traffic class or the Layer 3 discard classes within a single direction or context, i.e., ingress or egress or device. This is to avoid double counting.

5. A frame accounted for at Layer 2 SHOULD NOT be accounted for at Layer 3 and vice versa. An implementation MUST indicate which layers traffic and discards are counted against. This is to avoid double counting.

PA: By what means MUST this be indicated?

JE: Great question. I’m not sure if there’s any precedent for this type of information? We could add a template in comments in the model?

6. The aggregate Layer 2 and Layer 3 traffic and discard classes SHOULD account for all underlying frames or packets received, transmitted, and discarded across all other classes.

7. The aggregate QoS traffic and no-buffer discard classes MUST account for all underlying packets received, transmitted, and discarded across all other classes.

8. In addition to the Layer 2 and Layer 3 aggregate classes, an individual discarded packet MUST only account against a single error, policy, or no-buffer discard subclass.

9. When there are multiple reasons for discarding a packet, the ordering of discard class reporting MUST be defined.

PA: How and where should it be defined?

10. If Diffserv [RFC2475] is not used, no-buffer discards SHOULD be reported as class[id="0"], which represents the default class.

11. When traffic is mirrored, the discard metrics MUST account for the original traffic rather than the reflected traffic.

12. Traffic to the device control plane has its own class.
However, traffic from the device control plane MUST be accounted for in the same way as other egress traffic.

PA: Wouldn't it be useful to account locally generated issues separately?

6.2. Data Model

This section is modeled after the template described in Section 3.7 of [I-D.ietf-netmod-rfc8407bis].

The YANG module specified in Section 5.4 defines a data model that is designed to be accessed via YANG-based management protocols, such as NETCONF [RFC6241] and RESTCONF [RFC8040]. These YANG-based management protocols (1) have to use a secure transport layer (e.g., SSH [RFC4252], TLS [RFC8446], and QUIC [RFC9000]) and (2) have to use mutual authentication.

PA: Is "have to use" a BCP 14 "MUST"?

JE: Med?

PA: Nit: something odd happens to the indentation below:

JE: ack

control-plane, interfaces, and devices: Access to these data nodes would reveal information about the attacks to which an element is subject, misconfigurations, etc. Also, an attacker who can inject packets can infer the efficiency of its attack by monitoring (the increase of) some discard counters (e.g., policy) and adjust its attack strategy accordingly.

Appendix C. Implementation Experience

This appendix captures practical insights gained from implementing this information model across multiple vendors' platforms, as guidance for future implementers.

1. The number and granularity of discard classes defined in the information model represent a compromise. It aims to provide sufficient detail to enable appropriate automated actions while avoiding excessive detail, which may hinder quick problem identification. Additionally, it helps to limit the quantity of data produced per interface, constraining the data volume and device CPU impacts. While further granularity is possible, the defined schema has generally proven to be sufficient for the task of mitigating unintended packet loss.

2. There are many possible ways to define the discard classification tree.
For example, we could have used a multi-rooted tree, rooted in each protocol. Instead, we opted to define a tree where protocol discards and causal discard classes are accounted for orthogonally. This decision reduces the number of combinations of classes and has proven sufficient for determining mitigation actions.

3. NoBuffer discards can be realized differently with different memory architectures. Whether a NoBuffer discard is attributed to ingress or egress can differ accordingly. For successful auto-mitigation, discards due to egress interface congestion should be reported on egress, while discards due to device-level congestion (e.g. due to exceeding the device forwarding rate) should be reported on ingress.

PA: Are those BCP-14 SHOULDs?

JE: Yes – I’m thinking to move these to the requirements section

9. In cases where the reporting device is the source or destination of a tunnel, the ingress protocol for a packet may differ from the egress protocol (e.g., if IPv4 is tunneled over IPv6). Some implementations may attribute egress discards to the ingress protocol.

PA: Attributing in this way could mislead diagnosis, so whether and when to do this should be discussed in more detail in the body of the document.

JE: I think it depends. Our practical experience is this is the way most hw works today and – as an operator – this hasn’t caused us issues. Hence, I’m reticent to mandate a change. I think this is another case where the behaviour should be documented. Will also add something to the requirements, i.e. a SHOULD and document if not supported

10. While the classification tree is seven layers deep, a minimal implementation may only implement the top six layers.

PA: Is that a BCP 14 "MAY" ?

JE: No – will change to ‘might’
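
Illustration of the FEATURE-DISCARD-RATE exchange earlier in this thread (absolute versus relative rate): a minimal Python sketch using the 40pps / 50pps / 500pps figures from PA's comment. The function name and the choice of denominator (discarded plus transmitted packets) are assumptions made for illustration only; neither the draft nor this thread defines how the relative rate is computed.

```python
def relative_discard_rate(discarded_pps: float, transmitted_pps: float) -> float:
    """Fraction of offered packets that were discarded.

    Assumption (not from the draft): offered load is taken to be
    discarded + transmitted packets per second.
    """
    offered = discarded_pps + transmitted_pps
    if offered == 0:
        return 0.0  # no traffic at all, so no loss to report
    return discarded_pps / offered


# Both interfaces have the same absolute discard rate (40 pps) ...
busy = relative_discard_rate(discarded_pps=40, transmitted_pps=500)
quiet = relative_discard_rate(discarded_pps=40, transmitted_pps=50)

# ... but very different relative rates.
print(f"busy interface:  {busy:.1%}")   # 7.4%
print(f"quiet interface: {quiet:.1%}")  # 44.4%
```

Under this assumed definition the two interfaces are indistinguishable by absolute rate but differ by roughly a factor of six in relative rate, which is the distinction JE draws between errored loss and congestive loss.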
