[OPSAWG]Re: draft-ietf-opsawg-discardmodel-08 review

From: Evans, John
Date: Mon, 21 Jul 2025 08:07:12 -0700

Hi Paul,

Many thanks for your feedback.

Please see my replies inline, marked JE:

Cheers

John

From: "Aitken, Paul" <[email protected]>
Date: Tuesday 15 July 2025 at 21:39
To: "<[email protected]>" <[email protected]>
Cc: "[email protected]" 
<[email protected]>, opsawg <[email protected]>
Subject: [EXTERNAL] [OPSAWG]draft-ietf-opsawg-discardmodel-08 review

This document seems related to the anomaly detection and incident management 
work currently happening in the NMOP WG.

Please see PA: inline.

Abstract

   This document defines an information model and a corresponding YANG
   data model for packet discard reporting.  The information model
   provides an implementation-independent framework for classifying
   packet loss to enable automated network mitigation of unintended
   packet loss.  The YANG data model specifies an implementation of this
   framework for network elements.

PA: Is the model not applicable to intentional packet loss, i.e., when traffic 
is dropped due to policy?

JE: The model classifies intended and unintended discards separately; the 
point of the wording is that this separation is what allows unintended 
discards to be mitigated.  Will rephrase to make this clearer.

1.  Introduction
   Existing metrics for reporting packet loss, such as ifInDiscards,
   ifOutDiscards, ifInErrors, and ifOutErrors defined in MIB-II
   [RFC2863] and the YANG Data Model for Interface Management [RFC8343],
   are insufficient for automating network operations.  First, they lack
   precision; for instance, ifInDiscards aggregates all discarded
   inbound packets without specifying the cause, making it challenging
   to distinguish between intended and unintended discards.  Second,
   these definitions are ambiguous, leading to inconsistent vendor
   implementations.  For example, in some implementations ifInErrors
   accounts only for errored packets that are dropped, while in others,
   it includes all errored packets, whether they are dropped or not.
   Many implementations support more discard metrics than these,
   however, they have been inconsistently implemented due to the lack of
   a standardised classification scheme and clear semantics for packet
   loss reporting.  For example, [RFC7270] provides support for
   reporting discards per flow in IPFIX using forwardingStatus, however,
   the defined drop reason codes also lack sufficient clarity to
   facilitate automated root cause analysis and impact mitigation, e.g.,
   the "For us" reason code.

PA: So do you propose to update or supersede those documents?

JE: Practically, I think the model should supersede the definitions of 
ifInDiscards, ifOutDiscards, ifInErrors, and ifOutErrors.  However, I don’t 
know what the IETF process is in this case.

3.  Problem Statement

   The fundamental problem for network operators is how to automatically
   detect when unintended packet loss is occurring and determine the
   appropriate action to mitigate it.  For any network, there are a
   small set of potential actions that can be taken to mitigate customer
   impact when unintended packet loss is detected:

   1.  Take a problematic device, link, or set of devices and/or links
       out of service.

   2.  Return a device, link, or set of devices and/or links back into
       service.

   3.  Move traffic to other links or devices to alleviate congestion or
       avoid problematic paths.

   4.  Roll back a recent change to a device that might have caused the
       problem.

   5.  Escalate to a network operator as a last resort when automated
       mitigation is not possible.

PA: Is this intended to be an exhaustive list of all the reasons? I could think 
of others, e.g., update a device configuration or firmware, whether to fix a 
bug or to correctly classify the traffic so it's no longer dropped.

JE: Not meant to be exhaustive; will clarify.  Typically, though, updating a 
device configuration or firmware would be a remediation action (to fix the 
underlying problem) rather than a mitigation action (to stop the current 
packet loss).  I will update the text to be more permissive.

   The ability to select the appropriate mitigation action depends on
   four key features of packet loss:

   FEATURE-DISCARD-LOCATION:  Determines which devices, interfaces and/
      or flows are impacted.

PA: "flows" isn't a location. It describes what is being dropped rather than 
where it's being dropped.

JE: Yes.  I propose changing this to FEATURE-DISCARD-SCOPE instead.

   FEATURE-DISCARD-RATE:  The rate and/or magnitude of the discards,
      indicating the severity and urgency of the problem.

PA: How is this measured, e.g., pps or % of bandwidth? Is it meaningful to 
compare rates on different interfaces or devices? E.g., two interfaces may 
discard traffic at a rate of 40 pps, but it's a more serious problem on the 
interface that's transmitting 50 pps than on the interface that's transmitting 
500 pps.

JE: It can be either, depending on the class of discard.  For example, for 
errored loss (which is not typically randomly distributed) the absolute rate is 
more meaningful, whereas for congestive loss (which is randomly distributed) 
the relative rate is more meaningful.  Both are available or can be derived 
from the model.  I will update the text to include both.
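
The 40 pps example above can be worked through with a minimal sketch.  It 
assumes packet counts for discarded and transmitted traffic are available over 
a sampling interval, and it takes the relative rate to be discards divided by 
discards plus transmitted packets.  The function and parameter names are 
illustrative only and do not come from the draft.

```python
def discard_rates(discarded, transmitted, interval_s):
    """Return (absolute rate in pps, relative rate as a fraction).

    Assumption: offered load = discarded + transmitted packets in the interval.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    absolute_pps = discarded / interval_s
    offered = discarded + transmitted
    # Guard against an idle interface with no traffic at all.
    relative = discarded / offered if offered else 0.0
    return absolute_pps, relative


# Two interfaces, each discarding 40 pps over a 10 second interval.
low_rate = discard_rates(discarded=400, transmitted=500, interval_s=10)
high_rate = discard_rates(discarded=400, transmitted=5000, interval_s=10)
```

Both interfaces report the same absolute rate, but under these assumptions the 
relative rate is roughly 44% on the interface transmitting 50 pps and roughly 
7% on the one transmitting 500 pps.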

   FEATURE-DISCARD-DURATION:  The duration of the discards which helps
      to distinguish transient from persistent issues.

   FEATURE-DISCARD-CLASS:  The type or class of discards, which is
      crucial for selecting the appropriate of mitigation - for example:
      error discards may require taking faulty components out of
      service; no-buffer discards may require traffic redistribution;
      policy discards typically require no automated action

PA: "policy discards" are intentional, whereas the Abstract, Introduction, and 
Problem Statement focus on unintended packet loss.

JE: Will update the text to clarify
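
The FEATURE-DISCARD-CLASS examples quoted above can be sketched as a lookup 
from discard class to candidate mitigation.  This is an illustration only: the 
class paths are those listed in Section 4.2 of the draft, while the table, the 
fallback, and the function name are assumptions.  The draft says these classes 
"may require" the action, so a real system would also weigh rate, duration, and 
scope.

```python
# Hypothetical mapping from discard class prefix to a candidate mitigation,
# following the examples given for FEATURE-DISCARD-CLASS.
MITIGATION_BY_PREFIX = {
    "discards/error/": "take faulty component out of service",
    "discards/no-buffer/": "redistribute traffic",
    "discards/policy/": "no automated action",
}

# Assumed fallback: escalate when no class-specific action is known.
FALLBACK = "escalate to operator"


def suggest_mitigation(discard_class):
    """Return the candidate mitigation for the longest matching class prefix."""
    matches = [p for p in MITIGATION_BY_PREFIX if discard_class.startswith(p)]
    if not matches:
        return FALLBACK
    return MITIGATION_BY_PREFIX[max(matches, key=len)]
```

Matching on the longest prefix leaves room for a more specific entry (for 
example, one for discards/error/l3/rx/ttl-expired) to override the general 
error action without changing the function.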

4.2.  Sub-type Definitions

PA: Nit: please consistently terminate each of the following sub-sections with 
a period.

JE: ack

   discards/policy/:  These are intended discards, meaning packets
      dropped by a device due to a configured policy, including: ACLs,
      traffic policers, Reverse Path Forwarding (RPF) checks, DDoS
      protection rules, and explicit null routes

   discards/error/:  These are unintended discards due to errors in
      processing packets or frames.  There are multiple sub-classes.

   discards/error/l2/rx/:  These are frames discarded due to errors in
      the received Layer 2 frame, including: CRC errors, invalid MAC
      addresses, invalid VLAN tags, frame size violations and other
      malformed frame conditions

Evans, et al.            Expires 1 January 2026                 [Page 9]
Internet-Draft   IM and DM for Packet Discard Reporting        June 2025

   discards/error/l3/rx/:  These are discards which occur due to errors
      in the received packet, indicating an upstream problem rather than
      an issue with the device dropping the errored packets, including:
      header checksum errors, MTU exceeded, invalid packet errors, i.e.,
      incorrect version, incorrect header length, invalid options and
      other malformed packet conditions

   discards/error/l3/rx/ttl-expired:  These are discards due to TTL (or
      Hop limit) expiry, which can occur for the following reasons:
      normal trace-route operations, end-system TTL/Hop limit set too
      low, routing loops in the network.

   discards/error/l3/no-route/:  These are discards which occur due to a
      packet not matching any route in the routing table, e.g., which
      may be due to routing configuration errors or may be transient
      discards during convergence.

   discards/error/internal/:  These are discards due to internal device
      issues, including: parity errors in device memory or other
      internal hardware errors.  Any errored discards not explicitly
      assigned to other classes are also accounted for here.

   discards/no-buffer/:  These are discards due to buffer exhaustion,
      i.e. congestion related discards.  These can be tail-drop discards
      or due to an active queue management algorithm, such as RED
      [RED93] or CODEL [RFC8289].

   An example of possible signal-to-mitigation action mapping is
   provided in Appendix B.

4.3.  "ietf-packet-discard-reporting-sx" YANG Module

   The "ietf-packet-discard-reporting-sx" module uses the "sx" structure
   defined in [RFC8791].

   <CODE BEGINS>

PA: FYI, "rfcstrip" seems to be inserting an extra blank line at each page 
break which I haven't noticed with other drafts/RFCs.

5.2.  Implementation Requirements

   The following requirements apply to the implementation of the data
   model and are intended to ensure consistent implementation across
   different vendors and platforms while allowing for platform-specific
   optimisations where needed.  While the model defines a comprehensive
   set of counters and statistics, implementations MAY support a subset
   of the defined features based on device capabilities and operational
   requirements.  However, implementations MUST clearly document which
   features are supported and how they map to the model.

   Requirements 1-11 relate to packets forwarded or discarded by the
   device, while requirement 12 relates to packets destined for or
   originating from the device:

   1.   All instances of Layer 2 frame or Layer 3 packet receipt,
        transmission, and discards MUST be accounted for.

PA: What if the problem is that the device has insufficient resource for this 
accounting?

JE: Then it would not be compliant with the draft.  As an operator, a device 
that silently drops packets and does not report them (i.e. a grey failure) due 
to insufficient resources (or any other reason) is an issue we would work to 
address.  Hence this draft.

   2.   All instances of Layer 2 frame or Layer 3 packet receipt,
        transmission, and discards SHOULD be attributed to the physical
        or logical interface of the device where they occur.  Where they
        cannot be attributed to the interface, they MUST be attributed
        to the device.

   3.   An individual frame MUST only be accounted for by either the
        Layer 2 traffic class or the Layer 2 discard classes within a
        single direction or context, i.e., ingress or egress or device.
        This is to avoid double counting.

   4.   An individual packet MUST only be accounted for by either the
        Layer 3 traffic class or the Layer 3 discard classes within a
        single direction or context, i.e., ingress or egress or device.
        This is to avoid double counting.

   5.   A frame accounted for at Layer 2 SHOULD NOT be accounted for at
        Layer 3 and vice versa.  An implementation MUST indicate which
        layers traffic and discards are counted against.  This is to
        avoid double counting.

PA: By what means MUST this be indicated?

JE: Great question.  I’m not sure whether there’s any precedent for this type 
of information.  We could add a template in comments in the model?

   6.   The aggregate Layer 2 and Layer 3 traffic and discard classes
        SHOULD account for all underlying frames or packets received,
        transmitted, and discarded across all other classes.

   7.   The aggregate QoS traffic and no-buffer discard classes MUST
        account for all underlying packets received, transmitted, and
        discarded across all other classes.

   8.   In addition to the Layer 2 and Layer 3 aggregate classes, an
        individual discarded packet MUST only account against a single
        error, policy, or no-buffer discard subclass.

   9.   When there are multiple reasons for discarding a packet, the
        ordering of discard class reporting MUST be defined.

PA: How and where should it be defined?

   10.  If Diffserv [RFC2475] is not used, no-buffer discards SHOULD be
        reported as class[id="0"], which represents the default class.

   11.  When traffic is mirrored, the discard metrics MUST account for
        the original traffic rather than the reflected traffic.

   12.  Traffic to the device control plane has its own class.  However,
        traffic from the device control plane MUST be accounted for in
        the same way as other egress traffic.

PA: Wouldn't it be useful to account for locally generated issues separately?
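
Requirements 8 and 9 above (one subclass per discarded packet, with a defined 
ordering when several reasons apply) can be illustrated with a small sketch.  
The precedence list below is hypothetical; the draft requires that an ordering 
be defined but does not prescribe this one.  The counter keys and function name 
are likewise assumptions.

```python
from collections import Counter

# Hypothetical precedence, highest first.  Not taken from the draft.
PRECEDENCE = [
    "discards/error/l2/rx/",
    "discards/error/l3/rx/",
    "discards/policy/",
    "discards/error/l3/no-route/",
    "discards/no-buffer/",
]

counters = Counter()


def account_discard(reasons):
    """Count one discarded packet against exactly one subclass and the aggregate."""
    applicable = [c for c in PRECEDENCE if c in reasons]
    # Assumption: anything not explicitly classified goes to the internal class.
    chosen = applicable[0] if applicable else "discards/error/internal/"
    counters[chosen] += 1
    counters["discards/"] += 1  # hypothetical key for the aggregate counter
    return chosen
```

Because each packet increments one subclass and the aggregate together, the 
subclass counters always sum to the aggregate, which is the property the 
double-counting requirements are protecting.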

6.2.  Data Model

   This section is modeled after the template described in Section 3.7
   of [I-D.ietf-netmod-rfc8407bis].

   The YANG module specified in Section 5.4 defines a data model that is
   designed to be accessed via YANG-based management protocols, such as
   NETCONF [RFC6241] and RESTCONF [RFC8040].  These YANG-based
   management protocols (1) have to use a secure transport layer (e.g.,
   SSH [RFC4252], TLS [RFC8446], and QUIC [RFC9000]) and (2) have to use
   mutual authentication.

PA: Is "have to use" a BCP 14 "MUST"?

JE: Med?

PA: Nit: something odd happens to the indentation below:

JE: ack

   control-plane, interfaces, and devices:  Access to these data nodes
      would reveal information about the attacks to which an element is
      subject, misconfigurations, etc.

      Also, an attacker who can inject packets can infer the efficiency
      of its attack by monitoring (the increase of) some discard
      counters (e.g., policy) and adjust its attack strategy
      accordingly.

Appendix C.  Implementation Experience

   This appendix captures practical insights gained from implementing
   this information model across multiple vendors' platforms, as
   guidance for future implementers.

   1.   The number and granularity of discard classes defined in the
        information model represent a compromise.  It aims to provide
        sufficient detail to enable appropriate automated actions while
        avoiding excessive detail, which may hinder quick problem
        identification.  Additionally, it helps to limit the quantity of
        data produced per interface, constraining the data volume and
        device CPU impacts.  While further granularity is possible, the
        defined schema has generally proven to be sufficient for the
        task of mitigating unintended packet loss.

   2.   There are many possible ways to define the discard
        classification tree.  For example, we could have used a multi-
        rooted tree, rooted in each protocol.  Instead, we opted to
        define a tree where protocol discards and causal discard classes
        are accounted for orthogonally.  This decision reduces the
        number of combinations of classes and has proven sufficient for
        determining mitigation actions.

   3.   NoBuffer discards can be realized differently with different
        memory architectures.  Whether a NoBuffer discard is attributed
        to ingress or egress can differ accordingly.  For successful
        auto-mitigation, discards due to egress interface congestion
        should be reported on egress, while discards due to device-level
        congestion (e.g. due to exceeding the device forwarding rate)
        should be reported on ingress.

PA: Are those BCP-14 SHOULDs?

JE: Yes – I’m thinking to move these to the requirements section

   9.   In cases where the reporting device is the source or destination
        of a tunnel, the ingress protocol for a packet may differ from
        the egress protocol (e.g., if IPv4 is tunneled over IPv6).  Some
        implementations may attribute egress discards to the ingress
        protocol.

PA: Attributing in this way could mislead diagnosis, so whether and when to do 
this should be discussed in more detail in the body of the document.

JE: I think it depends.  Our practical experience is that this is how most 
hardware works today and, as an operator, this hasn’t caused us issues.  Hence, 
I’m reluctant to mandate a change.  I think this is another case where the 
behaviour should be documented.  Will also add something to the requirements, 
i.e., a SHOULD, with a requirement to document it if not supported.

   10.  While the classification tree is seven layers deep, a minimal
        implementation may only implement the top six layers.

PA: Is that a BCP 14 "MAY" ?

JE: No – will change to ‘might’
