[OPSAWG] AD Review of draft-ietf-opsawg-service-assurance-architecture-09

Rob Wilton (rwilton) Mon, 17 Oct 2022 05:00:00 -0700

Hi authors,

Here is my AD review of draft-ietf-opsawg-service-assurance-architecture-09.  I 
would like to thank you and the WG for this document.  I believe that this 
architecture document, and the corresponding YANG document, offer a good 
flexible basis to help with the full lifecycle monitoring of deployed services.


Here are my comments which may help improve the document.


Minor level comments:

(1) p 3, sec 1.  Introduction

   The assurance graph of a service is decomposed into components, which
   are then assured independently.  The root of the assurance graph
   represents the service to assure, and its children represent
   components identified as its direct dependencies; each component can
   have dependencies as well.  Components involved in the assurance
   graph of a service are called subservices.  The SAIN orchestrator
   updates automatically the assurance graph when services are modified.

I was wondering if you meant services or subservices in the last sentence?


(2) p 3, sec 1.  Introduction

   When a service is degraded, the SAIN architecture will highlight
   where in the assurance service graph to look, as opposed to going hop
   by hop to troubleshoot the issue.  More precisely, the SAIN
   architecture will associate to each service a list of symptoms
   originating from specific subservices, corresponding to components of
   the network.  These components are good candidates for explaining the
   source of a service degradation.  Not only can this architecture help
   to correlate service degradation with network root cause/symptoms,
   but it can deduce from the assurance graph the number and type of
   services impacted by a component degradation/failure.  This added
   value informs the operational team where to focus its attention for
   maximum return.  Indeed, the operational team should focus his
   priority on the degrading/failing components impacting the highest
   number customers, especially the ones with the SLA contracts
   involving penalties in case of failure.

Rather than "should focus", perhaps "may focus" or "are likely to focus".  
Also, his => their, number customers -> number of customers
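
As an illustration of the "number and type of services impacted" deduction in the quoted text, here is a minimal sketch that walks a toy assurance graph upwards from a failing component.  The node names, the "service/" naming convention, and the function name are all assumptions made for this example; none of them come from the draft or the YANG module.

```python
from collections import deque

# Hypothetical assurance graph: each node maps to its direct dependencies.
# Node names are invented for illustration; they do not come from the draft.
DEPENDS_ON = {
    "service/vpn-a": ["subservice/tunnel-1"],
    "service/vpn-b": ["subservice/tunnel-1", "subservice/tunnel-2"],
    "subservice/tunnel-1": ["subservice/interface-x"],
    "subservice/tunnel-2": ["subservice/interface-y"],
    "subservice/interface-x": [],
    "subservice/interface-y": [],
}


def impacted_services(depends_on, component):
    """Return the set of service roots that transitively depend on component."""
    # Invert the edges so we can walk from a component up towards the roots.
    parents = {node: [] for node in depends_on}
    for node, deps in depends_on.items():
        for dep in deps:
            parents.setdefault(dep, []).append(node)

    # Breadth-first walk over the inverted edges.
    seen = {component}
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    # Assumption for this sketch: service roots are named "service/...".
    return {node for node in seen if node.startswith("service/")}
```

In this toy graph a failure of "subservice/interface-x" reaches both services, while a failure of "subservice/interface-y" reaches only "service/vpn-b", which is the kind of ranking the operational team could use to prioritise.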


(3) p 4, sec 2.  Terminology

   SAIN agent: A functional component that communicates with a device, a
   set of devices, or another agent to build an expression graph from a
   received assurance graph and perform the corresponding computation of
   the health status and symptoms.

Perhaps consider stating that the SAIN agent could run directly on the 
device?  Although I noted that this is described later in the document anyway.


(4) p 5, sec 2.  Terminology

   Metric: An information retrieved from the network running the assured
   service.

Suggest "An item of information retrieved", or "A piece of data retrieved".


(5) p 5, sec 2.  Terminology

   Health score: Integer ranging from 0 to 100 indicating the health of
   a subservice.  A score of 0 means that the subservice is broken, a
   score of 100 means that the subservice in question is operating as
   expected.

I noted that neither the architecture nor the YANG talk about mapping discrete 
properties (e.g., interface up/down). I would intuitively think that these 
would map to a value of 100 and 0 respectively.  Would it be helpful to add any 
text to describe how binary properties are handled?
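
To make the question concrete, here is a minimal sketch of the intuitive mapping described above.  It is only an assumption about how a binary property might be handled: the function name and the "up"/"down" strings are hypothetical, and neither the architecture nor the YANG module defines this mapping.

```python
# Bounds of the health score range quoted from the Terminology section.
MIN_SCORE = 0
MAX_SCORE = 100


def binary_health_score(oper_status: str) -> int:
    """Map a two-state property (e.g., interface up/down) to a health score.

    Assumption for this sketch: "up" maps to the maximal score and "down"
    to the minimal score.  Any other value is rejected rather than guessed.
    """
    if oper_status == "up":
        return MAX_SCORE
    if oper_status == "down":
        return MIN_SCORE
    raise ValueError(f"not a binary status: {oper_status!r}")
```

Raising on unknown values is a design choice for the sketch only; a real mapping would need to say what happens for states that are neither clearly up nor clearly down.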


(6) p 6, sec 3.  A Functional Architecture

   The goal of SAIN is to assure that service instances are operating as
   expected (i.e. the observed service is matching the expected service)
   and if not, to pinpoint what is wrong.  More precisely, SAIN computes
   a score for each service instance and outputs symptoms explaining
   that score.  Symptoms explain the score.  The only valid situation
   where no symptoms are returned is when the score is maximal,
   indicating that no issues where detected for that service.  The score
   augmented with the symptoms is called the health status.

"Symptoms explain the score." seems to be a duplicate and can be removed.


(7) p 6, sec 3.  A Functional Architecture

   The SAIN architecture is a generic architecture, applicable to
   multiple environments (e.g. wireline, wireless), but also different
   domains (e.g. 5G network function virtualization (NFV) domain with a
   virtual infrastructure manager (VIM)), etc.  And as already noted,
   for physical or virtual devices, as well as virtual functions.
   Thanks to the distributed graph design principle, graphs from
   different environments/orchestrator can be combined together.

Perhaps: "combined together" -> "combined together for a given service".


(8) p 7, sec 3.  A Functional Architecture

          +-----------------+
          | Service         |
          | Orchestrator    |<--------------------+
          |                 |                     |
          +-----------------+                     |
             |            ^                       |
             |            | Network               |
             |            | Service               | Feedback
             |            | Instance              | Loop
             |            | Configuration         |
             |            |                       |
             |            V                       |
             |        +-----------------+       +-------------------+
             |        | SAIN            |       | SAIN              |
             |        | Orchestrator    |       | Collector         |
             |        +-----------------+       +-------------------+
             |            |                        ^
             |           Y| Configuration          | Health Status
             |            | (assurance graph)     Y| (Score + Symptoms)
             |            V                        | Streamed
             |     +-------------------+           | via Telemetry
             |     |+-------------------+          |
             |     ||+-------------------+         |
             |     +|| SAIN              |---------+
             |      +| agent             |
             |       +-------------------+
             |               ^ ^ ^
             |               | | |
             |               | | |  Metric Collection
             V               V V V
         +-------------------------------------------------------------+
         |           Network System                                    |
         |                                                             |
         +-------------------------------------------------------------+

I was slightly surprised that there is no line/arrow flowing from the SAIN 
Collector to the SAIN Orchestrator.  Is it ever possible that the assurance 
graph could change dynamically?  E.g., on moving to a backup path for a 
tunnel, could different subservices then be added into the assurance graph?  
Or would the expectation always be that these are known and set up 
statically?


(9) p 8, sec 3.1.  Inferring a Service Instance Configuration into an Assurance 
Graph

   *  Subservices that are common to several service instances are
      reused for reducing the amount of computation needed.

Is a subservice allowed to be decomposed into further subservices?


(10) p 15, sec 3.2.  Intent and Assurance Graph

   *  Capturing the intent would start by detecting that the service
      instance is actually a tunnel between the two devices, and stating
      that this tunnel must be functional.  This solution is minimally
      invasive as it does not require to modify nor know the service
      model.  If the service model or network model is known by the SAIN
      orchestrator, it can be used to further capture the intent and
      include more information such as SLO.  For instance, the latency
      and bandwidth requirements for the tunnel, if present in the
      service model

SLO.  What does this expand to?


(11) p 16, sec 3.4.  Building the Expression Graph from the Assurance Graph

      Impacting Dependency: Type of dependency whose score impacts the
      score of its parent subservice or service instance(s) in the
      assurance graph.  The symptoms are taken into account in the
      parent service instance or subservice instance(s), as the
      impacting reasons.

If a VPN service had a prescribed primary and backup path, and a node failed on 
the primary path, then would you expect that to cause the health score for the 
service to decrease (even if there is no client visible impact)?
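
The answer depends on how scores from impacting dependencies are combined, which the quoted definition does not spell out.  The sketch below contrasts two possibilities.  Both aggregation rules (taking the minimum over impacting dependencies, and taking the maximum within a redundant primary/backup group) are assumptions made for illustration, and the function names are hypothetical; they are not taken from the draft.

```python
def parent_score(own_score, impacting_scores):
    """Assumed rule: a parent is only as healthy as its worst impacting dependency."""
    return min([own_score, *impacting_scores])


def redundant_group_score(path_scores):
    """Assumed rule: a primary/backup group is as healthy as its best path."""
    return max(path_scores)


# Scenario from the question: a node on the primary path has failed,
# the backup path is healthy, and the service itself sees no impact.
primary, backup = 0, 100

# Option A: both paths are modelled as plain impacting dependencies.
score_without_redundancy = parent_score(100, [primary, backup])

# Option B: the two paths are first combined as a redundant group.
score_with_redundancy = parent_score(100, [redundant_group_score([primary, backup])])
```

Under option A the service score drops to the primary path's score even though clients see no impact; under option B it stays at the backup path's score.  Which of these (or something in between) is intended is the point of the question.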


(12) p 19, sec 3.7.  Flexible Functional Architecture

   The SAIN architecture is flexible in terms of components.  While the
   SAIN architecture in Figure 1 makes a distinction between two
   components, the SAIN configuration orchestrator and the SAIN
   orchestrator,

Should this be SAIN orchestrator and the SAIN collector?



Nit level comments:

(13) p 1, sec 1.  Introduction

   Service orchestrators use Network service YANG modules that will
   infer network-wide configuration and, therefore the invocation of the
   appropriate device modules (Section 3 of [RFC8969]).  Knowing that a
   configuration is applied doesn't imply that the service is up and
   running as expected.  For instance, the service might be degraded
   because of a failure in the network, the experience quality is
   distorted, or a service function may be reachable at the IP level but

Suggest "the service quality may be degraded" instead of "the experience 
quality is distorted".


(14) p 2, sec 1.  Introduction

   Service orchestrators use Network service YANG modules that will
   infer network-wide configuration and, therefore the invocation of the
   appropriate device modules (Section 3 of [RFC8969]).  Knowing that a
   configuration is applied doesn't imply that the service is up and
   running as expected.  For instance, the service might be degraded
   because of a failure in the network, the experience quality is
   distorted, or a service function may be reachable at the IP level but
   does not provide its intended function.  Thus, the network operator
   must monitor the service operational data at the same time as the
   configuration (Section 3.3 of [RFC8969]).  To feed that task, the
   industry has been standardizing on telemetry to push network element
   performance information.

Suggest: service operational data -> service's operational data.


(15) p 2, sec 1.  Introduction

   A network administrator needs to monitor their network and services
   as a whole, independently of the management protocols.  With
   different protocols come different data models, and different ways to
   model the same type of information.  When network administrators deal
   with multiple management protocols, the network management entities
   have to perform the difficult and time-consuming job of mapping data
   models: e.g. the model used for configuration with the model used for
   monitoring when separate models or protocols are used.  This problem
   is compounded by a large, disparate set of data sources (MIB modules,
   YANG models [RFC7950], IPFIX information elements [RFC7011], syslog
   plain text [RFC5424], TACACS+ [RFC8907], RADIUS [RFC2865], etc.).  In
   order to avoid this data model mapping, the industry converged on
   model-driven telemetry to stream the service operational data,
   reusing the YANG models used for configuration.  Model-driven
   telemetry greatly facilitates the notion of closed-loop automation
   whereby events/status from the network drive remediation changes back
   into the network.

e.g. => e.g.,  I also suggest: "whereby events/status from the" -> "whereby 
events and updated operational state streamed from the"


(16) p 6, sec 3.  A Functional Architecture

   The goal of SAIN is to assure that service instances are operating as
   expected (i.e. the observed service is matching the expected service)
   and if not, to pinpoint what is wrong.  More precisely, SAIN computes
   a score for each service instance and outputs symptoms explaining
   that score.  Symptoms explain the score.  The only valid situation
   where no symptoms are returned is when the score is maximal,
   indicating that no issues where detected for that service.  The score
   augmented with the symptoms is called the health status.

Nit, i.e. => i.e.,


(17) p 8, sec 3.1.  Inferring a Service Instance Configuration into an 
Assurance Graph

Possibly "Translating" rather than "Inferring"?


(18) p 20, sec 3.7.  Flexible Functional Architecture

   And finally, the SAIN architecture is flexible in terms of what it
   monitors.  Most, if not all examples, in this document refer to
   physical components but this is not a constrain.  Indeed, the
   assurance of virtual components would follow the same principles and
   an assurance graph composed of virtualized components (or a mix of
   virtualized and physical ones) is well possible within this
   architecture.

constrain -> constraint. well possible -> very possible.



Grammar nits from an automated tool; some may not be valid ...

Spellings:
under-maintenace,

Grammar Warnings:
Section: 1, draft text:
However some already defined services might have been designed using a 
different approach. 
Warning:  Did you forget a comma after a conjunctive/linking adverb?
Suggested change:  "However,"

Section: 2, draft text:
 SAIN collector: A functional component that fetches or receives the 
computer-consumable output of the SAIN agent(s) and process it locally 
(including displaying it in a user friendly form). 
Warning:  This word is normally spelled with hyphen.
Suggested change:  "user-friendly"

Section: 3, draft text:
Thanks to the distributed graph design principle, graphs from different 
environments/orchestrator can be combined together. 
Warning:  'combined together' is redundant. Use combined
Suggested change:  "combined"

Section: 3.2, draft text:
This solution is minimally invasive as it does not require to modify nor know 
the service model. 
Warning:  Did you mean modifying? Or maybe you should add a pronoun? In active 
voice, 'require' + 'to' takes an object, usually a pronoun.
Suggested change:  "modifying"

Section: 3.4, draft text:
 Subservices shall be not be dependent on the protocol used to retrieve the 
metrics. 
Warning:  Consider using a past participle here: been.
Suggested change:  "been"

Section: 3.4, draft text:
 In order to keep subservices independent from metric collection method, or, 
expressed differently, to support multiple combinations of platforms, OSes, and 
even vendors, the architecture introduces the concept of "metric engine". 
Warning:  The usual collocation for "independent" is "of" not "from". Did you 
mean independent of?
Suggested change:  "independent of"

Section: 3.7, draft text:
Examples includes a DHCP server on a Linux server, a data plane, an IPFIX 
export, etc. 
Warning:  Possible agreement error.
Suggested change:  "Example includes"

Section: 3.7, draft text:
Exactly like a DHCP server/data plane/IPFIX export can be considered as 
subservices for a device, exactly like a routing instance can be considered as 
a subservice for a L3VPN, exactly like a tunnel can considered as a subservice 
for an application in the cloud. 
Warning:  The modal verb 'can' requires the verb's base form.
Suggested change:  "consider"

Section: 3.7, draft text:
Most, if not all examples, in this document refer to physical components but 
this is not a constrain. 
Warning:  The word 'constrain' is a verb. Did you mean the noun constraint?
Suggested change:  "constraint"

Section: 3.8, draft text:
 Therefore, the SAIN agent contains a YANG object specifying the date and time 
at which the symptoms history starts for the subservice instances. 
Warning:  Apostrophe might be missing.
Suggested change:  "symptoms'"

Section: 3.9, draft text:
Therefore an assurance graph version must be maintained, along with the date 
and time of its last generation. 
Warning:  Did you forget a comma after a conjunctive/linking adverb?
Suggested change:  "Therefore,"

Section: 4, draft text:
 If a closed loop system relies on this architecture then the well known issue 
of those system also applies, i.e., a lying device or compromised agent could 
trigger partial reconfiguration of the service or network. 
Warning:  Did you mean this system or those systems?
Suggested change:  "this system"

Section: 4, draft text:
The SAIN architecture neither augments or reduces this risk. 
Warning:  Use nor with neither.
Suggested change:  "nor"

Regards,
Rob
