Hi Authors,
Thanks for working on this document. Thanks also to reviewers who have reviewed
this document and helped improve it.
Please find here some comments that I would like to discuss with the authors.
Hope they go towards improving the draft.
Section 1, paragraph 0
> Network platforms use Network Telemetry, such as YANG-Push [RFC8641],
> to continuously stream information, including both counters and state
> information.
Thanks to Med for performing the OPSDIR revew. I do not see a response from the
authors (at least on the mailing list). Have those points been addressed?
Thanks also to Joe for the shepherd writeup. He (and others) have brought up
the question or normative and non-normative YANG modules being mixed in the
draft. It should be captured in the shepherd's writeup. Regardless, the
combination of normative and non-normative in the normative section of the
document is problematic. All non-normative text should be moved to the Appendix.
Section 1, paragraph 0
> One part of this information is the actual collection period, as
> opposed to the configured collection period. On some platforms, that
> period can be adjusted automatically by the platform, for instance to
> reduce the load incurred by sending the telemetry. To later exploit
> the collected data, getting this actual collection period is crucial.
> This document defines a YANG model augmenting the YANG-Push model
> [RFC8641] to expose the actual collection period in Section 4.
If the server cannot satisfy the client's suggested collection period, should
it not just reject the session? Are you suggesting that servers can change the
collection period on the fly after a session is established? Should the server
not report that as an error or close the session if it cannot satisfy what was
negotiated as part of the session?
Section 1, paragraph 7
> To retrieve the context in which a particular piece of data was
> collected, three elements are necessary: the time of data emission,
> the originating platform and the subscription through which the data
> arrived. The approach described in this document delegates the time
> retrieval to the database storing the collected telemetry and
> focusing on providing a way to match a platform and a subscription
> identifier to the collection context. This is consistent with most
> of the YANG modules for devices, which focus on describing the
> current state of the device, rather than the evolution of that state
> through time.
It is not clear why the document delegates the time retrieval to the database
storing the collected telemetry. Is this the database at the source or the
database where the data is being collected?
Moreover, while the motivation for matching the platform id to the subscription
is clear, the rationale behind the statement following that is not clear. Can
this be explained differently, or dropped if it does not provide motivation for
the work within the draft?
Section 2, paragraph 2
> Platform manifest: part of the data manifest that completely
> characterizes the platform producing the telemetry information.
Isn't "*part* of the data manifest that *completely* characterizes” a
contradiction? Can this be reworded?
Section 2, paragraph 2
> Data collection manifest: part of the data manifest that completely
> characterizes how and when the telemetry information was metered.
Same here.
Section 3.1, paragraph 4
> From a collection parameters point of view, the data scientists
> analyzing the collected data must know whether the counter was
> requested from the network platform as on-change or at specific
> cadence [RFC8641]. Indeed, an on-change collection explains why
> there is a single value as opposed to a time series. In case of
> periodic collection, this exact cadence might not be observable in
> the time series. Indeed, this time series might report some values
> as 0 or might even omit some values. The reason for this behavior
> might be diverse: the network platform may be under stress, with a
> too small observation period, compared to the minimum-observed-
> period. Knowing the conditions under which a counter was collected
> and streamed (along with the platform details) helps drawing the
> informed conclusions. As an example, some platform might report a
> value of 0 for counters when the collection period is too short with
> respect to the capabilities of the platform. Without context, this
> value of 0 might lead to a wrong conclusion that the corresponding
> counter dropped to zero.
When I read the term "too short", I understand it as a limitation that prevents
the device from reading the value correctly. Is that what is meant in this
case? Which also brings up the question of why the collection period is defined
in centiseconds? What use case requires data to be collected in centiseconds or
even milliseconds?
In particular, I wonder how the reporting of 0 value for the counter is
different between the time period being too short, or when the platform wants
to report a count of 0, because the counter was not incremented?
Section 3.2, paragraph 0
> When a new device is onboarded, operators must check that the new
> device streams data (e.g., with YANG-Push), that the Network
> Telemetry data is the right one, that the data is correctly collected
> at the data collection, and finally that the data can be analyzed
> (compared with other similar devices). For the last point, the data
> manifest, which must be linked to the data up to the collection and
> analytics system, contains the relevant information.
What is "relevant information" in this case? The collector is the one
requesting the data. Therefore, it knows which device is supposed to send the
data. Is this a check to see that the name/id of the device that the collector
requested data from matches the name/id in the data manifest? If so, can that
be made more apparent?
Section 4, paragraph 5
> "This module augments ietf-subscribed-notification and
> ietf-yang-push with the current-period statistics reporting the
> actual collection period, as opposed to the configured one.
This module seems not to be adding current-period statistics. It is adding the
current collection period.
Section 4, paragraph 8
> description
> "Adds current period statistics";
Same here.
Section 5.1, paragraph 3
> The YANG module contains a list of platform manifests (in 'platforms/
> platform'), indexed by the identifier of the platform. That
> identifier should be defined by the network manager so that each
> platform emitting Network Telemetry has a unique identifier. There
> are several documents about managing the inventory of the network,
> e.g., [I-D.ietf-ivy-network-inventory-yang]. The platform identifier
> should be the same as the identifier used in inventories or the
> 'node-id' in [RFC8345]. As an example, the identifier could be the
> 'sysName' from [RFC3418]. The scope of the "ietf-platform-manifest"
> module is the scope of the data collection, i.e., a given network,
> therefore it contains a collection of platform manifests, as opposed
> to the device scope, which would contain a single platform manifest.
Thanks to Qiufang Ma for her YANG doctors' review. She makes a few points that
are worth considering, which I will reiterate later.
The last few statements are clear on the scope of the model, i.e., a given
network and not a device. If that is the case, I am curious to know how the
data nodes in this module get populated. If the idea is for the module to be
implemented both at a network level (as a list) and at a device level (as a
single entry in the list), the statement above on the scope needs to be
modified.
Thanks for recognizing that some of the nodes in this module should be the same
as the nodes/definitions in other modules, such as platform/id, name,
software-version, etc. If that is the case, should they not be leafref to nodes
in the network inventory module or the hardware module defined in RFC8345? How
are the nodes defined differently from the nodes defined in that module?
Section 5.2, paragraph 4
> "This module describes the platform information to be used as
> context of data collection from a given network element. The
> contents of this model must be streamed along with the data
> streamed from the network element so that the platform context
> of the data collection can be retrieved later.
Why later? Isn't the idea that the collector wants to correlate the data being
streamed to the platform manifest as it is being written to a time series
database or being analyzed? Should it not be collected as part of establishing
the session?
Section 5.2, paragraph 4
> The data content of this model should not change except on
> upgrade or patching of the device.
Thanks for bringing this up. If a device is "patched", by which I understand
its software is updated, why should the data content change? Or do you mean the
device is being replaced?
Section 5.2, paragraph 7
> leaf name {
> type string {
> length "1..1023";
> }
> description
> "Model of the platform from which data is collected.";
> }
Why constrain the length of the string? Most models have shied away from
putting any constraints on the name and leaving it to the implementation in
question. My suggestion would be to provide a rationale for the constraint or
remove it from this and all the nodes that have it.
Section 5.2, paragraph 7
> description
> "Top container including all platforms in scope. If this model
> is hosted on a single device, it should contain a single entry
> in the list. At the network level, it should contain an entry
> for every monitored platform.";
This description is what I alluded to above, and seems to contradict the
statement above, mainly:
The scope of the "ietf-platform-manifest" module is the scope of the data
collection, i.e., a given network, therefore it contains a collection of
platform manifests, as opposed to the device scope, which would contain a
single platform manifest.
If the idea is that this module needs to be implemented both at a network level
(as a list of platforms) AND at a device level (as a single entry in the list),
the above statement needs to be modified to make it clear.
Section 6, paragraph 1
> 6. Data Collection Manifest
>
> This section is non-normative.
Thanks to Qiufang Ma for her YANG doctors review and Joe who brings up the
point in his Shepherd review. Both bring up a valid point on this particular
module. If this section is not normative, then this section belongs in the
Appendix.
Section 13, paragraph 1
> * Do we want to handle the absence of values, i.e. add information
> about missed collection or errors in the collection context ? It
> could also explain why some values are missing. On the other
> hand, this might also be out scope. CLOSED: the goal of the
> manifest is to be able to detect miscollection a posteriori.
> Assurance of the metric collection is out of scope and could be
> done via an external mechanism such as SAIN.
Why is information on missed collection out of scope of this document? If the
system is allowed to change the collection period if it "too short", can it not
report an error when it is not able to because of it?
No reference entries found for these items, which were mentioned in the text:
[draft-opsarea-rfc5706bis].
Found terminology that should be reviewed for inclusivity; see
https://www.rfc-editor.org/part2/#inclusive_language for background and more
guidance:
* Terms "natively" and "native"; alternatives might be "built-in",
"fundamental", "ingrained", "intrinsic", "original"
-------------------------------------------------------------------------------
NIT
-------------------------------------------------------------------------------
All comments below are about very minor potential issues that you may choose to
address in some way - or ignore - as you see fit. Some were flagged by
automated tools (via https://github.com/larseggert/ietf-reviewtool), so there
will likely be some false positives. There is no need to let me know what you
did with these suggestions.
Section 4, paragraph 0
+ s/current collection/current collection period/
Section 2, paragraph 2
> Platform: equipment of the network able to produce telemetry data.
s/equipment of the network/equipment in the network/
Section 3.1, paragraph 2
> * This counter definition, typically as defined in the YANG model.
s/This counter/The counter/
Section 4, paragraph 0
> Some platforms will adjust the collection period depending on their
> capabilities and current load. The YANG module in this section
> augments the "ietf-subscribed-notification" module to provide the
> "current-period" leaf. The value of this leaf indicates the current
> collection which might be different from the configured collection
> period.
s/current collection/current collection period/
Mahesh Jethanandani
[email protected]
_______________________________________________
OPSAWG mailing list -- [email protected]
To unsubscribe send an email to [email protected]