Re: [Lsr] A review of draft-ietf-lsr-isis-ttz

Adrian Farrel Wed, 24 Feb 2021 06:33:28 -0800

Hey Huaimo,


Wow! What a lot of work on the new revision. Thanks for the effort and the
quick turn-around.

 

The only thing I am struggling with is the metrics in the node abstraction
case. I still can't see how the nodes outside the domain correctly compute
the paths across the virtual node.

 

You are saying, "Some of the routes may not be optimal after the
abstraction." Perhaps this is enough (after all, we don't achieve multi-AS
shortest path routing), but it seems a very big change in behavior compared
to how the area operated without the TTZ. The "suboptimality" may (will?)
attract traffic to the TTZ and will substantially change the balance of
traffic in the network.

 

This ought, at least, to come with advice to operators that they should
carefully reconsider all of their metrics after introducing a TTZ.

 

Cheers,

Adrian

 

 

From: Huaimo Chen <huaimo.c...@futurewei.com> 
Sent: 24 February 2021 02:02
To: 'lsr' <lsr@ietf.org>; adr...@olddog.co.uk
Cc: draft-ietf-lsr-isis-...@ietf.org
Subject: Re: A review of draft-ietf-lsr-isis-ttz

 

Hi Adrian, 

 

    Thank you very much for your valuable comments.

    My answers/explanations are inline below with prefix [HC].

 

Best Regards,

Huaimo on behalf of authors

 

 

From: Adrian Farrel <adr...@olddog.co.uk>
Sent: Saturday, February 13, 2021 3:34 PM
To: 'lsr' <lsr@ietf.org>
Cc: draft-ietf-lsr-isis-...@ietf.org
Subject: A review of draft-ietf-lsr-isis-ttz

 

Hi all,

 

Acee leant on me to do a review of this work (so blame him :-)

 

It's good to see this document adopted and progressing. Particularly

good to see the realistic compromise of making this Experimental.

 

I have a few comments, below.

 

Best,

Adrian

 

===

 

I have a largish issue with the fact that the document offers a choice

of how to aggregate the zone: virtual node or full mesh. Firstly, it is

not helpful to offer options without guidance about which option to pick

if you're an implementer or a deployer. You also need to specify whether

the choice MUST be a configuration option, and how to handle when some

nodes in the zone think one option and the others think the other

option.

[HC]: Added the advantages and disadvantages of the two choices to the
document, which may help an implementer or a deployer.

 

Possibly you can make this part of the experiment (see below for notes

on the experiment).

 

I have some pretty strong opinions on the idea of a single node

abstraction. The main challenge comes when there is a partial failure in

the zone such that the zone is partitioned (or the path between two

zone neighbors across the zone is severely degraded). It is not possible

to represent this in the node model since your only options are:

- drop the connection to a neighbor

- move to represent the zone as two nodes

[HC]: Resolving the partition of a zone is challenging. One possible
solution is that, when a zone is partitioned, it is abstracted as two
virtual nodes: the first part of the zone is abstracted as a first
virtual node, and the second part (which is disconnected from the first
part over zone links) is abstracted as a second virtual node.
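The two-virtual-node idea can be illustrated with a small sketch, assuming each connected component of the zone (reachable over zone links only) becomes its own virtual node. The node names reuse labels from Figure 1, but the link set and the failed link are hypothetical, and the sketch says nothing about how the draft would actually detect a partition or elect a leader per part.

```python
from collections import defaultdict, deque


def zone_partitions(zone_nodes, zone_links):
    """Split a zone into connected components using zone links only.

    Under the approach sketched above, each component would be abstracted
    as a separate virtual node. zone_links is an iterable of (a, b) pairs.
    """
    adjacency = defaultdict(set)
    for a, b in zone_links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    partitions = []
    for start in sorted(zone_nodes):
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:  # breadth-first search restricted to zone nodes
            node = queue.popleft()
            component.add(node)
            for nbr in adjacency[node]:
                if nbr in zone_nodes and nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        partitions.append(component)
    return partitions


# Hypothetical zone in which the R71-R67 link has failed.
parts = zone_partitions(
    {"R61", "R65", "R67", "R71"},
    [("R61", "R71"), ("R65", "R67")],
)
```

Here `parts` holds two components, so two virtual nodes would be advertised in place of one.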

 

In fact, both models (node and mesh) are subject to disruption when

there is a connectivity failure within the zone, but if we think about

the mesh model, it doesn't actually need to be advertised as a full

mesh: partial mesh is easily handled. Nevertheless, the use of a single

zone leader to perform the aggregation has problems if the zone is

partitioned in some way - perhaps this is addressed by the partitioned

zone simply electing two distinct leaders and declaring itself as two

zones.

[HC]: When a partial mesh is used, some of the routes may not be optimal
after a zone is abstracted as a partial mesh among the zone edges.
When a zone is abstracted as a full mesh of zone edges, the routes
remain unchanged: the routes that are optimal before the abstraction
are still optimal after it.
For the node model, a zone is abstracted as a single virtual node.
When there is a connectivity failure within the zone, the failure
is not seen from any node outside of the zone, and the routes computed
in any node outside of the zone do not change.
For the mesh model, a zone is abstracted as a full mesh of zone edges.
When there is a connectivity failure within the zone, some of the
routes will change, and the route changes are consistent.
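The claim that a full mesh keeps routes unchanged can be checked on a toy topology. This sketch assumes, as in the Section 4.3.1 text quoted below, that each mesh link carries the cost of the shortest intra-zone path between its two edge nodes. The topology, the outside routers X and Y, and all helper names are hypothetical.

```python
import heapq


def dijkstra(graph, src):
    """Shortest-path costs from src; graph maps node -> {neighbor: metric}."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, metric in graph.get(node, {}).items():
            nd = d + metric
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def add_link(graph, a, b, metric):
    graph.setdefault(a, {})[b] = metric
    graph.setdefault(b, {})[a] = metric


def full_mesh_abstraction(graph, zone, edge_nodes):
    """Replace the zone with a full mesh among its edge nodes.

    Each mesh link gets the metric of the shortest intra-zone path between
    its two edge nodes; links with both ends inside the zone are dropped.
    """
    zone_graph = {n: {m: w for m, w in graph[n].items() if m in zone} for n in zone}
    abstracted = {}
    for a, nbrs in graph.items():
        for b, w in nbrs.items():
            if not (a in zone and b in zone):
                add_link(abstracted, a, b, w)
    for a in edge_nodes:
        dist = dijkstra(zone_graph, a)
        for b in edge_nodes:
            if a != b and b in dist:
                add_link(abstracted, a, b, dist[b])
    return abstracted


# Hypothetical topology, all metrics 1: R15-R61-R71-R67-R31 through the
# zone, plus an outside path R15-X-Y-R31.
topo = {}
for a, b in [("R15", "R61"), ("R61", "R71"), ("R71", "R67"), ("R67", "R31"),
             ("R15", "X"), ("X", "Y"), ("Y", "R31")]:
    add_link(topo, a, b, 1)
meshed = full_mesh_abstraction(topo, {"R61", "R71", "R67"}, {"R61", "R67"})
```

Comparing `dijkstra(topo, s)` with `dijkstra(meshed, s)` for any outside node `s` gives the same costs to every other outside node, because any shortest path through the zone enters and leaves at edge nodes and is replaced by a mesh link of equal cost.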

 

This discussion of faults within the zone seems (to me) to be pretty

important.

 

I am also struggling with metrics and route computation when the zone is

viewed from outside the zone.  4.1.5 tells us about route computation,

but it is not until 4.3.1 that we discover:

   The

   metric to the neighbor is the metric of the shortest path to the edge

   node within the zone.

This text applies to the full mesh case, and we don't have anything

about the node model, so we might assume that the metrics on the edge

circuits are unchanged.

[HC]: Added forward pointers accordingly.
For the node model, the metrics seen by every node outside of the zone
are unchanged; every node inside the zone sees the metric of a link
outside of the zone as one order of magnitude larger than the metric
of a link inside the zone.

 

Obviously, this is important, and it feels that something is broken for

the virtual node case. Consider Figure 1.

 

Without the zone (and assuming link metrics of 1), the cost of the path

R15-R61-R71-R67-R31 is 4, and this route might not be preferred if some

other route R15-x-y-R31 exists with cost 3. However, once we have

introduced the zone using the virtual node approach, there is an

available route R15-Rz-R31 that appears to have a preferable metric of

2. I would say that the route R15-x-y-R31 should still be preferred.

[HC]: Added some text about this.

After a zone is abstracted as a single virtual node, some routes
will change, since a block of the area (the zone) becomes a single
node. Some of the routes may not be optimal after the abstraction.
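Adrian's example can be reproduced with a short sketch. It assumes all metrics are 1, that R61, R71, and R67 are the only zone nodes on the path, that x and y (written X and Y) are hypothetical routers outside the zone, and, following the reply above, that the node model leaves the metrics of links to zone neighbors unchanged. The virtual node name Rz follows the review; the helper names are illustrative.

```python
import heapq


def dijkstra(graph, src):
    """Shortest-path costs from src; graph maps node -> {neighbor: metric}."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, metric in graph.get(node, {}).items():
            nd = d + metric
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def add_link(graph, a, b, metric):
    graph.setdefault(a, {})[b] = metric
    graph.setdefault(b, {})[a] = metric


def node_abstraction(graph, zone, vnode="Rz"):
    """Collapse every zone node into one virtual node.

    Intra-zone links disappear; links to zone neighbors keep their metrics.
    If several circuits collapse onto the same pair, only the cheapest one
    matters for SPF, so that is the one kept.
    """
    abstracted = {}
    for a, nbrs in graph.items():
        for b, w in nbrs.items():
            if a in zone and b in zone:
                continue
            a2 = vnode if a in zone else a
            b2 = vnode if b in zone else b
            if w < abstracted.get(a2, {}).get(b2, float("inf")):
                add_link(abstracted, a2, b2, w)
    return abstracted


topo = {}
for a, b in [("R15", "R61"), ("R61", "R71"), ("R71", "R67"), ("R67", "R31"),
             ("R15", "X"), ("X", "Y"), ("Y", "R31")]:
    add_link(topo, a, b, 1)

before = dijkstra(topo, "R15")["R31"]  # 3, via R15-X-Y-R31
after = dijkstra(node_abstraction(topo, {"R61", "R71", "R67"}), "R15")["R31"]  # 2, via Rz
```

The drop from 3 to 2 is exactly the shift Adrian describes: traffic that previously avoided the zone is now attracted to R15-Rz-R31.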

 

This point certainly needs to be called out in the text, and maybe this

gives some input to the choice between models. Perhaps the metrics in

the ISN and ESN TLVs are related to this point, but section 4.2.1 gives

no hint about how to set these values. Actually, I suspect that what is

going on here is that all of the metrics advertised to outside the zone

are controlled by the zone leader and advertised in the ISN/ESN - but I

don't find that actually stated anywhere.

[HC]: Added some text about this.

The node model has a higher abstraction rate than the mesh model.
The mesh model does not scale when the number of edge nodes of a zone
is large.

The mesh model keeps the routes unchanged. After a zone is abstracted

as the full mesh of the edges of the zone, every route is still

optimal. The TLVs are not used to advertise anything inside a zone

to outside of the zone. They are used to indicate the zone links 

of a zone edge node and are used by the zone nodes.

For the node model, nothing inside a zone is advertised outside

except for some prefixes inside the zone. 

 

 

All this said, I find it notable that this document focusses almost

completely (sections 4 and 5 - section 4.3 is a very small section) on

the virtual node model. It would be good to provide an example like

Figure 2, but for the mesh model.

[HC]: Added an example with a figure accordingly.

 

Perhaps rather than deferring this to be an outcome of the experiment,

this document should spend some time comparing the two models *or* it

might even be time to abandon one of the models.

[HC]: Added the text comparing the two models.

 

---

 

Obviously, at some point before this goes forward for publication,

you'll need to reduce to no more than five front-page authors.

[HC]: Will reduce to five.

 

---

 

I think the Abstract might usefully mention IS-IS. Probably the first

sentence could read:

 

   This document specifies a topology-transparent zone in an IS-IS area.

[HC]: Updated the document according to your suggestion.

---

 

The document really needs a section to scope the Experiment.

[HC]: Added a section for this.

 

- How is the experiment kept separate and safe from the Internet or

  indeed from any non-participating routers?

[HC]: A new TLV (called Zone ID TLV) is defined for TTZ.

      Any router that does not support TTZ (i.e., a non-participating
      router) and is outside of a TTZ will ignore this TLV.

 

- What happens if the boundaries of the experiment are breached?

  (To expand on this, what happens if there is a misconfiguration so

   that a Zone Internal Node thinks its neighbor is also in the Zone

   when it is actually unaware of these extensions and should be

   treated as a Zone External Node? This misconfiguration has a node

   that should be a Zone Edge/Border Node acting as a Zone Internal

   Node.)

[HC]: When a zone (a block of an area not yet using TTZ) is
      misconfigured, it should not be transformed to a virtual node.
      A misconfiguration of a Zone Edge/Border Node as a Zone
      Internal Node can be detected automatically: every adjacent
      node of a Zone Internal Node is a zone node and has the same
      zone ID, so when a Zone Internal Node detects that one of its
      adjacent nodes is not a zone node, it should raise an alarm
      for the misconfiguration.
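The detection rule in this reply amounts to a per-neighbor check. The sketch below is hypothetical: the NodeView type, the integer zone IDs, and the alarm text are illustrative, and how a node learns a neighbor's zone ID (for example, from the Zone ID TLV in that neighbor's LSP) is left abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeView:
    """What a router knows about itself or a neighbor (hypothetical model)."""
    name: str
    zone_id: Optional[int]  # None: the node advertises no Zone ID TLV
    internal: bool = False  # True if configured as a Zone Internal Node


def misconfiguration_alarms(node, neighbors):
    """Every neighbor of a Zone Internal Node must be a node of the same zone."""
    if not node.internal:
        return []
    return [
        f"{node.name}: neighbor {nbr.name} is not in zone {node.zone_id}; "
        f"{node.name} may be a misconfigured Zone Edge/Border Node"
        for nbr in neighbors
        if nbr.zone_id != node.zone_id
    ]


r71 = NodeView("R71", zone_id=100, internal=True)
alarms = misconfiguration_alarms(r71, [NodeView("R61", 100), NodeView("R99", None)])
```

A neighbor with a different zone ID is flagged by the same comparison, which also covers part of the inconsistent-configuration case discussed under 6.1.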

 

- How is the success (or failure!) of the experiment assessed?

[HC]: Backward compatibility is verified and abstraction works as expected.
      Some critical misconfigurations should be detected and raise alarms.

 

- Are there plans to bring this back for consideration on the standards

  track if certain criteria are satisfied?

[HC]: We have a plan for using IS-IS TTZ, which may help.

 

- Is evaluation of the relative merits of node and mesh abstraction part

  of the experiment?

[HC]: The evaluation will focus on the node model.

 

---

 

Section 1

 

The WG may have established a different practice, but it used to be

normal to reference RFC 1195 alongside ISO 10589.  (You do have 1195

listed in the references section, but you don't actually reference it).

[HC]: Added reference to RFC 1195.

 

---

 

Section 1

 

   There are scalability issues in using areas as the number

   of routers in a network becomes larger and larger.

 

Maybe what you're trying to say in this section (and it is important

because it gives the whole motivation for this work) is that there are

scalability issues with a single IS-IS area as the number of routers in

the area grows. (You might explain what those issues are.)

[HC]: Added some details.

When an IS-IS area becomes larger, its convergence after a network event
such as a link failure takes longer. While the network is converging,
more of the traffic transported through the area is lost.

 

Then you can go on to say how splitting into multiple levels and having

multiple L1 areas mitigates the scaling issues. And then you can

continue with your text about why splitting an IS-IS system as it grows

can be hard.

[HC]: Added some details.  

Splitting an area needs careful planning and many configuration changes
across the network.

 

---

 

Section 2

 

   A Topology-Transparent Zone (TTZ) may be deployed to resolve some

   critical issues such as scalability in existing networks and future

   networks.

 

This sounds like you have a number of critical issues in mind, but you

only mention scalability. Are there others you can list, or should you

reduce this text to just...

 

   A Topology-Transparent Zone (TTZ) may be deployed to resolve the

   critical issue of scalability in existing and future networks.

[HC]: Updated the text as you suggested.

---

 

Section 2

 

   o  Abstracting a zone as a TTZ virtual entity, which is a single

      virtual node or zone edges' mesh, SHOULD be smooth with minimum

      service interruption.

 

I *think* you are talking about the transition from not using TTZ to

using TTZ, but it could be a lot clearer.

 

A forward pointer to 4.1.4 might be useful. And 4.1.4 really should

describe some of the processing governed by the OPS bits in 4.2.1.

[HC]: Updated the text accordingly.

 

---

 

Section 2

 

   o  De-abstracting (or say rolling back) a TTZ virtual entity to a

      zone SHOULD be smooth with minimum service interruption.

 

This is similarly unclear, and it sounds like you might be talking

about turning off a zone (i.e., moving all of the Zone Nodes into the

surrounding area and removing the zone), or you could be talking about

moving a single node from inside to outside the zone.

[HC]: Updated the text accordingly.

      Transforming (or rolling back) a TTZ virtual entity back to its
      zone (i.e., its original block of the network area not using TTZ;
      refer to Section 5.2) SHOULD be smooth with minimum service
      interruption.

 

---

 

Section 2

 

   o  Users SHOULD be able to easily set up an end-to-end service

      crossing TTZs.

 

I am not clear what a "service" is in this context. Assuming we're not

talking about TE extensions, isn't the service simply that the user

sends packets and they are routed by the network?

[HC]: Removed it.

 

---

 

Section 4

 

I think the start of this section needs to add a little about the limits

of a TTZ. In particular:

- Is a TTZ restricted to reside within a single level?

[HC]: One of the following must hold:
- all the nodes in a zone are L1 nodes, except that some zone edge
  nodes may be L1/L2 nodes;
- all the nodes in a zone are L2 nodes, except that some zone edge
  nodes may be L1/L2 nodes; or
- all the nodes in a zone are L1/L2 nodes.

 

- Is a TTZ restricted to lie within a single area?

[HC]: Yes.

 

- What happens if one of the zone nodes is an L1/L2 router?

[HC]: In this case, if the other zone nodes are L1 routers,

all the zone nodes are abstracted to be an L1 virtual node;

if the other zone nodes are L2 routers, 

all the zone nodes are abstracted to be an L2 virtual node.

 

  - Presumably, depending on the answer to the first question, this

    could only happen if the node in question is a zone edge/border node.

    But, even then it is complicated: does the abstracted node become an

    L1/L2 router?

[HC]: If all the zone nodes are L1/L2 routers, the abstracted

node becomes an L1/L2 router.
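The three level cases listed in these answers can be encoded in one function. This is a sketch of the rules as stated in the replies, not of any text in the draft; the input representation and the choice to reject any other mix with an error are assumptions.

```python
def virtual_node_level(zone_nodes):
    """Return the IS-IS level of the virtual node for a zone.

    zone_nodes maps a node name to (level, is_edge), where level is one of
    "L1", "L2", or "L1/L2". Encodes the three cases listed above.
    """
    if not zone_nodes:
        raise ValueError("empty zone")
    levels = {level for level, _ in zone_nodes.values()}
    if levels == {"L1/L2"}:
        return "L1/L2"
    for base in ("L1", "L2"):
        # Every node is at the base level, except edge nodes, which may be L1/L2.
        if all(level == base or (level == "L1/L2" and is_edge)
               for level, is_edge in zone_nodes.values()):
            return base
    raise ValueError("zone mixes levels in a way the listed cases do not allow")
```

For example, an L1 internal node with an L1/L2 edge node yields an L1 virtual node, while an L1/L2 internal node among L1 nodes is rejected.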

 

---

 

4.1

OLD

  Each of these links connects a zone neighbor.

NEW

  Each of these links connects to a zone neighbor.

END

[HC]: Changed the text as you suggested.

 

---

 

4.1

   The virtual node ID may be derived from the zone ID.

 

Maybe say how else it could be specified and how the implementer or

deployer makes this choice.

[HC]: Added more details.

 

---

 

A useful modification to Figures 1 and 2 would be to add a circuit from

R15 to R65 in Figure 1 and show how this becomes a second 'parallel'

circuit from R15 to Rz in Figure 2.

[HC]: Added the circuit from R15 to R65 as you suggested.

 

---

 

4.1.1

 

   A TTZ MUST hide the information inside the TTZ from the outside.  It

   MUST NOT directly distribute any internal information about the TTZ

   to a router outside of the TTZ.

 

   For instance, the TTZ in the figure above MUST NOT send the

   information about TTZ internal router R71 to any router outside of

   the TTZ in the routing domain; it MUST NOT send the information about

   the circuit between TTZ router R61 and R65 to any router outside of

   the TTZ.

 

These "for instance" examples are good in that they are true. But they

imply some things by omission, and I don't think you mean to make those

implications.

 

That is, the first paragraph is much clearer and definitive. But your

second paragraph, by calling out some special cases of "internal

information" makes it ambiguous whether, for example, the router R61 is

advertised outside the TTZ. (Of course, it isn't.)

 

It may be better to delete the second paragraph, and go straight to the

following paragraph that describes what is seen outside the TTZ by

directly describing what *is* advertised rather than providing a partial

list of what is not advertised.

[HC]: Removed the second paragraph accordingly.

 

---

 

I think that the subsections of 4.1 cover all of the necessary

information. My list of things to cover is:

- zone edge/border nodes form adjacencies with zone neighbor nodes using

  the identity of the aggregate zone node and not their own identities

[HC]: In Section 4.1.4.  Adjacency Establishment

 

- zone nodes continue to operate IS-IS as normal to advertise zone nodes

  and zone links within the zone

[HC]: In Section 4.4.1.  Advertisement of LSPs within Zone

 

- zone edge/border nodes do not advertise or readvertise LSPs that

  originated within the zone to neighbors outside the zone

[HC]: In Section 4.1.4.  Adjacency Establishment 

      In Section 4.4.1.  Advertisement of LSPs within Zone

 

- zone nodes continue to operate IS-IS as normal to re-advertise LSP

  that originated outside the zone

[HC]: In Section 4.1.4.  Adjacency Establishment 

      In Section 4.4.2.  Advertisement of LSPs through Zone

 

- the zone leader is responsible for deriving the aggregate node

  information that represents the node and for originating LSPs for this

  aggregate node

[HC]: In Section 4.1.3.  LS Generation for Zone as a Single Node

 

- zone nodes re-advertise LSPs originated by the zone leader on behalf

  of the aggregate zone node on all circuits including those that

  connect to zone neighbor nodes

[HC]: In Section 4.1.3.  LS Generation for Zone as a Single Node

 

- when a zone edge/border node readvertises the LSPs for the aggregate

  zone node, it does so as if it had originated the LSP

[HC]: In Section 4.1.4.  Adjacency Establishment 

 

- when any zone edge/border node receives an LSP that reports itself as

  originating from the aggregate zone node, the edge/border node

  suppresses the LSP

[HC]: In Section 4.1.4.  Adjacency Establishment 

      In Section 5.1.  Transfer Zone to a Single Node

 

- zone nodes do not install routing state resulting from advertisements

  of LSPs describing the aggregate zone node

[HC]: In Section 4.1.5.  Computation of Routes

 

As I say, I think you have all this in the subsections of 4.1, but I had

to hunt around to find all of this. It might be helpful to give a clear

summary of the behaviors.

[HC]: Added a summary of these behaviors with forward pointers.

 

---

 

4.1.2

 

   The leader election mechanism described in

   [I-D.ietf-lsr-dynamic-flooding] may be used to elect the leader for

   the zone.

 

"may be used" or "are used"?

[HC]: Changed it accordingly.

 

---

 

4.1.2

 

   Somewhere you need to cover what happens if the zone leader fails

   but the zone remains otherwise fully connected. Does the new leader

   start from scratch, or does it try to retain the zone ID etc.?

[HC]: Added the text below:

    When the existing zone leader fails, a new zone leader is elected.

    The new leader originates the LSPs for the virtual node based

    on the LSPs received from the failed leader. It retains the 

    System ID of each LSP ID and the live adjacencies between

    the virtual node and the zone neighbors.
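The failover behavior can be sketched as follows. The election rule assumes the Area Leader rule from [I-D.ietf-lsr-dynamic-flooding] (highest priority, ties broken by the highest System ID); the Candidate and VirtualNodeLsp types, the priorities, and the virtual node System ID "0000.0000.0100" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    system_id: str  # e.g. "0000.0000.0061"
    priority: int


def system_id_value(system_id):
    return int(system_id.replace(".", ""), 16)


def elect_leader(candidates):
    """Highest priority wins; ties go to the highest System ID (assumed rule)."""
    return max(candidates, key=lambda c: (c.priority, system_id_value(c.system_id)))


@dataclass
class VirtualNodeLsp:
    system_id: str  # virtual node System ID, kept across leader changes
    neighbors: dict  # zone neighbor -> metric


def take_over(previous, live_neighbors):
    """New leader re-originates from the failed leader's LSP content,
    keeping the System ID and only the adjacencies that are still live."""
    return VirtualNodeLsp(
        system_id=previous.system_id,
        neighbors={n: m for n, m in previous.neighbors.items() if n in live_neighbors},
    )


zone = [Candidate("0000.0000.0061", 10), Candidate("0000.0000.0067", 10),
        Candidate("0000.0000.0071", 5)]
leader = elect_leader(zone)  # same priority as ...0061, higher System ID
new_leader = elect_leader([c for c in zone if c != leader])
lsp = VirtualNodeLsp("0000.0000.0100", {"R15": 1, "R31": 1})
lsp_after = take_over(lsp, live_neighbors={"R15", "R31"})
```

Because the System ID is carried over, routers outside the zone see the same virtual node before and after the leader change, which is presumably the point of retaining it.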

 

---

 

4.1.4 attempts to do two things:

- describe the migration from not-a-zone to the use of a zone

- describe the steady state zone behavior

I think it would be helpful to split these out into separate sections.

In particular, the migration from not-a-zone to zone is only needed in

operational networks.

[HC]: Split these into two separate sections.

 

---

 

4.2

 

   The following TLV is defined in IS-IS.

 

I think...

 

   This document defines a new TLV for use in IS-IS as follows.

[HC]: Used the text as you suggested.

 

---

 

4.2.1

 

   The format of IS-IS Zone ID TLV is illustrated below.  It may be

   added into an LSP for a zone node.

 

s/may/MUST/

[HC]: Changed "may" to "MUST".

 

---

 

4.2.1

 

   If every link of a zone edge node is a zone link

 

Doesn't that mean that the zone edge node is not a zone edge node?

[HC]: Removed the related text.

 

---

 

4.2.1

 

To be honest, I found the description of the processing governed by

the OPS bits to be pretty complicated.

 

I would recommend adding a new section (related to 4.1.4) that talks

through the process in clear steps. Then this text can just list the

meanings of the bits and point back to the process description.

 

Maybe this is what sections 5, 6.2, and 6.3 are for, in which case cut

down the explanation here and provide forward pointers.

[HC]: Cut down the text here and added forward pointers.

 

---

 

Figures 4 and 5. I think you have defined the types for these two

sub-TLVs (1 and 2).

[HC]: Changed them accordingly.

 

---

 

4.2.1

 

I wonder how many neighbors a zone might have. It could be a fairly big

number, I suspect, although obviously it depends on how the operator

decides to chop up the area into zones (for which I don't find any

guidance).

 

The size of the ISN and ESN would appear to be a function of the number

of neighbors times (IDlength+3). Is there a practical constraint on the

size of the TLVs which places a limit on the number of neighbors that a

zone can have? This would be an important design consideration for the

operator. Maybe it is another feature for experimentation.

[HC]: The number of zone neighbors may be large. When a zone is abstracted
as a single virtual node, all these zone neighbors are put into one or
more Extended IS Reachability TLVs in the LSPs for the virtual node,
which are originated by the zone leader.
One TLV can hold 20+ neighbors, so ten TLVs in two LSPs can hold 200+
neighbors.
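The 20+ and 200+ figures follow from IS-IS size limits, and a little arithmetic shows where they come from. This sketch assumes Extended IS Reachability entries carry no sub-TLVs and assumes the common 1492-octet originating LSP buffer; real LSPs also carry other TLVs, so actual capacity per LSP is somewhat lower.

```python
MAX_TLV_VALUE = 255  # IS-IS TLV length field is one octet
LSP_HEADER = 27      # 8-octet common header plus 19 octets of LSP fields
LSP_BUFFER = 1492    # assumed originating LSP buffer size

# Extended IS Reachability (TLV 22) entry without sub-TLVs: 7-octet
# neighbor ID (System ID + pseudonode), 3-octet metric, 1-octet sub-TLV length.
EXT_IS_ENTRY = 7 + 3 + 1


def entries_per_tlv(entry_size):
    return MAX_TLV_VALUE // entry_size


def full_tlvs_per_lsp():
    # Each full TLV costs a type octet, a length octet, and 255 value octets.
    return (LSP_BUFFER - LSP_HEADER) // (2 + MAX_TLV_VALUE)


def tlvs_needed(neighbors, entry_size=EXT_IS_ENTRY):
    per_tlv = entries_per_tlv(entry_size)
    return -(-neighbors // per_tlv)  # ceiling division
```

This gives 23 neighbors per TLV and five full TLVs per LSP, so ten TLVs in two LSPs hold 230 neighbors. Applying Adrian's (IDlength+3) estimate for the ISN/ESN with a 6-octet ID, `entries_per_tlv(6 + 3)` gives 28 entries per 255 octets, ignoring any fixed fields of the Zone ID TLV itself.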

 

---

 

4.2.1

 

The same neighbour may have two links to the zone and not necessarily

through the same edge/border node (see my previous point). In this case,

might the different links have different metrics? I think so, but I

don't see how that is encoded in the sub-TLVs.

[HC]: This may follow normal implementation practice. By default, only
the link with the lower metric is included in the LSPs originated for
the virtual node.
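The default described in this reply can be sketched as follows, treating the R15-R61 and R15-R65 circuits discussed earlier as a hypothetical parallel pair with made-up metrics. The tie-breaking rule (keep the first circuit seen) is an assumption, not something the draft specifies.

```python
def select_neighbor_links(links):
    """Keep one circuit per zone neighbor: the one with the lowest metric.

    links: iterable of (zone_edge_node, zone_neighbor, metric).
    Returns {zone_neighbor: (zone_edge_node, metric)}; on a tie the first
    circuit seen is kept.
    """
    best = {}
    for edge_node, neighbor, metric in links:
        if neighbor not in best or metric < best[neighbor][1]:
            best[neighbor] = (edge_node, metric)
    return best


advertised = select_neighbor_links([
    ("R61", "R15", 10),  # hypothetical metrics
    ("R65", "R15", 5),
    ("R67", "R31", 1),
])
```

One consequence worth noting: the metric of the discarded circuit is not visible outside the zone at all, which bears directly on Adrian's question about encoding per-link metrics.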

 

---

 

6.1

 

There is probably something to be said about what happens if the

configuration of the zone ID is not consistent across the zone. Is it

as simple as you ending up with two zones?

[HC]: When the configuration of the zone ID is not consistent across 

      the zone, some unexpected results will be generated.

      For example, when two different zone IDs are configured 

      for the zone, two virtual nodes for two zones may be seen

      in the network. These are not expected. Once the unexpected

      results are seen, the inconsistent configurations MUST be fixed.

 

What is the scope of uniqueness of the zone ID? I think it only has to be

unique in the zone and with the neighbors. Obviously there are ways to

make this safe (such as area or global uniqueness). What are the

constraints?

[HC]: Added some constraints.

A zone ID MUST be unique in an AS. It MUST NOT be any IP address
in the AS from which a System ID is derived and used.
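The two constraints can be checked mechanically. This sketch assumes a 32-bit zone ID written in dotted-quad form and assumes the widespread operator convention of deriving a System ID from an IPv4 address by zero-padding its octets; neither assumption comes from the draft. Under that convention, presumably the reason for the second constraint is that a virtual node ID derived from the zone ID would otherwise collide with a real router's System ID.

```python
import ipaddress


def system_id_from_ipv4(address):
    """Common convention: zero-pad each octet to three digits and regroup
    the twelve digits as a System ID, e.g. 10.0.0.1 -> 0100.0000.0001."""
    octets = str(ipaddress.IPv4Address(address)).split(".")
    digits = "".join(f"{int(o):03d}" for o in octets)
    return f"{digits[0:4]}.{digits[4:8]}.{digits[8:12]}"


def zone_id_problems(zone_id, other_zone_ids, system_id_source_addresses):
    """Check a dotted-quad zone ID against the two constraints stated above."""
    zid = ipaddress.IPv4Address(zone_id)
    problems = []
    if zid in {ipaddress.IPv4Address(z) for z in other_zone_ids}:
        problems.append("zone ID is already used by another zone in the AS")
    if zid in {ipaddress.IPv4Address(a) for a in system_id_source_addresses}:
        problems.append(
            f"zone ID equals an address from which System ID "
            f"{system_id_from_ipv4(str(zid))} is derived")
    return problems
```

For example, `zone_id_problems("10.0.0.1", [], ["10.0.0.1"])` reports one problem naming System ID 0100.0000.0001.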

 

---

 

6.2

 

   When receiving

   the command, the node distributes it to every zone node.

 

Is this in the management plane or in IS-IS? I can see how it could be

in IS-IS if the configured node is the zone leader and it just starts

sending the zone TLV and all of the edge nodes are identified in

sub-TLVs such that a receiving node is either an edge or an internal

node. But I don't see how it works if the configured node is just some

internal or edge node and the leader has to be elected.

 

Similarly...

   If automatic transferring zone to node is enabled, the user does not

   need to issue the command.  A zone node, such as the zone leader,

   will distribute the "command" to every zone node after determining

   that the configuration of the zone has been finished.

...what is the command and how is it distributed?

 

Same sort of issues in 6.3

[HC]: Updated the related text and referred to Section 5.1.

 

---

 

Section 7 is a bit suspect! What would happen if a zone TLV was sent by

a compromised router or added to an LSP by a mid-wire attacker? I would

be sympathetic to you saying that if an attacker can do either of these

things then there are many far worse things they can do, but I think you

should call out:

- what sort of attacks are possible

- what damage they might do

- how these attacks might be detected

- what protections are available (references would be enough)

[HC]: Added text for this.

 

---

 

Section 8

 

   Under the registry name "IS-IS TLV Codepoints", IANA is requested to

   assign a new registry type for Zone ID as follows:

 

I think...

 

   IANA is requested to make a new allocation in the "IS-IS TLV

   Codepoint Registry" under the registry name "IS-IS TLV Codepoints"

   as follows:

[HC]: Updated the text as you suggested.

 

---

 

Section 8

 

I recommend you tell IANA whether you want the new TLV type to be less

than or greater than 255.

[HC]: Added some text for this.

 

---

 

Section 8

 

   IANA is requested to create a new sub-registry "Adjacent Node ID Sub-

   TLVs" on the IANA IS-IS TLV Codepoints web page as follows:

 

I recommend you call the new sub-registry "Sub-TLVs for TLV type TBD1

(Zone ID TLV)"

[HC]: Updated the text accordingly.

_______________________________________________
Lsr mailing list
Lsr@ietf.org
https://www.ietf.org/mailman/listinfo/lsr
