Re: [Lsr] Questions on draft-white-lsr-distoptflood

Les Ginsberg (ginsberg) Mon, 28 Nov 2022 00:40:12 -0800

Tony –

In the interest of brevity, I am not going to respond in detail to each of your 
points. My reply focuses on two things.


1) You can successfully deploy this algorithm in the presence of nodes which do
NOT support this algorithm. But you cannot successfully deploy this algorithm 
in the presence of nodes which enable a different flooding reduction algorithm.
Given that this is not the only flooding reduction algorithm which has already 
been proposed – nor is it likely the last one to be proposed – it would seem 
advantageous and prudent to provide a means for nodes to know what algorithm is 
in use and ensure that multiple algorithms are not enabled simultaneously – 
which is what draft-ietf-lsr-dynamic-flooding provides. You seem to be saying 
“this is the only flooding reduction algorithm we need” and you are not 
interested in allowing deployment of anything else – now or in the future. This 
lessens my enthusiasm for this draft.

The mechanisms proposed in draft-ietf-lsr-dynamic-flooding are analogous to 
what is used for DIS election and (more recently) for selecting the winning FAD 
for a given flex-algo. Given the significant deployment of flex-algo and the 
long history of DIS election, I am surprised at the degree of concern you have 
for the use of these mechanisms.

2) Regarding the use of PSNPs… you propose to send a PSNP (once, apparently) which
has the LSP entries for all the LSPs which you chose NOT to flood to a given 
node (minus any LSPs for which you may have received an explicit ack) in the 
most recent time interval - suggested to be one second.
What will happen when you send this? Let’s use a simple example where one LSP 
was selectively flooded – call it A.00-01(Seq #100).
NOTE: This example assumes a P2P circuit.

a) Neighbor receives the PSNP, already has A.00-01(Seq #100) in its LSPDB – no
action taken. All is good.
b) Neighbor receives the PSNP, does not have A.00-01(Seq #100) in its LSPDB –
sends a PSNP back to the originator requesting that the LSP be flooded. At this 
point I assume normal flooding procedures apply i.e., SRM flag is set, causing 
the LSP to be flooded, and I assume SRM remains set until the LSP is 
acknowledged.
All is good – but the additional flooding is likely to be redundant as the node 
which had the responsibility for sending this LSP to your neighbor should be 
doing so reliably.
c) Neighbor does not receive the PSNP. If the neighbor does not have A.00-01(Seq
#100) in its database, the one-time sending of the special PSNP won’t trigger
sending of the missing LSP. As the draft does not propose that the special PSNP 
be resent, I assume during the next time interval the only LSP entries that 
would be sent in the next special PSNP would be other LSPs that were partially 
flooded in the subsequent interval – not A.00-01.
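
The three cases above can be sketched in a few lines. This is a simplified
illustration, not IS-IS as specified: the function name lsps_to_request and
the dictionary model of the LSPDB are hypothetical, a lost PSNP is modelled as
None, and the case where the receiver holds a newer copy is ignored.

```python
from typing import Dict, List, Optional, Tuple

LspEntry = Tuple[str, int]  # (LSP ID, sequence number)


def lsps_to_request(lspdb: Dict[str, int],
                    psnp: Optional[List[LspEntry]]) -> List[str]:
    """Return the LSP IDs a neighbor would request after a special PSNP.

    lspdb maps LSP ID to the sequence number the neighbor holds.
    psnp is None when the PDU was lost in transit (case c).
    """
    if psnp is None:
        return []  # nothing arrived, so nothing can be requested
    wanted = []
    for lsp_id, seq in psnp:
        have = lspdb.get(lsp_id)
        if have is None or have < seq:
            wanted.append(lsp_id)  # missing or older copy: ask for it
    return wanted


entry = [("A.00-01", 100)]
case_a = lsps_to_request({"A.00-01": 100}, entry)  # already has it
case_b = lsps_to_request({}, entry)                # missing, PSNP received
case_c = lsps_to_request({}, None)                 # missing, PSNP lost
```

Cases a) and b) behave as described. In case c) nothing is requested even
though the LSP is missing, because the one-time PSNP never arrived.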

Periodic CSNPs can be dropped as well, but as periodic CSNPs are guaranteed to 
be sent continuously at some interval and they cover the entire LSPDB, 
reliability of the Update process is assured. Under some pathological 
conditions it might take a significant amount of time to converge, but it is 
assured.
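
A rough way to see the difference, under assumptions that are not in the
draft: each SNP is lost independently with a fixed probability, and once an
SNP does arrive the request and retransmission succeed. The function name
prob_still_missing is hypothetical.

```python
def prob_still_missing(loss_rate: float, snps_sent: int) -> float:
    """Probability that none of snps_sent SNPs covering the LSP arrived."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if snps_sent < 0:
        raise ValueError("snps_sent must not be negative")
    return loss_rate ** snps_sent


# A special PSNP sent once gets exactly one chance to arrive.
one_shot_psnp = prob_still_missing(0.1, 1)

# Periodic CSNPs cover the whole LSPDB and keep coming, so the chance that
# the gap is still undetected shrinks with every interval.
periodic_csnp = [prob_still_missing(0.1, k) for k in range(1, 6)]
```

With the one-shot PSNP the residual probability stays at the loss rate; with
periodic CSNPs it falls geometrically toward zero, which is the sense in which
convergence may be slow but is assured.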

What then do these special PSNPs provide? It could be argued that they provide 
a lower cost and more targeted recovery mechanism in some circumstances – and 
that using them in conjunction with periodic CSNPs may speed convergence. 
However, I think the existing proposal discussed in Section 2.3 of the draft 
lacks detail and is unlikely to achieve this goal in most circumstances.

The time period of 1 second is too aggressive. You may end up sending the 
special PSNP before the node which has the responsibility for flooding the LSP 
to your neighbor has even had a chance to do the flooding – which will 
undermine the benefits of the flooding reduction.

If you consider the cost of sending/receiving a PSNP is roughly equivalent to 
the cost of sending/receiving an LSP, you will have created the equivalent of 
full mesh flooding every second since every node can expect to receive a PSNP 
from every neighbor whenever an LSP update is triggered. NOTE: The relative 
impact will be more noticeable when a small # of LSPs are updated.
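
A back-of-the-envelope count of what one node receives per interval, as a
sketch only. The model is an assumption, not taken from the draft: every
neighbor holds every updated LSP, each neighbor either floods all of them or
suppresses all of them, and a suppressing neighbor batches its entries into a
single special PSNP. The function name is hypothetical.

```python
from typing import Dict


def packets_received_per_interval(degree: int, flooding_neighbors: int,
                                  updated_lsps: int) -> Dict[str, int]:
    """Count PDUs received by one node in one interval under the toy model."""
    if not 0 <= flooding_neighbors <= degree:
        raise ValueError("flooding_neighbors must be between 0 and degree")
    if updated_lsps < 0:
        raise ValueError("updated_lsps must not be negative")
    # Full flooding: up to one copy of each LSP from each neighbor.
    full_flooding = degree * updated_lsps
    # Reduced flooding: only the designated neighbors send LSP copies ...
    reduced_lsps = flooding_neighbors * updated_lsps
    # ... and every other neighbor sends one special PSNP, if anything changed.
    special_psnps = (degree - flooding_neighbors) if updated_lsps else 0
    return {
        "full_flooding": full_flooding,
        "reduced_lsps": reduced_lsps,
        "special_psnps": special_psnps,
        "reduced_total": reduced_lsps + special_psnps,
    }


single_update = packets_received_per_interval(20, 2, 1)
burst_update = packets_received_per_interval(20, 2, 10)
```

With 20 neighbors, 2 of which flood, a single updated LSP yields 20 PDUs either
way (2 LSPs plus 18 PSNPs versus 20 LSPs). For a burst of 10 LSPs the totals
are 38 versus 200, so the relative overhead is largest when few LSPs change.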

And since the node which is responsible for flooding to a particular neighbor 
should be doing so reliably, under most circumstances the special PSNP is not 
needed at all – so why choose an aggressive time interval for sending it?

Periodic CSNPs are sufficient – are typically done at a slow rate (10s of 
seconds) – and apparently (from your response below) you seem to intend to send 
periodic CSNPs also (though the draft does not mention this). I am not seeing 
the benefit of the special PSNP – but if you are committed to this, please 
provide a more robust description of how they should be used in the draft and 
an analysis of the benefits under some realistic flooding scenarios.

   Les


From: Tony Przygienda <tonysi...@gmail.com>
Sent: Friday, November 25, 2022 1:06 AM
To: Les Ginsberg (ginsberg) <ginsb...@cisco.com>
Cc: draft-white-lsr-distoptflood.auth...@ietf.org; lsr@ietf.org
Subject: Re: [Lsr] Questions on draft-white-lsr-distoptflood


Les, a bit of a delay since I had to think a bit about your comment to do it
justice, and it's a bit long'ish.

1. So, to start with a cut and dry summary and reasoning for it, I am firmly 
against adding signaling to the whole thing by some means (or rather any 
procedures to act upon distribution of info about the algorithm used by any of 
the nodes involved, i.e. I'm ok with having the algorithm advertised solely for
info purposes, though I don't see what function it serves except detecting
nodes that do not reduce yet during a network transition or maybe, as you say,
detecting algorithm mismatch). More detailed reasoning follows:

a. First reason is the fact that the additional flexibility of maybe having one 
day some better hash algorithm will add a very serious amount of complexity in
implementation/behavior in case we are talking about adding it to the 
centralized variant of the dynamic flooding draft and having a leader 
advertising the algorithm.
    i. backup machinery needs to be added/spec'ed properly. What does the
network do if backup has different algorithm than the current leader? First we
would have a transition phase, some nodes have old algorithm, some the new,
network may stop converging for a bit that way, worst case we partition the PGL
algorithm advertisement from new nodes so we have to wait CSNP * diameter etc.
Big network bleep is the result. I know there is lots of verbiage in the dynamic
flooding draft but I know the reality of implementations of such things and
the costs are extraordinarily high for the bit of flexibility the whole thing
would buy us, as I see you suggesting.
   ii. What happens if PGL doesn't say anything? Default algorithm? Full
flooding again? In case of full-flooding regression, all of a sudden one fat
finger on PGL (or PGL moving unexpectedly due to fat finger/some other node
config changes) can basically crash your network and worst case stop
convergence if reduction allowed it to converge before but full flooding
seriously slows down everything. I know, this would be a network teetering on
the edge already but why have additional daemons hiding in a single point of
failure on top.
  iii. lots of remaining subtle things, e.g. to make sure the whole thing works
each node has to compute reachability to the leader (not sure that's in the
dynamic flooding draft now), otherwise they may use stale LSPs from a leader
that is gone/partitioned. This reachability computation will have adverse 
effects. The timing is unpredictable in the network and may lead to problems 
mentioned in i).   If nodes don't do the reachability we may end up in Paxos 
unintentionally BTW.

Generally, I can claim that I lived the PGL in ATM so I've seen the "central 
leader in IGP" game. Not excited about it from experience and it was much 
easier in ATM already due to hard state of SVCs. To sum it up again, I see here 
a suggestion to add a massive amount of complexity/fragility for an assumed,
unspecified benefit in the future. As footnote: centralization in an IGP is a
cardinal sin in my eyes, moving away from the first premise that made
distributed routing so successful. I spoke against it and still hold the same 
opinion and if that's heresy I'm more than happy to be bumped off the author's 
list of the dynamic-flooding draft ;-).

so maybe as iv) here:  WHAT additional variables in the hash do you imagine 
would constitute a _better_ algorithm? AFAIS there are none I can imagine and 
the current algorithm provides pretty much best entropy with clearly cap'ed 
state per node needed to balance per LSP originator/fragment. So instead of
"pledging for flexibility for flexibility's sake" I'd rather see you
suggesting something that would change/improve the behavior in the future/now 
in concrete terms and then let's talk about specifics.

b. Then, as second reason when talking towards a distributed solution, i.e. 
each node flooding the algorithm it uses. We still do NOT know what to do in 
case nodes end up with different algorithms each, no matter whether it's advertised
or not. Shut down the network, fall back to full flooding if one node disagrees 
(which makes every node a potential attack vector)? We had that kind of 
discussion before, last on multi-TLV where you were insisting on killing the 
cap indication so it would be funny to add it here.  Complexity without any 
concrete benefit whatsoever AFAIS and lots of ratholes again.

2. To go to your reliable PSNP/CSNP objection now. First, they were never 
reliable. Neither were LSPs. We can make a very fine argument that if 
PSNPs/CSNPs are not reliable then ISIS will not converge at all. We can start 
to argue then how many we lose and when and how one variation of flooding is 
"more robust" than other and we can actually discover that if the redundancy 
factor in graph is higher than the largest fanout then we are in normal ISIS
and hence the reduced flooding redundancy factor (in extreme case it's 
basically infinity for existent flooding algorithm in ISIS) + PSNP 
unreliability are two variables (plus network radius + origination rates + etc) 
which in extreme case can be shown to not converge the network anymore no 
matter the flooding (e.g. if the re-origination rate + radius is higher than 
the propagation time under CSNP/PSNP losses).  In short, the objection brings 
nothing new to the table, Les, this has been around forever and we're talking 
here about the fact that any flooding reduction makes flooding "less" reliable 
somewhat. That's trivia.

b. to more productive arguments: the solution does NOT reduce the full CSNP 
advertisement and this will fix any bug in an algorithm. We agree that far I 
think.

3. Then, let's have the up-to-date PSNP in glossary with a better name, e.g. 
"consistency assuring PSNP" or CA-PSNP which describes better what it is. It 
cannot hurt.

It goes like this (which I thought was already decently clear in the draft but 
nothing wrong in spelling that out)

a) the algorithm figures out during computation that LSP-ID X/fragment Y is NOT 
flooded on since other RNL members took over. Now, the corresponding LSP-ID
X/fragment Y is put on the PSNP queue of all the members in TN that are your
neighbors (optimization here) or as the draft says "all your neighbors" which
is a bit too conservative.  Flood out those PSNPs on a one-second timer unless they
were killed during normal ISIS processing rules or already went out.  Observe 
that NO changes are made to normal ISIS CSNP/LSP/PSNP processing here except 
dropping those PSNPs into the corresponding queues to go out. If the neighbor gets
the PSNP and interprets it as something newer, normal procedures kick in. If it
already has it nothing will happen really per normal procedures.  If your
implementation is very conservative you can choose super conservative
constants yourself, e.g. unless you see triple coverage by other RNLs you flood
nevertheless. Or if it turns out you send PSNPs to your neighbors in 
expectation that they covered the TNLs and you get requests back, either the 
other TNLs are dead slow or something is off and an alarm can be given as in 
"flooding reduction here struggles". Nothing to do with this solution, this 
will happen on any type of flood reduction, chokepoints may get created (and 
observe that this draft load balances flooding and not only reduces, one of the 
lessons I learned implementing those things in my earlier lives ;-)
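
The queueing step described above might look roughly like the following. This
is a minimal sketch under stated assumptions, not the draft's procedure: the
class CaPsnpQueues and its method names are hypothetical, entries are reduced
to (LSP ID, fragment) pairs, and the one-second timer is represented by an
explicit on_timer() call.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

LspKey = Tuple[str, int]  # (LSP ID, fragment number)


class CaPsnpQueues:
    """Per-neighbor queues of LSP entries that were deliberately not flooded."""

    def __init__(self) -> None:
        self._pending: Dict[str, Set[LspKey]] = defaultdict(set)

    def suppress(self, lsp: LspKey, neighbors: Iterable[str]) -> None:
        """Record that lsp was not flooded to these neighbors."""
        for neighbor in neighbors:
            self._pending[neighbor].add(lsp)

    def cancel(self, neighbor: str, lsp: LspKey) -> None:
        """Drop an entry, e.g. because the neighbor acknowledged the LSP."""
        self._pending[neighbor].discard(lsp)

    def on_timer(self) -> Dict[str, List[LspKey]]:
        """Return the entries to send per neighbor and empty all queues."""
        out = {n: sorted(e) for n, e in self._pending.items() if e}
        self._pending.clear()
        return out


queues = CaPsnpQueues()
queues.suppress(("X", 1), ["N1", "N2"])
queues.cancel("N1", ("X", 1))   # N1 acked before the timer fired
first_flush = queues.on_timer()
second_flush = queues.on_timer()
```

Note that the queues are emptied on every flush, so an entry is advertised
once; nothing here resends it if the PSNP is lost.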

So, to sum up the argument chain, I err on the side of simplicity here since 
from experience, simplicity allows us to deploy and stand straight-faced in 
front of customers with very large, dense networks. This draft is something
that consists of 12 pages including examples and about 4-5 pages boilerplate.
And on top it is based on old clean work and pretty much e'thing in it is proven
by implementation and previous art IME. This vs. an adopted design-by-committee
draft of 46 pages that at this point in time I think does not standardize any 
interoperability but standardizes how to find out why things don't interoperate 
due to all possible combinations of centralized vs. distributed plus bring your 
own algorithm on top by every vendor (based on my last read of it) ...

-- tony

On Wed, Nov 23, 2022 at 1:14 AM Les Ginsberg (ginsberg)
<ginsberg=40cisco....@dmarc.ietf.org> wrote:
Draft authors -

The WG adoption call reminded me that I had some questions following the 
presentation of this draft at IETF 114 which we decided to "take to the list" - 
but we/I never did.
Looking at the minutes, there was this exchange:

<snip>
Les:           I'm not convinced that you don't need to advertise
               whether a node needs support this. If not, why not define
               this as an algorithm and use the dynamic flooding?
Tony P:        First bring me a case why we need to signal this.
Les:           If I'm not going to flood and I'm expecting someone else
               to flood, and I don't know whether we're in sync.
Tony:          Think it through, the mix with old nodes just fine. The
               old guy still do the full flooding and that's fine.
Les:           You use the term up-to-date PSNP, I have no idea how you
               determine whether the PSNP is "up-to-date"? unlike CSNP,
               PSNP doesn't have the info.
Tony:          You have to list all those things.
Les:           Let's take it to the list.
<end snip>

Question #1: Why not define this as an algorithm and use 
draft-ietf-lsr-dynamic-flooding (in distributed mode)?
This question is of significance both from a correctness standpoint and what 
track (Informational or Standard) the draft should target.

Tony P's reply above suggests this isn't needed - but I don't think this is 
true. The draft itself says in Section 2.1:

<snip>
Once this flooding group is determined, the members of the flooding
   group will each (independently) choose which of the members should
   re-flood the received information.  Each member of the flooding group
   calculates this independently of all the other members, but a common
   hash MUST be used across a set of shared variables so each member of
   the group comes to the same conclusion.
<end snip>

If a "common hash MUST be used across a set of shared variables" (and I agree 
that it MUST) then all nodes which support the optimization MUST agree to use 
the same algorithm. Given that there are likely many hash algorithms which 
could be used, some way to signal the algorithm in use seems to be required.
By publishing a given algorithm (including the hash) and having it assigned an
identifier in the registry defined in 
https://www.ietf.org/archive/id/draft-ietf-lsr-dynamic-flooding-11.html#section-7.3
 - and using the Area Leader logic defined in the same draft, consistency is 
achieved.
Without that, I don't think this is guaranteed to work.
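
To illustrate the concern, here is a toy sketch of two flooding-group members
deciding who re-floods an LSP. Everything in it is an assumption made for
illustration: toy_hash_a and toy_hash_b are made-up stand-ins, not the hash
from the draft, and the selection rule (hash value indexes a sorted member
list) is simplified.

```python
from typing import Callable, Dict, List

HashFn = Callable[[bytes, int], int]


def toy_hash_a(lsp_id: bytes, group_size: int) -> int:
    """Toy hash: byte sum modulo group size."""
    return sum(lsp_id) % group_size


def toy_hash_b(lsp_id: bytes, group_size: int) -> int:
    """A second toy hash that differs from toy_hash_a by a fixed offset."""
    return (sum(lsp_id) + 1) % group_size


def self_elected_flooders(group: List[str], algo_of: Dict[str, HashFn],
                          lsp_id: bytes) -> List[str]:
    """Return the members that conclude they themselves must re-flood."""
    ordered = sorted(group)  # every member orders the group identically
    flooders = []
    for member in ordered:
        chosen = ordered[algo_of[member](lsp_id, len(ordered))]
        if chosen == member:
            flooders.append(member)
    return flooders


group = ["R1", "R2"]
same = {"R1": toy_hash_a, "R2": toy_hash_a}
mixed = {"R1": toy_hash_a, "R2": toy_hash_b}

agreed = self_elected_flooders(group, same, b"\x01")
nobody = self_elected_flooders(group, mixed, b"\x01")
both = self_elected_flooders(group, mixed, b"\x02")
```

With a common hash exactly one member floods. With mismatched hashes the same
LSP can end up with no flooder at all (each member expects the other to do
it), or with both flooding, which is safe but defeats the reduction.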

Note the issue here has nothing to do with legacy nodes - I agree with Tony P's 
comment above that legacy nodes do not present a problem - they just limit the 
benefits.

Question #2: Please define and demonstrate how "up-to-date PSNPs" work to 
recover from flooding failures.

We know that periodic CSNPs robustly address this issue - and their use has 
been recommended for flooding reduction solutions over the years.
Please more completely define "up-to-date PSNPs" and spend some time 
demonstrating how they are guaranteed to work - and consider in that discussion 
that transmission of SNPs of either type is not 100% reliable.

Thanx.

    Les

_______________________________________________
Lsr mailing list
Lsr@ietf.org
https://www.ietf.org/mailman/listinfo/lsr
