Hi Nick,

A few remarks and nits below. You can assume that I'm fine with anything
I've trimmed.
On Oct 20, 2014, at 8:06 PM, Nick Hilliard <n...@inex.ie> wrote:

...

> thanks for your extensive review - it has been very helpful.

You're welcome.

...

> Long term link persistence for rfc references is a problem that the ietf
> probably needs to deal with separately

Agreed.

> by downloading the content at the time of publication and storing in its
> own archive.

I bet that would run afoul of copyright release issues in many cases, but
in any case I'd be glad to see the issue addressed somehow. By someone
else.

...

>> In the case where P_avg (the arithmetic mean number of unique paths
>> received per route server client) remains roughly constant even as the
>> number of connected clients increases, this relationship can be
>> rewritten as O((P_avg * N) * N) or O(N^2).
>>
>> I don't see where the second factor of N comes from. You're basically
>> expanding the P in the first expression as P_avg * N -- but why?
>
> yes, this is not as clear as it could be.
>
> First, to clarify: this paragraph is concerned only with network traffic
> requirements, rather than with cpu / memory.
>
> Assume for a moment that each client announces a constant P_avg unique
> routes to the route-server and that there are N clients. The total
> number of unique paths received by the route server will be:
>
>   P_tot = P_avg * N
>
> where for the sake of argument P_avg is constant.
>
> The route server will create a RIB containing P_tot entries and will
> send that to N clients. The total number of prefix announcements from
> the route server will be O(P_tot * N) = O((P_avg * N) * N) = O(N^2).
> This is a worst-case situation and assumes that each prefix has a
> different attribute set.
>
> To clarify this in the text, I've changed to:
>
>> Regardless of whether any Loc-RIB optimization technique is
>> implemented, the route server's theoretical upper-bound network
>> bandwidth requirements will scale according to O(P_tot * N), where
>> where P_tot is the total number of unique paths received by the route
>> server and N is the total number of route server clients.

where where in the the spring

> and then clarified
>
>> Symbolically, this means that P_tot = P_avg * N.

>> I think this would only apply if add-path all-paths was chosen as the
>> path hiding mitigation strategy -- but this is not touched on in
>> route-server-operations, only in ix-bgp-route-server, and besides that
>> the beginning of the paragraph implies you're analyzing the multiple
>> Loc-RIB strategy, so I don't guess all-path is what you were thinking
>> of. If you're not doing all-path, the O(N^2) analysis is wrong AFAICT.
>> To see this, consider that the inbound routes require O(P_avg * N),
>> which is just O(N), but the number of routes you're going to advertise
>> is bounded by the size of the Internet routing table, which is a
>> constant for purposes of this analysis, so also O(N). In and out are
>> summed, not multiplied, so the whole thing works out to be O(N), not
>> O(N^2).
>
> Some spherical cows in a vacuum may have been harmed during this
> analysis.

:-)

> The problems revolve around the assumptions, namely:
>
> 1. P_tot = P_avg * N
> 2. P_avg is a realistic characterisation of the number of prefixes
>    announced by each client.
> 3. P_tot is unbounded
> 4. different attribute sets per prefix
>
> You're correct that P_tot is bound above by the size of the DFZ and
> after a certain stage, bandwidth requirements will be linear, O(N). But
> until the point at which this becomes the upper bound, theoretical
> scaling growth will tend towards being quadratic.

I agree with all that. However I think the point at which DFZ size
becomes the upper bound is well below the point at which a practical
problem rears its ugly head.
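Just to put toy numbers on that crossover (entirely made up: P_avg =
5,000 unique paths per client, and ~500k as a round figure for the DFZ),
two minutes of Python:

    # Toy model: worst-case RS announcement volume vs. client count N.
    # P_AVG and DFZ_SIZE are illustrative assumptions, not measurements.
    P_AVG = 5000        # hypothetical mean unique paths per client
    DFZ_SIZE = 500000   # round figure for the full DFZ table

    for n in (10, 50, 100, 500, 1000):
        p_tot = min(P_AVG * n, DFZ_SIZE)  # P_tot is capped by the DFZ
        announcements = p_tot * n         # worst case: P_tot sent to all N
        print("N=%5d  P_tot=%7d  announcements=%13d"
              % (n, p_tot, announcements))

With those inputs the growth is quadratic only up to N = 100; past that
P_tot is pinned at the DFZ size and the curve goes linear -- well before
N is large enough for the quadratic term to hurt anyone.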
> Most prefixes will use one of a limited number of attribute sets,
> leading to obvious transmission optimisation.
>
> The stddev for P_n is very large indeed. Consider AS6939 (currently 58k
> prefixes) and Joe's WISP Service (1 prefix): both are route-server
> users.
>
> Yes, add-path would add another level of complication in the analysis,
> but at the moment there are no ebgp add-path implementations, so we
> can't test.

>> So I think this needs to either be corrected, or the assumptions need
>> to be better explained. Moving on:
>>
>> This quadratic upper bound on the network traffic requirements
>> indicates that the route server model will not scale to arbitrarily
>> large sizes.
>>
>> If you continue to think this sentence is warranted, I think it should
>> be better quantified. Of course nothing can scale to *arbitrarily*
>> large sizes, but that still leaves a lot to the imagination. I would
>> think it would be beneficial for an IX operator reading this document
>> to have some idea of how practical the limitation is. Since the
>> analysis in question is looking at control traffic bandwidth
>> consumption, it wouldn't be too onerous to throw some simple
>> assumptions up against it -- for example, "if we suppose an RS receives
>> on average 100,000 routes from each client with a rate of change of 10
>> routes/second, sends on average 1,000,000 routes to each client with a
>> rate of change of 100 routes/second, and that each route consumes on
>> average 50 bytes in a BGP UPDATE message, simple arithmetic shows that
>> a GigE connection to that RS will be fully saturated by the time the
>> number of clients reaches 25,000." (Which does not seem like a very
>> practical limitation; the RS will hit a CPU or memory bottleneck
>> first.)
>>
>> Anyway, maybe you will decide on reconsideration of the big-O analysis
>> that this bit is not needed at all, which would be OK with me.
>
> yes and no. This stuff is implementation-dependent and the big-O
> analysis is only of limited value from a practical point of view.

Agreed -- which leads me to wonder if its inclusion in the document
contributes more light than it does smoke.
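For concreteness, here's the arithmetic behind the 25,000 figure in my
example above (the inputs are the assumed averages from that example,
and I'm counting only the TX side, since GigE is full duplex):

    # Back-of-envelope for the 25,000-client figure; inputs are the
    # assumed averages from the example, not measured values.
    ROUTE_BYTES = 50          # avg bytes per route in a BGP UPDATE
    TX_ROUTES_PER_SEC = 100   # avg churn sent to each client
    GIGE_BPS = 10**9          # full duplex, so TX saturates on its own

    per_client_bps = TX_ROUTES_PER_SEC * ROUTE_BYTES * 8  # 40,000 bit/s
    print(GIGE_BPS // per_client_bps)                     # -> 25000 clients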
> It's fine for smaller systems, but breaks for larger ones.
>
> From a measurement point of view, you're correct that cpu bottlenecks
> hit first. Implementation-wise, memory is cheap to fix; cpu is harder
> because individual cores aren't speeding up much more these days, and so
> from an implementation point of view, RSs benefit from careful Loc-RIB
> optimisation. Bandwidth is also cheap because you can throw a 10G pipe
> at the server, and the problem will then generally revert to a network
> card driver problem if you can't depend on zero-copy data transmission,
> or back to a CPU problem if you have unique update sets per client. CPU
> will be an issue if you use actual Loc-RIB copies per client (quagga)
> instead of a single virtual loc-rib with per-client diffs (BIRD / IOS).
> And most organisations don't need their own loc-rib anyway. After all,
> people connect to route servers in order to interconnect promiscuously
> rather than take the safer route of bilateral peering sessions.
>
> So yeah, scaling is still a serious problem. The performance difference
> between the fastest and slowest RS implementations is measured in orders
> of magnitude.
>
> Which comes back to the issue of where to draw the line. There's piles
> that could be said, much of it highly implementation-dependent (i.e. not
> especially suitable for a persistent recommendation document). Probably
> it would be useful to have a better explanation of how the assumptions
> break down in practice on larger systems.
>
> I've added a new paragraph before "Tackling Scaling Issues" which reads:
>
>> In practice, most prefixes will be associated with a limited number of
>> BGP path attribute sets, allowing more efficient transmission of BGP
>> routes from the route server than the theoretical analysis suggests. In
>> the analysis above, P_tot will increase monotonically according to the
>> number of clients, but will have an upper limit of the size of the full
>> default-free routing table of the network in which the IXP is located.
>> Observations from production route servers have shown that most route
>> server clients generally avoid using custom routing policies and
>> consequently the route server may not need to deploy per-client
>> Loc-RIBs. These practical bounds reduce the theoretical worst-case
>> scaling scenario to the point where route-server deployments are
>> manageable on even on larger IXPs.

"on even on" -> "even on".

With the addition, I think the new section is sufficiently correct. I'm
just not sure it helps the reader very much. I leave it to you to decide.

> the next paragraph starts:
>
>> 4.2.1. Tackling Scaling Issues
>>
>> The problem of scaling route servers still presents serious
>> practical challenges and requires careful attention. Scaling
>> analysis indicates problems [...]

>> - S 4.2.2.1,
>>
>> If the route server operator has prior knowledge of interconnection
>> relationships between route server clients, then the operator may
>> configure separate Loc-RIBs only for route server clients with unique
>> outbound routing policies.
>>
>> It wasn't obvious to me what "outbound" applies to -- the client? The
>> RS? -- and for that matter why an inbound policy (on the RS) might not
>> apply. Possibly this could be remedied by simply dropping the adjective
>> "outbound".
>
> removing "outbound" reduces the ambiguity; probably it's reduced enough
> to make the meaning clear from the context but being an author, it's
> difficult to tell (r116).

WFM.

Regards,

--John
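P.S. Since the quagga-vs-BIRD point may not be obvious to everyone
reading along, here's a toy sketch of the "single shared Loc-RIB with
per-client diffs" idea, in Python. Purely illustrative -- the data
structures and the export_policy stub are invented for the example, and
I make no claim that this resembles what BIRD or IOS actually do
internally.

    # Toy sketch: one shared Loc-RIB, with per-client diffs computed at
    # export time instead of a private Loc-RIB copy per client.
    shared_rib = {}   # prefix -> best path; one copy for all clients
    last_sent = {}    # client -> {prefix: path} as last advertised to it

    def export_policy(client, prefix, path):
        # Stand-in for per-client export filtering; accepts everything.
        return True

    def updates_for(client):
        # Diff the shared RIB against what this client last saw, so only
        # the changes cross the wire.
        view = dict((p, v) for p, v in shared_rib.items()
                    if export_policy(client, p, v))
        old = last_sent.get(client, {})
        announce = dict((p, v) for p, v in view.items() if old.get(p) != v)
        withdraw = set(old) - set(view)
        last_sent[client] = view
        return announce, withdraw

    # One RIB change fans out as a small diff to each client:
    shared_rib["192.0.2.0/24"] = "path-via-AS64500"
    for c in ("client-a", "client-b"):
        print(c, updates_for(c))

(A real implementation would keep the per-client state far more
compactly than last_sent does here; the point is only the shape of the
diff-at-export idea, and the diff computation is where the CPU goes, per
your observation.)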