Hi Nick,

A few remarks and nits below. You can assume that I'm fine with anything
I've trimmed.
On Oct 20, 2014, at 8:06 PM, Nick Hilliard <n...@inex.ie> wrote:

...

> thanks for your extensive review - it has been very helpful.

You're welcome.

...

> Long term link persistence for rfc references is a problem that the ietf
> probably needs to deal with separately

Agreed.

> by downloading the content at the time of publication and storing in its
> own archive.

I bet that would run afoul of copyright release issues in many cases, but
in any case I'd be glad to see the issue addressed somehow. By someone
else.

...

>> In the case where P_avg (the arithmetic mean number of unique paths
>> received per route server client) remains roughly constant even as the
>> number of connected clients increases, this relationship can be
>> rewritten as O((P_avg * N) * N) or O(N^2).
>>
>> I don't see where the second factor of N comes from. You're basically
>> expanding the P in the first expression as P_avg * N -- but why?
>
> yes, this is not as clear as it could be.
>
> First, to clarify: this paragraph is concerned only with network traffic
> requirements, rather than with cpu / memory.
>
> Assume for a moment that each client announces a constant P_avg unique
> routes to the route-server and that there are N clients. The total
> number of unique paths received by the route server will be:
>
>   P_tot = P_avg * N
>
> where for the sake of argument P_avg is constant.
>
> The route server will create a RIB containing P_tot entries and will
> send that to N clients. The total number of prefix announcements from
> the route server will be O(P_tot * N) = O((P_avg * N) * N) = O(N^2).
> This is a worst-case situation and assumes that each prefix has a
> different attribute set.
>
> To clarify this in the text, I've changed to:
>
>> Regardless of whether any Loc-RIB optimization technique is
>> implemented, the route server's theoretical upper-bound network
>> bandwidth requirements will scale according to O(P_tot * N), where
>> where P_tot is the total number of unique paths received by the route
>> server and N is the total number of route server clients.

where where in the the spring

> and then clarified
>
>> Symbolically, this means that P_tot = P_avg * N.

>> I think this would only apply if add-path all-paths was chosen as the
>> path hiding mitigation strategy -- but this is not touched on in
>> route-server-operations, only in ix-bgp-route-server, and besides that
>> the beginning of the paragraph implies you're analyzing the multiple
>> Loc-RIB strategy, so I don't guess all-path is what you were thinking
>> of. If you're not doing all-path, the O(N^2) analysis is wrong AFAICT.
>> To see this, consider that the inbound routes require O(P_avg * N),
>> which is just O(N), but the number of routes you're going to advertise
>> is bounded by the size of the Internet routing table, which is a
>> constant for purposes of this analysis, so also O(N). In and out are
>> summed, not multiplied, so the whole thing works out to be O(N), not
>> O(N^2).
>
> Some spherical cows in a vacuum may have been harmed during this
> analysis.

:-)

> The problems revolve around the assumptions, namely:
>
> 1. P_tot = P_avg * N
> 2. P_avg is a realistic characterisation of the number of prefixes
>    announced by each client.
> 3. P_tot is unbounded
> 4. different attribute sets per prefix
>
> You're correct that P_tot is bound above by the size of the DFZ and
> after a certain stage, bandwidth requirements will be linear, O(N). But
> until the point at which this becomes the upper bound, theoretical
> scaling growth will tend towards being quadratic.

I agree with all that. However I think the point at which DFZ size
becomes the upper bound is well below the point at which a practical
problem rears its ugly head.
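Just to put toy numbers on that crossover (entirely made up: P_avg =
5,000 unique paths per client, and ~500k as a round figure for the DFZ),
two minutes of Python:

    # Toy model: worst-case RS announcement volume vs. client count N.
    # P_AVG and DFZ_SIZE are illustrative assumptions, not measurements.
    P_AVG = 5000        # hypothetical mean unique paths per client
    DFZ_SIZE = 500000   # round figure for the full DFZ table

    for n in (10, 50, 100, 500, 1000):
        p_tot = min(P_AVG * n, DFZ_SIZE)  # P_tot is capped by the DFZ
        announcements = p_tot * n         # worst case: P_tot sent to all N
        print("N=%5d  P_tot=%7d  announcements=%13d"
              % (n, p_tot, announcements))

With those inputs the growth is quadratic only up to N = 100; past that
P_tot is pinned at the DFZ size and the curve goes linear -- well before
N is large enough for the quadratic term to hurt anyone.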
> Most prefixes will use one of a limited number of attribute sets,
> leading to obvious transmission optimisation.
>
> The stddev for P_n is very large indeed. Consider AS6939 (currently 58k
> prefixes) and Joe's WISP Service (1 prefix): both are route-server
> users.
>
> Yes, add-path would add another level of complication in the analysis,
> but at the moment there are no ebgp add-path implementations, so we
> can't test.

>> So I think this needs to either be corrected, or the assumptions need
>> to be better explained. Moving on:
>>
>> This quadratic upper bound on the network traffic requirements
>> indicates that the route server model will not scale to arbitrarily
>> large sizes.
>>
>> If you continue to think this sentence is warranted, I think it should
>> be better quantified. Of course nothing can scale to *arbitrarily*
>> large sizes, but that still leaves a lot to the imagination. I would
>> think it would be beneficial for an IX operator reading this document
>> to have some idea of how practical the limitation is. Since the
>> analysis in question is looking at control traffic bandwidth
>> consumption, it wouldn't be too onerous to throw some simple
>> assumptions up against it -- for example, "if we suppose an RS receives
>> on average 100,000 routes from each client with a rate of change of 10
>> routes/second, sends on average 1,000,000 routes to each client with a
>> rate of change of 100 routes/second, and that each route consumes on
>> average 50 bytes in a BGP UPDATE message, simple arithmetic shows that
>> a GigE connection to that RS will be fully saturated by the time the
>> number of clients reaches 25,000." (Which does not seem like a very
>> practical limitation; the RS will hit a CPU or memory bottleneck
>> first.)
>>
>> Anyway, maybe you will decide on reconsideration of the big-O analysis
>> that this bit is not needed at all, which would be OK with me.
>
> yes and no. This stuff is implementation-dependent and the big-O
> analysis is only of limited value from a practical point of view.

Agreed -- which leads me to wonder if its inclusion in the document
contributes more light than it does smoke.
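For concreteness, here's the arithmetic behind the 25,000 figure in my
example above (the inputs are the assumed averages from that example,
and I'm counting only the TX side, since GigE is full duplex):

    # Back-of-envelope for the 25,000-client figure; inputs are the
    # assumed averages from the example, not measured values.
    ROUTE_BYTES = 50          # avg bytes per route in a BGP UPDATE
    TX_ROUTES_PER_SEC = 100   # avg churn sent to each client
    GIGE_BPS = 10**9          # full duplex, so TX saturates on its own

    per_client_bps = TX_ROUTES_PER_SEC * ROUTE_BYTES * 8  # 40,000 bit/s
    print(GIGE_BPS // per_client_bps)                     # -> 25000 clients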
> It's fine for smaller systems, but breaks for larger ones.
>
> From a measurement point of view, you're correct that cpu bottlenecks
> hit first. Implementation-wise, memory is cheap to fix; cpu is harder
> because individual cores aren't speeding up much more these days, and so
> from an implementation point of view, RSs benefit from careful Loc-RIB
> optimisation. Bandwidth is also cheap because you can throw a 10G pipe
> at the server, and the problem will then generally revert to a network
> card driver problem if you can't depend on zero-copy data transmission,
> or back to a CPU problem if you have unique update sets per client. CPU
> will be an issue if you use actual Loc-RIB copies per client (quagga)
> instead of a single virtual loc-rib with per-client diffs (BIRD / IOS).
> And most organisations don't need their own loc-rib anyway. After all,
> people connect to route servers in order to interconnect promiscuously
> rather than take the safer route of bilateral peering sessions.
>
> So yeah, scaling is still a serious problem. The performance difference
> between the fastest and slowest RS implementations is measured in orders
> of magnitude.
>
> Which comes back to the issue of where to draw the line. There's piles
> that could be said, much of it highly implementation-dependent (i.e. not
> especially suitable for a persistent recommendation document). Probably
> it would be useful to have a better explanation of how the assumptions
> break down in practice on larger systems.
>
> I've added a new paragraph before "Tackling Scaling Issues" which reads:
>
>> In practice, most prefixes will be associated with a limited number of
>> BGP path attribute sets, allowing more efficient transmission of BGP
>> routes from the route server than the theoretical analysis suggests. In
>> the analysis above, P_tot will increase monotonically according to the
>> number of clients, but will have an upper limit of the size of the full
>> default-free routing table of the network in which the IXP is located.
>> Observations from production route servers have shown that most route
>> server clients generally avoid using custom routing policies and
>> consequently the route server may not need to deploy per-client
>> Loc-RIBs. These practical bounds reduce the theoretical worst-case
>> scaling scenario to the point where route-server deployments are
>> manageable on even on larger IXPs.

"on even on" -> "even on".

With the addition, I think the new section is sufficiently correct. I'm
just not sure it helps the reader very much. I leave it to you to decide.

> the next paragraph starts:
>
>> 4.2.1. Tackling Scaling Issues
>>
>> The problem of scaling route servers still presents serious
>> practical challenges and requires careful attention. Scaling
>> analysis indicates problems [...]

>> - S 4.2.2.1,
>>
>> If the route server operator has prior knowledge of interconnection
>> relationships between route server clients, then the operator may
>> configure separate Loc-RIBs only for route server clients with unique
>> outbound routing policies.
>>
>> It wasn't obvious to me what "outbound" applies to -- the client? The
>> RS? -- and for that matter why an inbound policy (on the RS) might not
>> apply. Possibly this could be remedied by simply dropping the adjective
>> "outbound".
>
> removing "outbound" reduces the ambiguity; probably it's reduced enough
> to make the meaning clear from the context but being an author, it's
> difficult to tell (r116).

WFM.

Regards,

--John
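P.S. Since the quagga-vs-BIRD point may not be obvious to everyone
reading along, here's a toy sketch of the "single shared Loc-RIB with
per-client diffs" idea, in Python. Purely illustrative -- the data
structures and the export_policy stub are invented for the example, and
I make no claim that this resembles what BIRD or IOS actually do
internally.

    # Toy sketch: one shared Loc-RIB, with per-client diffs computed at
    # export time instead of a private Loc-RIB copy per client.
    shared_rib = {}   # prefix -> best path; one copy for all clients
    last_sent = {}    # client -> {prefix: path} as last advertised to it

    def export_policy(client, prefix, path):
        # Stand-in for per-client export filtering; accepts everything.
        return True

    def updates_for(client):
        # Diff the shared RIB against what this client last saw, so only
        # the changes cross the wire.
        view = dict((p, v) for p, v in shared_rib.items()
                    if export_policy(client, p, v))
        old = last_sent.get(client, {})
        announce = dict((p, v) for p, v in view.items() if old.get(p) != v)
        withdraw = set(old) - set(view)
        last_sent[client] = view
        return announce, withdraw

    # One RIB change fans out as a small diff to each client:
    shared_rib["192.0.2.0/24"] = "path-via-AS64500"
    for c in ("client-a", "client-b"):
        print(c, updates_for(c))

(A real implementation would keep the per-client state far more
compactly than last_sent does here; the point is only the shape of the
diff-at-export idea, and the diff computation is where the CPU goes, per
your observation.)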