Inline ... -Jon
On 31/07/14 09:48 +0000, Vitkovský Adam wrote:
> Hello Peter,
>
> I'd like to ask a couple of questions regarding the design and to confirm
> my understanding, please.
>
> What is the recommended fan-out ratio for Tier3 to Tier2 and Tier2 to
> Tier1, please?
> Tier3 to Tier2 would be 1/4 (so in case one Tier2 device fails, the
> remaining BW still available for a cluster is 75%) (Tier3 device has 4
> ECMP paths).
> Tier2 to Tier1 would be 1/2 (so in case one Tier1 device fails, the
> remaining BW still available for the cluster is 87.5%) (Tier2 device has
> 2 ECMP paths).
>
> Or is it more like 1/32 for Tier3 to Tier2 (Tier3 device has 32 ECMP
> paths), please?

This is highly dependent on the maximum size (number of clusters) of the DC
and the amount of traffic required to be carried between tiers.  YMMV.  If
Tier3 in your design is a 64-port ToR, it may be a bit extreme to use half
the ports as uplinks, although it is certainly possible.  Most cluster
designs choose 1/4 or 1/8, but higher ratios are certainly possible.
Making a recommendation that applies to all DC sizes and traffic
requirements is probably not feasible.

> 8.2.1. Collapsing Tier-1 Devices Layer.
> - I think that as a result of collapsing the number of Tier1 devices to
> half, the impact of the failure of a single Tier1 device will increase by
> 50%.
> - Thus, wouldn't it be more desirable to leave the same number of Tier1
> devices and only add links from a particular Tier2 device to
> another/neighboring pair of Tier1 devices, please?
> - The reduction in port capacity would remain the same.
> - However, the impact of a failure of a single link or a single Tier1
> device would be unchanged.
>
> 8.2. Route Summarization within Clos Topology.
> - Since you have mentioned that all the devices are preferably of the
> same type, to accommodate REQ2.
> - I'm thinking they would probably have the same FIB capacity, right?
> - So if a Tier1 device can hold all the DC routes, then Tier2 and Tier3
> devices can as well, right?
> - If the FIB size differs between the devices used in the various tiers,
> then summarization is beneficial indeed.

I think you read these sections as disjoint.  Section 8.2.1 only applies if
you desire to do summarization in the Clos; most operators may agree the
trade-off is not worth it and not summarize in the fabric.  These operators
may also have the same FIB capacity on all devices.  Other operators may
desire summarization in the Clos, either because they did not select
devices of the same FIB capacity or because they want to reduce the control
plane exposure as suggested in section 8.2.  Section 8.2.1 explains one way
that could be done and the associated trade-offs of doing it.

> If FIB size is a cause for concern, would it be possible to utilize a
> scheme where servers are grouped into server groups, and then to define
> which server groups need to communicate with which other server groups,
> with everybody, or with the internet, please?
> This way prefixes could be marked, and filters on Tier3 and Tier2 devices
> set accordingly - to only allow the necessary prefixes to be accepted
> from a peer or inserted from BGP into the RIB/FIB.
> The drawback is of course the increased operational complexity of
> maintaining the filters, as well as troubleshooting.
> Though with a clear server-groups-to-Tier3-devices (or clusters) mapping
> scheme, the filters would be set once and then maintained only
> occasionally.
> Also, with a clear communities scheme, troubleshooting would be
> straightforward, I believe.
>
> I'm thinking like in MPLS VRFs: if a particular PE (Tier3 device) is
> serving only a subset of VRFs (server groups), it doesn't really need to
> hold all the DC routes.

Implicit in the requirements is full reachability to server subnets in the
design (from every other server subnet, and typically with a default route
providing external connectivity to the Clos as outlined).
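The FIB-scaling trade-off behind this exchange can be sketched with some
back-of-the-envelope arithmetic.  All numbers below are hypothetical,
chosen only to illustrate the point: a full-reachability Tier3 device
carries every subnet in the DC, while a community-filtered one carries
only the server groups it serves plus a default route.

```python
# Hypothetical DC dimensions, for illustration only.
CLUSTERS = 32              # assumed number of clusters in the DC
SUBNETS_PER_CLUSTER = 64   # assumed server subnets per cluster

def fib_routes_full() -> int:
    """Routes a Tier3 device holds with full reachability (no filtering)."""
    return CLUSTERS * SUBNETS_PER_CLUSTER

def fib_routes_filtered(groups_served: int, subnets_per_group: int) -> int:
    """Routes held if filters admit only the server groups this Tier3
    device needs to reach, plus one default route for everything else."""
    return groups_served * subnets_per_group + 1

print(fib_routes_full())           # 2048 routes with full reachability
print(fib_routes_filtered(4, 64))  # 257 routes when serving 4 groups
```

The gap grows linearly with DC size, which is why the filtering idea is
attractive when FIB capacity differs between tiers - at the operational
cost of maintaining the filters, as noted above.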
If this is not a requirement, an alternative would be not building one
large-scale Clos network but rather a number of small-scale Clos networks
that are custom-built for such server groups.  Obviously this limits the
fungibility of the equipment deployed.  Also, it should be stated that
operational simplicity is a stated goal of this design.

> 8.3. ICMP Unreachable Message Masquerading.
> Another option is to make the network device perform IP address
> masquerading,
> - Does that mean the network device will respond with its RID/loopback IP
> during traceroute, please?
> - If so, it would then be impossible to pinpoint the link used to forward
> traffic to the next hop, so if there are two IP paths between directly
> connected devices we wouldn't be able to distinguish the failed one.
> - But I guess this kind of setup is not going to be used.

Yes, typically there is only one connection in the design between two
specific devices in different tiers, so if the previous hop responded, and
the device responding where TTL is exceeded uses its RID as the source,
this effectively identifies the link.  Sometimes there is more than one
link, but in those cases the operator may be using LAG, making this valid
as well.  All of section 8 presents options to the design - if the specific
design an operator chooses has multiple non-LAG links between two devices
in separate tiers, and traceroute responses are deemed highly useful, they
may opt for the second option of section 5.2.3 rather than ICMP
masquerading.

> And just some nit-picking.
> 7.1. Fault Detection Timing.
> This feature is sometimes called as "fast fallover"
> - Do you mean "fast external failover", as it only applies to eBGP
> sessions?
> - Or do you mean the "fast peering session deactivation" functionality
> that brings the same behavior to iBGP sessions, please?

It means the first, although more implementations use the word "fallover"
than "failover" in the command.
If it is confusing, we can add "external" to the wording.  This entire
draft is about an EBGP-only design, so the second does not apply.

_______________________________________________
rtgwg mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/rtgwg
