[Simple-evcorr-users] Has anyone mapped root cause for large L2 networks using SEC?

Tim Peiffer Thu, 17 Nov 2011 08:24:48 -0800

I have a network that has quite a bit of 802.1q and 802.3ad that is 
built largely on L2 switching regionally.  The L2 networks are 
micro-segmented along areas that make sense with administrative domains, 
and then routed by core switching devices.  I am looking for root-cause 
approaches that one could model with SEC.  Has anyone done that work and 
is willing to share it?


The problem I am trying to solve is knowing how widespread a problem 
(maybe power, broadcast, congestion) is, and being accurate about it.  
Within a data switching site, all of my devices are adjacent at L3 and 
considered equals of each other.  The only way to define locality of 
reference is to know the L1 and L2 adjacency.  The problem is compounded 
somewhat when one considers Spanning Tree and redundant L2 connectivity.

I can imagine that there are a few general approaches that could be done 
with L2.

Scenarios:
1) count match of names.  If the number of devices that are building 
infrastructure crosses a given threshold, then raise an event indicating 
a possible outage.  Subsequent new devices that have the same name are 
suppressed because the building outage event that was raised earlier 
covers them.
2) defined architecture.  If you have a defined, known architecture, a 
dependency chain could be deployed:
     BN-< CN -< RA -< AA -< BA -< EN -< PI, WE
     BN-< DC -< DA -< DE -< managed host
     BN-< BR -< GR
     BN => Backbone Node ; BR => Border Router ;
     GR => GigaPOP Aggregation Router
     DC => DataCenter Router ; DA => DataCenter Aggregator ;
     DE => DataCenter Edge
     CN => Core Router Node ; RA => Regional Aggregator ;
     AA => Area Aggregator ; BA => Building Aggregator
     EN => Edge Node ; PI => Power Injector ; WE => Wireless Edge

3) dependency based upon measured adjacency at L1/L2.  This would need 
to use some method such as CDP to figure out adjacency at L1.  Any 
unreachability/outage should raise a context (state) noting that a 
device is unavailable.  Given any device, you should know what the 
dependency chain is between the device and EYE.  Treat that dependency 
chain as a set and walk it looking for outages.  If a device in the 
chain is unreachable, then you would expect that device to be the 
root cause.
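
To make scenario 3 a little more concrete, here is a rough sketch of the 
chain walk in Python rather than SEC rules.  Everything in it is an 
assumption for illustration: the `parent` map stands in for adjacency 
learned from something like CDP, the `down` set stands in for the 
"device unavailable" contexts, and the device names are made up.  It 
also assumes that when several devices in one chain are down, the one 
closest to EYE is the one to blame.

```python
from typing import Dict, Optional, Set


def root_cause(device: str, parent: Dict[str, str], down: Set[str]) -> Optional[str]:
    """Walk from `device` toward the monitoring station and return the
    unreachable device closest to the station, or None if the path is up.

    `parent` maps each device to its upstream neighbour (hypothetical data,
    e.g. built from CDP neighbour tables).  `down` holds devices currently
    marked unavailable.
    """
    cause = None
    seen = set()  # guards against loops in badly measured adjacency data
    node: Optional[str] = device
    while node is not None and node not in seen:
        seen.add(node)
        if node in down:
            cause = node  # later hits are further upstream, so overwrite
        node = parent.get(node)
    return cause


# Hypothetical chain: host1 -> DE1 -> DA1 -> DC1 -> BN1 -> (EYE)
parent = {"host1": "DE1", "DE1": "DA1", "DA1": "DC1", "DC1": "BN1"}
down = {"host1", "DE1", "DA1"}

print(root_cause("host1", parent, down))  # DA1
```

In this example host1 and DE1 would be treated as symptoms of DA1 and 
could be suppressed.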

Analysis:
1) appears to be too problematic and to involve too much guesswork.  
There is no positive indication.
2) appears to be problematic in that a given architectural building 
block may or may not be deployed.  So if an edge plugs directly into a 
core, or more likely other aggregation layers such as area and regional 
are not deployed, then those aggregation points are undefined.
3) This looks to be the best, but would still fall down in the case of 
parallel infrastructure.  It would be fairly easy to get around that 
failure if one were to modify the dependency chain to take into account 
all of the measured parallel infrastructure.  In this case, you would 
represent the chain as:
     [BN|BN] -< [DC|DC] -< [DA|DA] -< [DE|DE] -< managed host
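
A rough Python sketch of how the parallel representation might be 
evaluated, again only as an illustration and not as SEC rules.  The 
assumptions here are mine: each bracketed group is modelled as a set of 
redundant devices, a group counts as failed only when every member is 
down, and every member of a group is taken to be connected to every 
member of the next group (which real cabling may not honour).  The 
device names are hypothetical.

```python
from typing import List, Optional, Set


def failed_tier(chain: List[Set[str]], down: Set[str]) -> Optional[Set[str]]:
    """Return the first tier, counting from the backbone, whose members
    are all down, or None if every tier still has a working member.

    `chain` is ordered from the backbone toward the managed host.
    """
    for tier in chain:
        if tier and tier <= down:  # every redundant member is unavailable
            return tier
    return None


# [BN|BN] -< [DC|DC] -< [DA|DA] -< [DE|DE] with made-up device names
chain = [
    {"BN1", "BN2"},
    {"DC1", "DC2"},
    {"DA1", "DA2"},
    {"DE1", "DE2"},
]

# One aggregator down: the redundant partner still carries traffic.
print(failed_tier(chain, {"DA1"}))  # None

# Both aggregators down: that tier is the likely root cause.
print(sorted(failed_tier(chain, {"DA1", "DA2", "DE1"})))  # ['DA1', 'DA2']
```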

Tim

-- 
Tim Peiffer
Network Support Engineer
Office of Information Technology
University of Minnesota/NorthernLights GigaPOP

+1 612 626-7884 (desk)

