I have a network that has quite a bit of 802.1q and 802.3ad that is
built largely on L2 switching regionally. The L2 networks are
micro-segmented along areas that make sense with administrative domains,
and then routed by core switching devices. I am looking for root-cause
approaches that one could model with SEC. Has anyone done this kind of
work and would be willing to share it?
The problem I am trying to define is knowing how widespread a problem
(maybe power, broadcast, congestion) is, and being accurate about it.
Within a data switching site, all of my devices are adjacent at L3 and
considered equals of each other. The only way to define locality of
reference is to know the L1 and L2 adjacency. The problem is compounded
somewhat when one considers Spanning Tree and redundant L2 connectivity.
I can imagine that there are a few general approaches that could be done
with L2.
Scenarios:
1) Count match of names. If the number of affected devices that belong
to a building's infrastructure crosses a given threshold, then raise an
event indicating a possible building outage. Subsequent new devices
that match the same building name are suppressed, because the building
outage event raised earlier covers them.
2) Defined architecture. If you have a defined, known architecture, a
dependency chain could be deployed.
BN-< CN -< RA -< AA -< BA -< EN -< PI, WE
BN-< DC -< DA -< DE -< managed host
BN-< BR -< GR
BN => Backbone Node ; BR => Border Router ; GR => GigaPOP Aggregation Router
DC => DataCenter Router ; DA => DataCenter Aggregator ; DE => DataCenter Edge
CN => Core Router Node ; RA => Regional Aggregator ; AA => Area Aggregator ; BA => Building Aggregator
EN => Edge Node ; PI => Power Injector ; WE => Wireless Edge
3) Dependency based upon measured adjacency at L1/L2. This would need
to use some method such as CDP to figure out the adjacency at L1. Any
unreachability/outage should raise a context (state) noting that a
device is unavailable. Given any device, you should know what the
dependency chain is between the device and EYE. Treat that dependency
chain as a set, and walk it for any outages. If a device in the chain
is unreachable, then you would expect that device to be the root cause.
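
To make scenario 3 a bit more concrete, here is a rough Python sketch of
the chain walk, not an SEC ruleset. Everything in it is an assumption
for illustration: the UPSTREAM table stands in for adjacency learned via
something like CDP, the device names are borrowed from the scenario 2
abbreviations, and "nearest the monitoring side" is my guess at which
unreachable device to blame when several in the chain are down.

```python
# Hypothetical adjacency data: each device maps to its upstream neighbor,
# i.e. the next hop toward the monitoring station. None marks the top.
UPSTREAM = {
    "managed-host": "DE",
    "DE": "DA",
    "DA": "DC",
    "DC": "BN",
    "BN": None,
}


def chain_to_monitor(device, upstream):
    """Return the dependency chain ordered from the monitoring side down to device."""
    chain = []
    seen = set()  # guards against loops in bad adjacency data
    while device is not None and device not in seen:
        seen.add(device)
        chain.append(device)
        device = upstream.get(device)
    chain.reverse()
    return chain


def root_cause(device, upstream, down):
    """Return the unreachable device nearest the monitoring side, or None."""
    for hop in chain_to_monitor(device, upstream):
        if hop in down:
            return hop
    return None


if __name__ == "__main__":
    unreachable = {"managed-host", "DE", "DA"}
    print(root_cause("managed-host", UPSTREAM, unreachable))
```

With DA, DE and the managed host all unreachable, the walk reports DA,
and the events for DE and the host could then be suppressed under it.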
Analysis:
1) appears to be too problematic and involves too many guesses.
There is no positive indication.
2) has the problem that a given architectural building block may or may
not be deployed. So if an edge plugs directly into a core, or more
likely other aggregation layers such as area and regional are not
deployed, then those aggregation points are undefined.
3) This looks to be the best, but would still fall down in the case of
parallel infrastructure. It would be fairly easy to get around that
failure if one were to modify the dependency chain to take into account
all of the measured parallel infrastructure. In this case, you would
represent it as:
[ BN|BN] -< [ DC|DC] -< [DA|DA] -< [DE|DE] -< managed host
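
A rough Python sketch of how the parallel representation might be
evaluated follows. It rests on assumptions that are mine: each bracketed
group becomes a set of hypothetical device names, every member of a
layer is assumed to connect to every member of the adjacent layers, and
a layer only counts as failed when all of its members are unreachable.

```python
# Hypothetical chain with parallel members per layer, ordered from the
# monitoring side down to the managed host.
CHAIN = [
    {"BN-1", "BN-2"},
    {"DC-1", "DC-2"},
    {"DA-1", "DA-2"},
    {"DE-1", "DE-2"},
    {"managed-host"},
]


def root_cause_layer(chain, down):
    """Return the first layer, nearest the monitor, whose members are all down."""
    for layer in chain:
        if layer <= down:  # subset test: every parallel member is unreachable
            return layer
    return None


def degraded_layers(chain, down):
    """Return layers that lost some but not all members (redundancy reduced)."""
    return [layer for layer in chain if layer & down and not layer <= down]


if __name__ == "__main__":
    down = {"DA-1", "DA-2", "DE-1", "DE-2", "managed-host"}
    print(sorted(root_cause_layer(CHAIN, down)))
    print(degraded_layers(CHAIN, {"DC-1"}))
```

Losing a single DC would then show up as a degraded layer rather than
as the root cause of an outage further down.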
Tim
--
Tim Peiffer
Network Support Engineer
Office of Information Technology
University of Minnesota/NorthernLights GigaPOP
+1 612 626-7884 (desk)