RE: Network Reliability Engineering

2002-05-20 Thread Randy Neals



While it is possible to get the FIT numbers for hardware and calculate
network availability, our experience has been that modelling hardware
reliability and calculating network availability was not particularly
useful, as hardware and fiber transmission systems are usually the least
significant factor in overall network availability. Hardware failures are
also easy to design around with redundant hardware, more boxes, or diverse
fiber routes.
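As a rough illustration of why hardware failures are the easy part, here is a sketch of the standard MTBF/MTTR availability arithmetic and the effect of a redundant pair. All figures are illustrative assumptions, not vendor data:

```python
# Sketch: steady-state availability from MTBF/MTTR, and the effect of
# redundancy. The MTBF and repair-time figures below are made-up examples.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def parallel(*avails: float) -> float:
    """N redundant components: up unless every member is down."""
    unavail = 1.0
    for a in avails:
        unavail *= 1.0 - a
    return 1.0 - unavail

# Assume a router with 100,000 h MTBF and a 4 h mean time to repair.
single = availability(100_000, 4)          # ~0.99996 (about four nines)
redundant_pair = parallel(single, single)  # unavailability squared
print(f"single: {single:.6f}, redundant pair: {redundant_pair:.10f}")
```

Doubling the box turns the unavailability into its square, which is why throwing redundant hardware at the problem works so well; a software bug or a fat-fingered config hits both members of the pair at once, so no such multiplication applies.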

Network software issues and operational mistakes seem to affect network
availability more than hardware does.

An example would be a bug in a routing protocol that causes an erroneous
update to propagate through the network, or, in the operational category, a
typo that causes unintended results.

In both cases the failure is not limited to one box; its effects often
propagate throughout the entire network.


How do you objectively calculate network availability when the network
is highly dependent upon the correct functioning of a binary blob of
proprietary code, but your only visibility inside the blob is a release note
listing the symptoms experienced by others who have run the code in a
similar, but probably not identical, network configuration?

It seems unlikely that vendors are going to disclose more about their
proprietary binary blobs, in order to protect their I.P. assets. This leaves
the network operator without much to assess code reliability with.

Perhaps we need to change the business model around network code licensing
to ensure vendors comprehend the impact of a bad release, and share the pain
when they release a buggy blob that has customer impact on the network.

Rather than a one-time fee to license the code when you buy the box, a small
recurring monthly license fee, with no payment in any month that a software
bug crashes your network, would act as a continuous form of positive
reinforcement for your box vendor to ensure your network has high
availability code.

The box vendor would have a recurring revenue stream for software licensing
that is only as stable and reliable as their software.
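The incentive works out to simple expected-revenue arithmetic. A sketch, where the monthly fee and the crash probability are both made-up numbers for illustration:

```python
# Sketch: vendor's expected annual license revenue under a pay-per-good-month
# model. The fee and crash probability are illustrative assumptions.

def expected_annual_revenue(monthly_fee: float, p_crash_month: float) -> float:
    """Vendor collects the fee only in months with no software-caused crash."""
    return 12 * monthly_fee * (1.0 - p_crash_month)

print(expected_annual_revenue(1_000.0, 0.0))   # flawless code: 12000.0
print(expected_annual_revenue(1_000.0, 0.25))  # crash one month in four: 9000.0
```

Every buggy release translates directly into forgone revenue, so the vendor's finance department becomes an ally of your NOC.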

-R


-Original Message-
From: Pete Kruckenberg
To: [EMAIL PROTECTED]
Sent: 5/18/2002 7:13 PM
Subject: Network Reliability Engineering


I'm looking for some good reference materials to do some
reliability engineering calculations and projections.

This is to justify increased redundancy, and I want to
include quantifiable numbers based on MTBF data and other
reliability factors, kind of a scientific justification
instead of just the typical emotional appeal using
analyst/vendor FUD.

I'd appreciate references on how to do this in a network
environment (what data to collect, how to collect it, how to
analyze, etc). Also any data (or rules of thumb) on typical
MTBFs for network events that I won't find on vendor product
slicks (like what's the MTBF on IOS, or human-caused service
outages of various types, etc).

If someone has put together something remotely like this
that they'd care to share, that'd be incredibly helpful.

Thanks.
Pete.




Re: Network Reliability Engineering

2002-05-19 Thread Nigel Clarke


Try The Art of Testing Network Systems 

ISBN: 0-471-13223-3

---

Nigel Clarke
Network Security Engineer
[EMAIL PROTECTED] 




Re: Network Reliability Engineering

2002-05-18 Thread Ralph Doncaster


Good luck.  For a proper scientific analysis you'd need MTBF info on every
point of failure - i.e. the physical link, CSU/DSU, power supply, ...
As a rather non-scientific observation, a couple outages per year of 1-4
hours seems to be quite common for a single-homed T1 or faster connection,
be it from WorldCom, ATT, Sprint...

I think the arguments in favor of dual-homing are pretty cut and
dried.  Tri-homing vs dual-homing would be a much tougher benefit to
quantify.
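That rule of thumb can be turned into numbers. A sketch that assumes the two upstreams fail independently, which diverse physical paths only approximate in practice:

```python
# Sketch: availability of a single-homed circuit from observed outage hours,
# and the dual-homed case assuming independent failures of the two upstreams.

HOURS_PER_YEAR = 8766  # average year, including leap years

def availability_from_outages(outage_hours_per_year: float) -> float:
    return 1.0 - outage_hours_per_year / HOURS_PER_YEAR

# Two outages a year of ~3 hours each, per the observation above.
single = availability_from_outages(6.0)   # roughly three nines
dual = 1.0 - (1.0 - single) ** 2          # if failures are truly independent
print(f"single-homed: {single:.5f}, dual-homed: {dual:.7f}")
```

The dual-homed figure is optimistic: shared conduit, a common building entrance, or a bug hitting both border routers breaks the independence assumption, which is why the tri-homing benefit is so much harder to quantify.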

Ralph Doncaster
principal, IStop.com 
div. of Doncaster Consulting Inc.

On Sat, 18 May 2002, Pete Kruckenberg wrote:

 
 I'm looking for some good reference materials to do some
 reliability engineering calculations and projections.
 
 This is to justify increased redundancy, and I want to
 include quantifiable numbers based on MTBF data and other
 reliability factors, kind of a scientific justification
 instead of just the typical emotional appeal using
 analyst/vendor FUD.
 
 I'd appreciate references on how to do this in a network
 environment (what data to collect, how to collect it, how to
 analyze, etc). Also any data (or rules of thumb) on typical
 MTBFs for network events that I won't find on vendor product
 slicks (like what's the MTBF on IOS, or human-caused service
 outages of various types, etc).
 
 If someone has put together something remotely like this
 that they'd care to share, that'd be incredibly helpful.
 
 Thanks.
 Pete.
 
 
 




Fwd: RE: Network Reliability Engineering

2002-05-18 Thread blitz



AHH, MTBF data from vendors... well, there goes the idea of THAT project. 
You'll find that data, IF you can find it, will have been calculated by sales 
cretins, not engineers.




Check out this book:

  High-Availability Network Fundamentals
  Cisco Press
  ISBN 1-58713-017-3

Despite its Cisco Press origin, the book is 99% vendor-neutral and applies
to any equipment. It helps you calculate MTBF-based availability of entire
network paths, factoring in various types of redundancy. You're on your own
collecting actual MTBF data from vendors, but this book may help you put it
together into something sensible.
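The book's core technique is composing component availabilities along a path: serial elements multiply, redundant elements multiply their unavailabilities. A minimal sketch, with made-up component figures standing in for real vendor MTBF data:

```python
# Sketch: end-to-end path availability as the product of serial components,
# with one redundant pair folded in. All component figures are assumptions.

def series(*avails: float) -> float:
    """A path is up only if every serial component is up."""
    result = 1.0
    for a in avails:
        result *= a
    return result

def parallel(a: float, b: float) -> float:
    """A redundant pair is down only if both members are down."""
    return 1.0 - (1.0 - a) * (1.0 - b)

# CPE router -> CSU/DSU -> local loop -> provider edge (redundant pair)
path = series(0.9999, 0.99995, 0.999, parallel(0.9995, 0.9995))
print(f"end-to-end: {path:.5f}")
```

Note how the weakest serial component (here the local loop at 0.999) dominates the result, no matter how much redundancy you add elsewhere; that is the quantitative version of the argument for spending on the worst link first.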