Re: Chances of failure [7:51788]

Howard C. Berkowitz Tue, 20 Aug 2002 22:08:27 -0700

At 9:17 PM +0000 8/20/02, Tribavan Raina wrote:
>Hi all,
>
>
>I have a customer who is using a Cat 4006 as a backbone switch. The customer
>needs answers to some questions regarding the failure of this switch.
>
>4)  what are the risks and likelihood of component or delivery failure.
>
>4) what ways can we i) reduce the likelihood of failure in the first
>instance
>
>
>Hope some of you experienced guys can help me out. Does Cisco provide MTBF
>stats about the devices or not?
>
>It is urgent, please help me out.
>
In designing high-availability systems, I haven't found MTBF to be
terribly useful.  MTBF is useful when you are dealing with a large
number of identical systems (e.g., brakes on a car), but with only
one or two devices, the sample is too small for the figure to be
dependable.


What I find much more useful is MTTR (mean time to repair). Discuss
with your customer what the cost of downtime would be, and then
consider the amount of downtime that would result from various
failure modes.  Let's say the supervisor board fails. Does the
customer have people available 24/7 who are qualified to replace it?
Does the customer keep a spare? If it will take 24 hours to get a
replacement, multiply the hourly cost of downtime by 24.
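
To make that arithmetic concrete, here is a minimal sketch in Python. The
hourly cost and the restore times are made-up figures for illustration only,
and the scenario names are my own; plug in the customer's real numbers.

```python
def downtime_cost(hourly_cost, hours_to_restore):
    """Cost of one outage: hourly cost of downtime times hours until service is back."""
    return hourly_cost * hours_to_restore


# Hypothetical figure: what one hour of downtime costs the customer.
HOURLY_COST = 5_000

# Hypothetical hours to restore service under different sparing arrangements.
scenarios = {
    "no spare, 24-hour replacement": 24,
    "spare on site, technician on call": 4,
    "backup box on hot standby": 0.1,
}

costs = {name: downtime_cost(HOURLY_COST, hours) for name, hours in scenarios.items()}

for name, cost in costs.items():
    print(f"{name}: {cost:,.0f}")
```

Comparing each figure with the price of a spare or a second box is usually
a more persuasive argument than any MTBF number.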

Again, when you are dealing with single or small numbers of 
components, the specific subcomponent that will fail is fairly 
unpredictable. It's often simpler to have a backup box than to guess 
the spares you will need and have hardware-qualified technicians 
always available.

In considering reliability, also consider such things as maintenance, 
including software upgrades. At the moment, I'm designing a 
high-availability system for a clinical medical customer, and the 
number of devices needed at a critical point is:

           P + B + M

where P is the number of devices needed to handle the normal production load,
       B is the number of devices on hot standby (usually 1),
       M is the number of devices available for maintenance (almost always 1).

If you have several clusters of identical and colocated equipment, 
the M, but not the B, devices can be shared across some reasonable 
number of clusters.
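
As a rough sketch of that count, the Python below computes P + B + M for one
critical point and then a total for several colocated clusters. The function
names and the clusters_per_shared_m parameter are my own assumptions; how many
clusters can reasonably share one maintenance device is a judgment call that
the formula itself does not settle.

```python
import math


def devices_needed(p, b=1, m=1):
    """Devices at one critical point: production + hot standby + maintenance."""
    return p + b + m


def total_devices(clusters, p, b=1, m=1, clusters_per_shared_m=1):
    """Total devices across identical, colocated clusters.

    Every cluster keeps its own P and B devices. The M devices are shared
    by groups of `clusters_per_shared_m` clusters (an assumed parameter).
    """
    maintenance_groups = math.ceil(clusters / clusters_per_shared_m)
    return clusters * (p + b) + maintenance_groups * m


# Hypothetical example: 3 clusters, each needing 2 production devices.
print(devices_needed(p=2))
print(total_devices(clusters=3, p=2))
print(total_devices(clusters=3, p=2, clusters_per_shared_m=3))
```

With these made-up numbers, sharing one maintenance device across all three
clusters saves two boxes, while each cluster still has its own hot standby.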

Returning to your question about reducing failure:

    1.  Good power, filtered and uninterruptible (not just battery backup,
        but physically protected so the janitor doesn't unplug it to use
        the floor scrubber)
    2.  Proper environmental controls, certainly including temperature and
        humidity, but also local hazards -- vibration, etc.  Be sure cooling
        input and hot air output can't be blocked by other equipment.
    3.  Screw down connectors and/or tie-wrap them.
    4.  Log all changes, preferably to a syslog server that records on
        write-once media (e.g., CD-R)
    5.  Be careful about who has the enable passwords, and change them
        periodically
    6.  Always have a TFTP server available
    7.  Don't rush to use the latest software release unless you absolutely
        need a new feature, or the release fixes a bug




Message Posted at:
http://www.groupstudy.com/form/read.php?f=7&i=51817&t=51788
--------------------------------------------------
FAQ, list archives, and subscription info: http://www.groupstudy.com/list/cisco.html
Report misconduct and Nondisclosure violations to [EMAIL PROTECTED]
