At 9:17 PM +0000 8/20/02, Tribavan Raina wrote:
>Hi all,
>
>
>I have a customer who is using Cat 4006 as a backbone switch.The customer
>needs to find out the some questions regarding the failure of this switch.
>
>4) what are the risks and likelyhood of component or delivery failure.
>
>4) what ways can we i) reduce the likelyhood of failure in the first
>instance
>
>
>Hope some of you experienced guys can help meout.Does cisco provide MTBF
>stats about the devices or not.
>
>It is urgent plz help me out.
>

In designing high-availability systems, I haven't found MTBF to be terribly useful. MTBF is useful when you are dealing with a large number of identical systems (e.g., brakes on a car), but with only one or two devices, the sample is too small for the statistic to tell you much.

What I find much more useful is MTTR. Discuss with your customer what the cost of downtime would be, and then consider the amount of downtime that would result from various failure modes. Let's say the supervisor board fails. Does the customer have people available 24/7 who are qualified to replace it? Does the customer keep a spare? If it will take 24 hours to get a replacement, multiply the hourly cost of downtime by 24.

Again, when you are dealing with a single component or a small number of them, the specific subcomponent that will fail is fairly unpredictable. It's often simpler to have a backup box than to guess the spares you will need and have hardware-qualified technicians always available.

In considering reliability, also consider such things as maintenance, including software upgrades. At the moment, I'm designing a high-availability system for a clinical medical customer, and the number of devices needed at a critical point is:

    P + B + M

where

    P is the number of devices needed to handle the normal production load
    B is the number of devices on hot standby, usually 1
    M is the number of devices available for maintenance, almost always 1

If you have several clusters of identical and colocated equipment, the M devices, but not the B devices, can be shared across some reasonable number of clusters.

Returning to your question about reducing failure:

1. Good power, filtered and uninterruptible (not just battery backup, but physically protected so the janitor doesn't unplug it to use the floor scrubber)
2. Proper environmental controls, certainly including temperature and humidity, but also local hazards such as vibration. Be sure cooling input and hot air output can't be blocked by other equipment.
3. Screw down connectors and/or tie-wrap them.
4. Log all changes, preferably to a syslog on write-once media (e.g., CD-R)
5. Be careful about who has enable passwords, and change them periodically
6. Always have a TFTP server available
7. Don't rush to use the latest software release unless you absolutely need a new feature, or the release fixes a bug that affects you
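
The two calculations described above (the downtime cost for a failure mode, and the P + B + M device count) can be sketched in a few lines of Python. This is only a rough illustration: the function names, the 5000-per-hour cost figure, and the rule of one shared maintenance device per four clusters are all assumptions made up for the example, not figures from Cisco or from any real design.

```python
import math


def downtime_cost(hourly_cost, hours_to_restore):
    """Cost of one outage: hourly cost of downtime times hours to restore."""
    return hourly_cost * hours_to_restore


def devices_needed(production, standby=1, maintenance=1):
    """P + B + M for a single critical point."""
    return production + standby + maintenance


def devices_for_clusters(clusters, production, standby=1, maintenance=1,
                         clusters_per_maintenance_pool=4):
    """Total devices for several identical, colocated clusters.

    Each cluster keeps its own P and B devices. The M devices are shared:
    one pool of `maintenance` devices per `clusters_per_maintenance_pool`
    clusters (the pool size of 4 is an assumed value, not a recommendation).
    """
    pools = math.ceil(clusters / clusters_per_maintenance_pool)
    return clusters * (production + standby) + pools * maintenance


# Supervisor fails and the replacement takes 24 hours to arrive
# (5000 per hour is a hypothetical cost of downtime).
print(downtime_cost(5000, 24))

# One critical point with 2 production devices: P=2, B=1, M=1.
print(devices_needed(2))

# Three colocated clusters of that kind, sharing one maintenance device.
print(devices_for_clusters(3, 2))
```

Comparing the last figure with three times the single-cluster count shows what sharing the M devices saves; comparing the downtime cost with the price of a spare or a backup box is the conversation to have with the customer.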