> On Thu, 6 Feb 2003 14:13, Jason Lim wrote:
> > I was wondering what kind of failures you experience with long-running
> > hardware.
>
> I don't recall seeing a computer that had been in service for more than 3
> months fail in any way not associated with movement. Moving parts (fans
> and hard drives) die. Expansion boards and motherboards can die if they
> are moved or tweaked.

Well, these systems are treated like kings (or, you could say, like
babies). They are almost never shut down, touched, or otherwise disturbed,
beyond cleaning the air filters. They are rackmount servers, so the filters
are at the front and can be removed, cleaned, and replaced without doing
anything to the system itself or moving anything (except the filter).

> If you only clean air filters while leaving the machine in place, and if
> the fans are all solid with ball bearings, then it should keep running
> for many years.

We know sleeve-bearing fans die pretty quickly and that ball-bearing fans
tend to keep running much longer, but do you know approximately how long
we're talking about? Is "long" 3 years, 5 years, or possibly even longer
than that? I know this is highly variable, but perhaps the better question
is: how long should one run a ball-bearing fan before replacing it?

From what others have said here, fans seem to be the weakest point in a
server (with hard disks second), and since fans in some locations are VERY
important (e.g. the CPU fan), a failure there could cause even more
downtime than a failed hard disk: hard disks can have redundancy (RAID 1,
5, etc.), but I've rarely heard anyone talk of "redundant fans". Even in
expensive Dell and HP servers there is usually only one fan on each CPU.
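Rather than guessing at lifetimes, one option might be to watch the fan
speeds, since a fan often slows down before it dies outright. A rough
sketch of the idea (assuming lm-sensors is installed and sensors-detect
has been run; the "fanN" labels vary by board, and the minimum RPM below
is a made-up threshold):

  #!/bin/sh
  # warn-fans.sh -- print a warning for any fan that lm-sensors
  # reports as spinning below a minimum RPM. Run it from cron;
  # cron mails any output to root, so a slowing fan gives some
  # advance notice before it stops completely.
  MIN_RPM=2500    # hypothetical threshold -- tune per machine

  sensors 2>/dev/null | awk -v min="$MIN_RPM" '
      /^fan[0-9]+:/ {
          rpm = $2 + 0
          if (rpm < min)
              printf "WARNING: %s %d RPM (below %d)\n", $1, rpm, min
      }'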
> > Most of us run servers with very long uptimes (we've got a server here
> > with uptime approaching 3 years, which is not long compared to some,
> > but we think it is pretty good!).
>
> I think that's a bad idea. I've never seen a machine with an uptime of >1
> year boot correctly. In my experience, after more than a year of running,
> someone will have changed something that makes either the OS or the
> important applications fail to start correctly, and will have forgotten
> what they did (or left the company).

Everyone who has worked for us has stayed with us, fortunately :-)
Everyone also keeps a log of what is done to each system. The only thing
I've seen happen to such a system is that when it finally is booted, fsck
automatically checks the filesystems and usually comes up with hundreds of
errors or so. But once those are fixed, it usually boots up okay.

> > Most of these servers either have 3ware RAID cards, or have some other
> > sort of RAID (SCSI, IDE, software, etc.). The hard disks are replaced
> > as they fail, so by now some RAID 1 drives are actually 40GB when only
> > about 20GB is used, because the RAID hardware cannot "extend" to use
> > the extra size (but this is a different issue).
>
> Software RAID can deal with this.

I will investigate this further.
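If these boxes do get moved to Linux software RAID, my understanding is
that growing a RAID 1 onto larger replacement disks goes roughly like this
(a sketch only: the device names and mount point are made up, and it
assumes a reasonably recent mdadm plus an ext2/ext3 filesystem resized
offline -- check the man pages before trying it on anything real):

  # Swap the small disks for larger ones one at a time, letting
  # the mirror resync in between.
  mdadm /dev/md0 --fail /dev/hdc1 --remove /dev/hdc1
  mdadm /dev/md0 --add /dev/hde1     # new, larger disk
  cat /proc/mdstat                   # wait for the resync to finish

  # Once BOTH members are the larger size, grow the array...
  mdadm --grow /dev/md0 --size=max

  # ...then grow the filesystem into the new space.
  umount /home
  e2fsck -f /dev/md0
  resize2fs /dev/md0
  mount /home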
> > Now... we can replace all the fans in the systems (e.g. CPU fan, case
> > fans, etc.). Some even suggested we jimmy an extra fan on sideways on
> > the CPU heatsink, so if the top fan fails at least some airflow is
> > still being pushed around, which is better than nothing (sort of like
> > a redundant CPU fan system).
>
> Not a good idea for a server system. Servers are designed to have air
> flow in a particular path through the machine. Change that in any way
> and you might get unexpected problems.

The non-brand-name rackmount servers usually aren't _that_ well designed.
Many of them can accommodate a variety of motherboard types, which means
the location of the CPU varies and could be in any of numerous places
inside the chassis, so the chassis designers can't assume the CPU will be
in one particular spot.

I'm guessing that as long as the CPU fan blows down and towards the rear
of the server (the exhaust), it follows the general airflow of the system.
Even expensive systems from Dell and the like usually have the general
airflow going towards the back. We could simply "enhance" this effect by
putting a sideways fan on the CPU pointing backwards. All of these systems
are at least 3U, so heat is not as critical as in 1U or similar systems.

> > But how about the motherboards themselves? Is it common for something
> > on the motherboard to fail after 3-4 years of continuous operation?
>
> I've only seen motherboards fail when having RAM, CPUs, or expansion
> cards upgraded or replaced.
>
> I've heard of CPU and RAM failing, but only in situations where I was
> not confident that they had not been messed with.

So I guess it is safe to say that, in general, CPU and RAM will not fail
provided they are not tampered with.

> > We keep the systems at between 18-22 degrees Celsius (tending towards
> > the lower end) as we've heard/read somewhere that for every degree
> > drop in temperature, hardware lifetime is extended by X number of
> > years. Not sure if that is still true?
>
> Also try to avoid changes in temperature. Thermal expansion is a
> problem. Try to avoid having machines turned off for any period of time.
> If working on a server with old hard drives, power the drives up and
> keep them running unattached to the server while you are working, for
> best reliability. Turning an old hard drive off for 30 minutes is
> regarded as being a great risk.

Temperature is usually kept constant. Depending on where a server sits in
the rack (top/bottom), the temperature tends to vary; the top tends to be
2-3 degrees higher than the bottom. On that note, what temperature
difference do you observe between top and bottom?
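One way to actually measure that would be to log the sensor readings from
cron on a top-of-rack box and a bottom-of-rack box and compare them after
a week or so. A minimal sketch, again assuming lm-sensors (the file path
and grep pattern are made up and depend on the chip's label names):

  # /etc/cron.d/templog -- record temperatures every 10 minutes.
  # (% must be escaped as \% inside a crontab entry.)
  */10 * * * *  root  (date '+\%Y-\%m-\%d \%H:\%M'; sensors | grep -i temp) >> /var/log/temps.log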
> But the best thing to do is regularly replace hard drives. Hard drives
> more than 3 years old should be thrown out. The best thing to do is only
> buy reasonably large hard drives (say a minimum of 40G for IDE and 70G
> for SCSI). Whenever a hard drive seems small it's probably due to be
> replaced.

The thing that really bites is that "40GB" hard disks from different
manufacturers seem to have quite different formatted capacities... heck,
we've seen different capacities from the same manufacturer across slightly
different model numbers in the same model line! I guess one way around it
would be to pre-purchase a whole batch of matching-size drives, but then
you run the risk of first using them a couple of years later, and they
might not spin up at all by that time :-/

Got any suggestions to get around the above?
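A partial answer might be to sanity-check the drives -- both the
in-service ones and any spares sitting on the shelf -- with SMART before
trusting them. A rough sketch using smartmontools (the device names are
made up, and the exact flags vary between smartctl versions, so check the
man page first):

  #!/bin/sh
  # check-disks.sh -- report SMART health and the attributes that
  # most often precede failure (reallocated/pending sectors).
  for dev in /dev/hda /dev/hdc; do
      echo "=== $dev ==="
      smartctl -H "$dev"
      smartctl -A "$dev" | grep -i -e reallocated -e pending
  done

  # For stored spares: power them up every few months and run a
  # long self-test so they don't quietly die on the shelf:
  #   smartctl -t long /dev/hdX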
Thanks for the info!

-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]