Re: Hardware failure -- how to find out?

Marty Wed, 07 Mar 2007 17:58:34 -0800

Cassiano Leal wrote:

People,
I have at work a mixed Debian sarge/etch system running a firewall and VPN server on a K6-2.
Today, we were experiencing some connectivity problems, and we found out that they were caused by iptables not initiating properly and segfaulting. So, I went into the server room, to find out that the computer in question was beeping constantly, which led me to believe it was a hardware failure.

If it was the speaker then it could mean the motherboard was resetting while still in the BIOS.

The computer wouldn't respond to 'init 6' via ssh, so I tried Ctrl-Alt-Delete locally, without success. I tried to log in, and that failed too.

You probably got a kernel oops or your root hard drive went offline. A kernel oops will usually make it to the console but not to the logs. It's usually caused by CPU overheating or memory problems, and less commonly by a failing power supply or bad motherboard capacitors. If you have an old motherboard (over 5 years old) check for bulged or burned out capacitors. Make sure your fans are working and clear of dust.

If your hard drive went offline it could be a failing drive, bad data cable or bad power connector. If the drive is S.M.A.R.T.-aware you can check it with smartctl. You can also surface scan with fsck or, better, download and run the manufacturer's diagnostics for that drive.
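
For the smartctl check, here is a rough Python sketch of how I would wrap it. It relies on the fact that smartctl reports its findings as a bit mask in its exit status (see the smartctl man page); the bit descriptions below are my paraphrase of that page. The function names and the /dev/hda default are just assumptions for illustration, so substitute your own device.

```python
import subprocess

# Meaning of each bit in smartctl's exit status, paraphrased from smartctl(8).
SMARTCTL_STATUS_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART command to the disk failed",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes were at or below threshold at some time in the past",
    6: "device error log contains errors",
    7: "device self-test log contains errors",
}


def decode_smartctl_status(status):
    """Return the descriptions of every bit set in a smartctl exit status."""
    return [
        message
        for bit, message in sorted(SMARTCTL_STATUS_BITS.items())
        if status & (1 << bit)
    ]


def check_drive(device="/dev/hda"):
    """Run a health check plus error-log dump and decode the exit status.

    The device name is a placeholder. Needs root and smartmontools installed.
    """
    result = subprocess.run(
        ["smartctl", "-H", "-l", "error", device],
        capture_output=True,
        text=True,
    )
    return decode_smartctl_status(result.returncode)
```

Calling check_drive("/dev/hda") as root returns an empty list when smartctl has nothing to complain about, and one description per problem otherwise.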

So, the solution was to hit the power button and hope it would work. Luckily, it did straight away, and at the moment we are working without a problem.
But I still can't say what caused the problem. So, my question is: how do I trace this failure? Log files? Which ones?

If it was caused by a hard drive error then it will show up in /var/log/messages and /var/log/syslog unless the disk went offline before any errors were logged. If you are ambitious you can use the kernel's core dump driver/module, make a core dump after the crash and examine it with gdb. You probably want to dump to a device other than your root partition drive in case of hard drive problems.
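
To save some scrolling through the logs, something like the following Python sketch could pull out the suspicious lines. The search patterns are only my guess at what is worth looking for (I/O errors, IDE drive status errors, oopses, segfaults) and the function names are made up, so adjust both to taste.

```python
import re

# Illustrative patterns only; extend them for your own hardware and drivers.
SUSPECT_PATTERNS = [
    r"I/O error",
    r"DriveReady SeekComplete Error",
    r"\bOops\b",
    r"segfault",
]


def find_suspect_lines(lines, patterns=SUSPECT_PATTERNS):
    """Return the lines that match any of the given patterns."""
    regex = re.compile("|".join(patterns), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if regex.search(line)]


def scan_logs(paths=("/var/log/messages", "/var/log/syslog")):
    """Map each log path to its suspect lines; unreadable logs map to []."""
    hits = {}
    for path in paths:
        try:
            with open(path, errors="replace") as handle:
                hits[path] = find_suspect_lines(handle)
        except OSError:
            hits[path] = []  # missing file or insufficient permissions
    return hits
```

Run scan_logs() as root and look at the timestamps just before the reboot. An empty result does not prove the disk is fine, since the disk may have gone offline before anything was written.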


Any other tests I can run?

I use three kinds of tests, usually at the same time to maximize stress. One is a version of a bash script called "burnit", which was used to test for a K6 bug. It does repeated kernel compiles and runs checksums on the object files, making sure they are the same for each loop. This is the best single test I've found for stress testing PCs.
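
I don't have the script handy to post, but the idea is simple enough to sketch. The Python below is not burnit itself, just my outline of the same loop: rebuild, checksum every object file, and compare against the first pass. The function names are mine, the build commands are whatever you pass in, and it assumes the build produces identical object files on a healthy machine.

```python
import hashlib
import os
import subprocess


def checksum_tree(root, suffix=".o"):
    """Map each file under root ending in suffix to its SHA-256 digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                with open(path, "rb") as handle:
                    digest = hashlib.sha256(handle.read()).hexdigest()
                sums[os.path.relpath(path, root)] = digest
    return sums


def stress_loop(build_cmds, root, loops=3, suffix=".o"):
    """Rebuild repeatedly and report object files that differ from loop 1.

    build_cmds is a list of argument lists, for example
    [["make", "clean"], ["make"]]. Returns (loop_number, path) tuples.
    """
    baseline = None
    mismatches = []
    for loop in range(1, loops + 1):
        for cmd in build_cmds:
            subprocess.run(cmd, cwd=root, check=True)
        sums = checksum_tree(root, suffix)
        if baseline is None:
            baseline = sums  # first pass is the reference
            continue
        for path in sorted(set(baseline) | set(sums)):
            if baseline.get(path) != sums.get(path):
                mismatches.append((loop, path))
    return mismatches
```

Any mismatch, or a build that dies with a compiler crash, points at CPU, memory or power trouble rather than at the software. Run it for many loops, since intermittent faults take a while to show.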

Another good test is running debsums on all installed packages. I run it with the -c flag to minimize output, and since I have a local Debian archive, I also use the --generate=all option to add LAN traffic.

The third test is memtest, which I run on the memory that I don't need for the other tests.

I have never had a hardware problem that wasn't revealed by these tests, but some infrequent failures took hours or days to show up.


Please, bear in mind that this is a production firewall system.

Good reason to set up a backup firewall. Any old PC will do. The backup firewall can serve as a replacement while you do the stress testing. After fixing the failing PC, you can keep the backup running and ready to replace the primary firewall at any time.
