Cassiano Leal wrote:
People,
At work I have a mixed Debian sarge/etch system running a firewall and
VPN server on a K6-2.
Today we were experiencing some connectivity problems, and we found out
that they were caused by iptables not starting properly and
segfaulting. So I went into the server room and found that the
computer in question was beeping constantly, which led me to believe it
was a hardware failure.
If the beeping came from the PC speaker, it could mean the motherboard was
resetting repeatedly while still in the BIOS.
The computer wouldn't respond to 'init 6' via ssh, so I tried to
Ctrl-Alt-Delete locally, without success. Tried to login, again to fail.
You probably got a kernel oops, or your root hard drive went offline. A kernel
oops will usually make it to the console but not to the logs. It's usually
caused by CPU overheating or memory problems, and less commonly by a
failing power supply or bad motherboard capacitors. If you have an old
motherboard (over 5 years old), check for bulged or burned-out capacitors. Make
sure your fans are working and clear of dust.
If your hard drive went offline, it could be a failing drive, a bad data cable,
or a bad power connector. If the drive is S.M.A.R.T.-aware you can check it with
smartctl. You can also surface scan with fsck (on ext2/ext3, e2fsck -c runs a
badblocks scan) or, better, download and run the manufacturer's diagnostics for
that drive.
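
If you want to script the smartctl check, here is a minimal Python sketch. It
relies on the exit status bitmask described in the smartctl(8) man page. The
device name /dev/hda and the function names are only assumptions for
illustration, and smartctl normally needs to be run as root.

```python
import subprocess

# Meaning of each bit in smartctl's exit status, paraphrased from smartctl(8).
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART/ATA command failed or a checksum error was found",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "some attributes were at or below threshold in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_smartctl_status(status):
    """Return a description for every bit set in smartctl's exit status."""
    return [msg for bit, msg in SMARTCTL_EXIT_BITS.items() if status & (1 << bit)]


def check_drive(device="/dev/hda"):
    """Run 'smartctl -a' on the device (assumed name) and decode the result."""
    result = subprocess.run(
        ["smartctl", "-a", device], capture_output=True, text=True
    )
    return decode_smartctl_status(result.returncode)
```

An empty list from check_drive() means smartctl reported nothing wrong; it does
not prove the drive is healthy, so the manufacturer's diagnostics are still
worth running.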
So, the solution was to hit the power button and hope it would work.
Luckily, it did straight away, and at the moment we are working without a
problem.
But I still can't say what caused the problem. So, my question
is: how do I trace this failure? Log files? Which ones?
If it was caused by a hard drive error then it will show up in
/var/log/messages and /var/log/syslog, unless the disk went offline before any
errors were logged. If you are ambitious you can use the kernel's core dump
driver/module, make a core dump after the crash and examine it with gdb. You
probably want to dump to a device other than your root partition's drive in
case of hard drive problems.
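
To save some scrolling, the two log files can be searched with a short script.
This is only a rough Python sketch: the keyword list is my guess at what disk
and crash messages tend to contain, not an exhaustive or authoritative set, so
adjust it for your kernel and drivers.

```python
import re
from pathlib import Path

LOG_FILES = ["/var/log/messages", "/var/log/syslog"]

# Assumed keywords; extend this to match the messages your system produces.
PATTERN = re.compile(r"I/O error|Oops|segfault|lost interrupt", re.IGNORECASE)


def suspicious_lines(lines, pattern=PATTERN):
    """Return the lines that match the pattern, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def scan_logs(paths=LOG_FILES):
    """Map each readable log file to its list of matching lines."""
    hits = {}
    for name in paths:
        path = Path(name)
        if not path.is_file():
            continue  # skip logs that do not exist on this system
        with path.open(errors="replace") as handle:
            hits[name] = suspicious_lines(handle)
    return hits
```

Rotated logs (messages.0, syslog.0 and so on) are not covered here; pass their
paths to scan_logs() if the crash happened before the last rotation.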
Any other tests I can run?
I use three kinds of tests, usually at the same time to maximize stress. One is
a version of a bash script called "burnit", which was used to test for a K6 bug.
It does repeated kernel compiles and runs checksums on the object files, making
sure they are the same for each loop. This is the best single test I've found
for stress testing PCs.
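
The idea behind that kind of test can be sketched in Python as follows. This is
not the burnit script itself, just an illustration of the loop: build, checksum
the object files, and compare against the first pass. The function names and
the build and clean commands are assumptions, and it presumes the build is
deterministic (objects that embed a timestamp will differ legitimately and
should be excluded).

```python
import hashlib
import subprocess
from pathlib import Path


def checksum_tree(root, suffix=".o"):
    """Return {relative path: SHA-256 hex digest} for files ending in suffix."""
    sums = {}
    for path in sorted(Path(root).rglob("*" + suffix)):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            sums[str(path.relative_to(root))] = digest
    return sums


def differing_files(reference, current):
    """List files whose checksum changed, appeared, or disappeared."""
    names = set(reference) | set(current)
    return sorted(n for n in names if reference.get(n) != current.get(n))


def burn_in(build_cmd, clean_cmd, root, loops=10):
    """Rebuild repeatedly; return (loop number, bad files) on the first mismatch."""
    reference = None
    for i in range(1, loops + 1):
        subprocess.run(clean_cmd, cwd=root, check=True)
        subprocess.run(build_cmd, cwd=root, check=True)
        current = checksum_tree(root)
        if reference is None:
            reference = current  # the first pass is the baseline
        else:
            bad = differing_files(reference, current)
            if bad:
                return i, bad
    return None  # every loop matched the baseline
```

A call might look like burn_in(["make"], ["make", "clean"], "/usr/src/linux"),
where the source path is only an example. A failed compile raises
CalledProcessError, which on flaky hardware is itself a useful result.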
Another good test is running debsums on all installed packages. I run it with
the -c flag to minimize output, and since I have a local Debian archive, I also
use the --generate=all option to add LAN traffic.
The third test is memtest, which I run on whatever memory I don't need for the
other tests.
I have never had a hardware problem that wasn't revealed by these tests, but
some infrequent failures took hours or days to show up.
Please, bear in mind that this is a production firewall system.
Good reason to set up a backup firewall. Any old PC will do. The backup
firewall can serve as a replacement while you do the stress testing. After
fixing the failing PC, you can keep the backup running and ready to replace
the primary firewall at any time.