Krugger wrote:
Hi,
We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
found the system to be quite unstable. After BIOS updates and kernel
Unstable when? When idle? Under heavy cpu load? Under heavy I/O?
During Install? Which OS/Dist/Kernel?
changes we still get random kernel panics when under load.
What kind of load? How big is the power supply? What kind of CPU?
Does anyone have these boards, and has anyone found a solution? I have
mailed other users of this board, who also reported random kernel panics
and an unusual number of hardware problems.
How many are unreliable? 1 of 1? 10 of 10? 64 of 64?
So far we have solved the
- broken BIOS problem with an update to the most recent BIOS.
- Discovered that some power supplies can produce problems
http://www.anandtech.com/mb/showdoc.aspx?i=2608
Power supplies do degrade over time, especially if overloaded.
- FS corruption due to a firmware problem in a RAID hardware board
Indeed, hardware RAID problems seem shockingly common.
- MCE chipkill errors (non-fatal) due to apparent bad RAM
Detected how? Has the new memory passed 24 hours of memtest86? Are you
using RAM certified as compatible with the S2882?
To be solved:
- random kernel panics that take out the logging even when all debug
flags are set in the kernel, as it fails to sync the disc during the
kernel panic.
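You could log the console to a serial port, so the panic text is captured
on a second machine even when the local disk never gets synced.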
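A minimal sketch of the serial idea, not a tested recipe for this board. It
assumes the panicking node boots with something like
"console=ttyS0,115200n8 console=tty0" on the kernel command line, and that a
second machine is attached by a null-modem cable. The device path, baud rate,
log file name and the function name capture_console are all assumptions
chosen for illustration.

```python
import time
from typing import BinaryIO, Callable, TextIO


def capture_console(port: BinaryIO, log: TextIO,
                    clock: Callable[[], float] = time.time) -> int:
    """Copy console lines from `port` to `log`, one timestamp per line.

    Returns the number of lines written. Each line is flushed right away
    so the last messages before a hang are not left in a buffer.
    """
    count = 0
    # readline() returns b"" at end of stream, which ends the loop.
    for raw in iter(port.readline, b""):
        # Panic output can contain garbage bytes; never fail on decode.
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        log.write(f"{clock():.3f} {text}\n")
        log.flush()
        count += 1
    return count


# Hypothetical usage on the logging machine. The serial line is assumed to
# have been configured beforehand, e.g. with: stty -F /dev/ttyS0 115200 raw
#
#   with open("/dev/ttyS0", "rb", buffering=0) as port, \
#        open("node-console.log", "a") as log:
#       capture_console(port, log)
```

The timestamps come from the logging machine, so they still make sense when
the panicking node's own clock and disk are gone.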
I've got at least 32 of these, and they seem pretty reliable.