Re: ECC status in FreeBSD

2004-12-20 Thread Charles Swiger
On Dec 20, 2004, at 3:55 PM, Brett Glass wrote:
> I'm getting ready to build some (hopefully) high reliability servers with
> ECC memory. I'd like to put FreeBSD on them. What facilities (if any) does
> FreeBSD have for:

> 1) Reporting the status of ECC memory (errors corrected, errors
>    uncorrected, etc.)?
> 2) Responding to uncorrectable errors?
A quick check of the archives suggests a FreeBSD version of a kernel  
module which pays attention to the ECC status of various chipsets is  
available from:

http://docs.freebsd.org/cgi/getmsg.cgi?fetch=113348+0+archive/2001/freebsd-hackers/20010318.freebsd-hackers

...based on the work for Linux at:
http://www.anime.net/~goemon/linux-ecc/
> 3) Mapping out portions of memory that produce repeated errors?
You can set an option in the loader to limit the physical memory  
available to FreeBSD, which could serve the purpose.
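
For illustration: the loader option meant here is presumably the hw.physmem
tunable described in loader(8); the message itself does not name it. A
minimal Python sketch, assuming physical memory is mapped contiguously from
address zero (often not the case on real hardware) and using a hypothetical
helper named physmem_limit, derives a /boot/loader.conf line that caps
usable memory just below the lowest failing address:

```python
MIB = 1024 * 1024  # bytes per mebibyte


def physmem_limit(lowest_bad_addr: int) -> str:
    """Build a loader.conf line that keeps FreeBSD below a failing address.

    Hypothetical helper: assumes RAM is mapped contiguously from 0, so a
    memory *amount* can stand in for an *address* limit.
    """
    if lowest_bad_addr < MIB:
        raise ValueError("failing address is too low to leave usable memory")
    # Round down to a whole MiB so the limit stays below the bad cell.
    whole_mib = lowest_bad_addr // MIB
    return f'hw.physmem="{whole_mib}M"'


# Example with a made-up failing address.
print(physmem_limit(0x3A5F1000))
```

Note that this gives up all memory above the fault, not only the bad cells,
so it is a stopgap until the module is replaced.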

However, your RAM isn't a hard drive, so the bad-sector remapping used
by hard drives is not fully applicable.  Your machine is expected not
to have any part of memory fail reproducibly, but if it does, it's time
to use the warranty and replace the entire chip.

> It seems to me that, for an operating system that prides itself on server
> stability and performance, such features are a must.
ECC is a fine idea, but the motherboard chipset pretty much does  
everything that is required (except for the reporting/syslogging), so  
the kernel doesn't need to be specially involved for the system to  
benefit from ECC protection.

--
-Chuck
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: ECC status in FreeBSD

2004-12-20 Thread Brett Glass
At 03:25 PM 12/20/2004, Charles Swiger wrote:

> However, your RAM isn't a hard drive, so the bad-sector remapping used
> by hard drives is not fully applicable.  Your machine is expected not
> to have any part of memory fail reproducibly, but if you do, it's time
> to use the warranty and replace the entire chip.

It's true that RAM is not a hard drive. However, if the problem is with
certain memory cells rather than, say, the row or column drivers, the
rest of the chip is usable. And if you did want to scuttle the entire
module on which the chip resided, you'd probably want to disable that
module in the meantime by telling the system not to use it. Certainly,
you'd at least want to know which module was failing. There's nothing
to tell you that right now.

> ECC is a fine idea, but the motherboard chipset pretty much does
> everything that is required (except for the reporting/syslogging), so
> the kernel doesn't need to be specially involved for the system to
> benefit from ECC protection.

Alas, right now there's no way to KNOW that you need to deal with a 
failing RAM module until you start experiencing random and possibly
destructive system panics or crashes. It'd be nice, at least, to see 
something in the logs or be able to collect statistics from the 
motherboard.

--Brett
