Re: ECC support

2015-09-15 Thread Konstantin Belousov
On Wed, Sep 16, 2015 at 12:14:00AM +0300, Andriy Gapon wrote:
> On 15/09/2015 23:53, Dieter BSD wrote:
> > Assuming that a board does have the necessary connections but
> > the firmware does not have ECC support, is there some reason that
> > ECC support could not be added to the OS instead of the firmware?
> 
> Yes, there is.  The memory controller is programmed by the code that runs from
> ROM and uses no RAM (or the CPU cache is used as the RAM).  Once the real RAM
> gets used it's too late to reprogram the DRAM controller.  This is true at least
> for most or all of the modern day x86 hardware.

For modern Intel hardware, the IMC config is locked before the BIOS passes
control to user code, i.e. the OS loader. It does not help that the
documentation for the IMC is not provided even under NDA.
___
freebsd-hardware@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-hardware
To unsubscribe, send any mail to "freebsd-hardware-unsubscr...@freebsd.org"


Re: ECC support

2015-09-15 Thread Don Lewis
On 15 Sep, Jim Thompson wrote:
> 
>> On Sep 15, 2015, at 5:19 PM, Igor Mozolevsky wrote:
>> 
>> On 15 September 2015 at 22:52, Jim Thompson wrote:
>> 
>> Errors are corrected "on-the-fly," corrected data is almost never
>> placed back in memory. If the same corrupt data is read again, the
>> correction process is repeated. Replacing the data in memory would
>> require processing overhead that could accumulate and significantly
>> diminish system performance. If the error occurred because of random
>> events and isn't a defect in the memory, the memory address will be
>> cleaned of the error when the data is overwritten with other data.
>> 
>>  
>> 
>> Just to correct a small oversight- most (if not all?) boards have an
>> option to scrub ECC memory in the background so as to prevent single
>> bit (recoverable) errors from turning into double bit (irrecoverable
>> but detectable) errors ;-)
> 
> I think you’ll find that the default for ‘scrub’ is off on most
> (perhaps all) boards.  There are reasons, and these relate directly to
> “significantly diminish system performance”, (above), as well as the
> greatly increased RAM sizes in use today.

The Gigabyte AM3+ motherboards that I'm using have all sorts of knobs
for controlling the scrub rate, with different knobs for cache scrubbing
vs. main memory scrubbing.  My somewhat more recent Asus AM3+ board with a
different BIOS brand basically just has an ECC on/off knob.



Re: ECC support

2015-09-15 Thread Don Lewis
On 15 Sep, Jim Thompson wrote:
> 
>> On Sep 15, 2015, at 5:10 PM, Don Lewis  wrote:
>> 
>> On 15 Sep, Dieter BSD wrote:
>>> Many of AMD's CPU/APU parts support ECC memory.  Not just the top of the
>>> line parts, but also many of the less expensive, less power hungry parts.
>>> However, many (most?) of the boards for these chips do not support ECC,
>>> or at least do not admit to it.  They specify "non-ECC memory".
>>> 
>>> Obviously there have to be connections between the memory controller and
>>> the memory for the extra bits.  Aside from a little extra time for the
>>> board designer to add a few traces to the wire list, this would not
>>> raise the cost of the board.  Despite this I have read that some boards
>>> lack the necessary traces.
>> 
>> I don't think the current APU parts support ECC.  My guess is that the
>> current APU sockets don't have the connections to support it.
> 
> The G-Series (such as the T40E used on the APU) doesn’t support ECC.
> 
> “Kabini” (“G-Series 2.0” aka GX-210 / GX-415/420) supports a single channel 
> of ECC ram.

Interesting ... it's been a while since I looked.  I think the primary
sockets at the time were FM1, FM2, and FM2+, and the mobile sockets, and
they didn't support ECC.

AM1 motherboard ECC support seems to be pretty lacking, though.




Re: ECC support

2015-09-15 Thread alex.burlyga.ietf alex.burlyga.ietf
On Tue, Sep 15, 2015 at 3:52 PM, Igor Mozolevsky  wrote:
> On 15 September 2015 at 23:34, Jim Thompson  wrote:
>
> 
>
>
>> I think you’ll find that the default for ‘scrub’ is off on most (perhaps
>> all) boards.  There are reasons, and these relate directly to
>> “significantly diminish system performance”, (above), as well as the
>> greatly increased RAM sizes in use today.
>>
>
> Perhaps I missed something- what point is it that you're trying to make? I
> was saying that scrubbing aims to remove errors at the source (cf. "on
> demand") and prevent multi-bit errors that become detectable but
> irrecoverable, or worse, undetectable. Get hit by a few of the latter two
> at "interesting" points and you'd wish that scrubbing were on!
>
> And seriously, ECC scrubbing is slow but ZFS (or even hardware RAID)
> scrubbing is lightning fast??! C'mon are we going for data integrity or
> speed here?!

If I remember correctly, enabling Patrol Scrub guarantees that each
address gets hit once per 24 hours. So on a 128GB system you are
generating maybe 1-2MiB/s of reads. I'd say it's a good trade-off if
you bothered to put ECC memory in.
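
The estimate is easy to check. Below is a rough sketch in Python, assuming the 24-hour full-pass period recalled above (actual patrol scrub intervals depend on the platform and BIOS settings); the function name is made up for illustration.

```python
def scrub_read_rate_mib_per_s(ram_gib: float, period_hours: float = 24.0) -> float:
    """Average read rate needed to touch every byte of RAM once per period."""
    total_mib = ram_gib * 1024          # GiB -> MiB
    period_s = period_hours * 3600      # hours -> seconds
    return total_mib / period_s


# 128 GiB scrubbed once every 24 hours (the period is an assumption)
rate = scrub_read_rate_mib_per_s(128)
print(f"{rate:.2f} MiB/s")  # prints "1.52 MiB/s"
```

That lands inside the 1-2MiB/s range quoted above, which is tiny compared to the memory bandwidth of a machine that size.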

>
>> ‘Scrub’ was popular about a decade ago, when DDR2 RAM was around $100/GB.
>> DDR3-1600 is about $6/GB today.
>>
>
> Yup- with a much higher density of smaller memory bits! ;-)
>
> --
> Igor M.
> ___
> freebsd-hack...@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"

Re: ECC support

2015-09-15 Thread Igor Mozolevsky
On 15 September 2015 at 23:34, Jim Thompson  wrote:




> I think you’ll find that the default for ‘scrub’ is off on most (perhaps
> all) boards.  There are reasons, and these relate directly to
> “significantly diminish system performance”, (above), as well as the
> greatly increased RAM sizes in use today.
>

Perhaps I missed something- what point is it that you're trying to make? I
was saying that scrubbing aims to remove errors at the source (cf. "on
demand") and prevent multi-bit errors that become detectable but
irrecoverable, or worse, undetectable. Get hit by a few of the latter two
at "interesting" points and you'd wish that scrubbing were on!

And seriously, ECC scrubbing is slow but ZFS (or even hardware RAID)
scrubbing is lightning fast??! C'mon are we going for data integrity or
speed here?!

> ‘Scrub’ was popular about a decade ago, when DDR2 RAM was around $100/GB.
> DDR3-1600 is about $6/GB today.
>

Yup- with a much higher density of smaller memory bits! ;-)

-- 
Igor M.

Re: ECC support

2015-09-15 Thread Jim Thompson

> On Sep 15, 2015, at 5:10 PM, Don Lewis  wrote:
> 
> On 15 Sep, Dieter BSD wrote:
>> Many of AMD's CPU/APU parts support ECC memory.  Not just the top of the
>> line parts, but also many of the less expensive, less power hungry parts.
>> However, many (most?) of the boards for these chips do not support ECC,
>> or at least do not admit to it.  They specify "non-ECC memory".
>> 
>> Obviously there have to be connections between the memory controller and
>> the memory for the extra bits.  Aside from a little extra time for the
>> board designer to add a few traces to the wire list, this would not
>> raise the cost of the board.  Despite this I have read that some boards
>> lack the necessary traces.
> 
> I don't think the current APU parts support ECC.  My guess is that the
> current APU sockets don't have the connections to support it.

The G-Series (such as the T40E used on the APU) doesn’t support ECC.

“Kabini” (“G-Series 2.0” aka GX-210 / GX-415/420) supports a single channel of
ECC RAM.

Honestly, at the densities used by some of these boards, ECC doesn’t make much
sense.  (Obviously, if you’re running a storage appliance, this position is
reversed.)



Re: ECC support

2015-09-15 Thread Jim Thompson

> On Sep 15, 2015, at 5:19 PM, Igor Mozolevsky wrote:
> 
> On 15 September 2015 at 22:52, Jim Thompson wrote:
> 
> 
> 
> Errors are corrected "on-the-fly," corrected data is almost never placed back 
> in memory. If the same corrupt data is read again, the correction process is 
> repeated. Replacing the data in memory would require processing overhead that 
> could accumulate and significantly diminish system performance. If the error 
> occurred because of random events and isn't a defect in the memory, the 
> memory address will be cleaned of the error when the data is overwritten with 
> other data.
> 
>  
> 
> Just to correct a small oversight- most (if not all?) boards have an option 
> to scrub ECC memory in the background so as to prevent single bit 
> (recoverable) errors from turning into double bit (irrecoverable but 
> detectable) errors ;-)

I think you’ll find that the default for ‘scrub’ is off on most (perhaps all) 
boards.  There are reasons, and these relate directly to “significantly 
diminish system performance”, (above), as well as the greatly increased RAM 
sizes in use today.

‘Scrub’ was popular about a decade ago, when DDR2 RAM was around $100/GB.
DDR3-1600 is about $6/GB today.

Jim



Re: ECC support

2015-09-15 Thread Igor Mozolevsky
On 15 September 2015 at 22:52, Jim Thompson  wrote:



> Errors are corrected "on-the-fly," corrected data is almost never placed
> back in memory. If the same corrupt data is read again, the correction
> process is repeated. Replacing the data in memory would require processing
> overhead that could accumulate and significantly diminish system
> performance. If the error occurred because of random events and isn't a
> defect in the memory, the memory address will be cleaned of the error when
> the data is overwritten with other data.
>



Just to correct a small oversight- most (if not all?) boards have an option
to scrub ECC memory in the background so as to prevent single bit
(recoverable) errors from turning into double bit (irrecoverable but
detectable) errors ;-)


-- 
Igor M.


Re: ECC support

2015-09-15 Thread Don Lewis
On 15 Sep, Dieter BSD wrote:
> Many of AMD's CPU/APU parts support ECC memory.  Not just the top of the
> line parts, but also many of the less expensive, less power hungry parts.
> However, many (most?) of the boards for these chips do not support ECC,
> or at least do not admit to it.  They specify "non-ECC memory".
> 
> Obviously there have to be connections between the memory controller and
> the memory for the extra bits.  Aside from a little extra time for the
> board designer to add a few traces to the wire list, this would not
> raise the cost of the board.  Despite this I have read that some boards
> lack the necessary traces.

I don't think the current APU parts support ECC.  My guess is that the
current APU sockets don't have the connections to support it.

I'm typing on a FreeBSD machine with an AMD CPU and ECC RAM.  I won't put
together a machine without ECC.  My experience is that many ASUS
motherboards support ECC RAM and usually document that fact.  Many
Gigabyte motherboards also support ECC RAM, but don't document it.  Even
if you look at the BIOS screenshots in the manual, you won't see the
knobs to configure ECC, I suspect because those knobs are not displayed
unless ECC RAM is installed.

> Does the firmware have to do anything to support ECC?  Program a few
> registers in the memory controller perhaps?  A few boards have FLOSS
> firmware available, so this code could be added, but most boards do not
> have firmware sources available.
> 
> Assuming that a board does have the necessary connections but
> the firmware does not have ECC support, is there some reason that
> ECC support could not be added to the OS instead of the firmware?
> I grepped through FreeBSD 8.2 and 10.1 sources but couldn't find
> anything that looked relevant.  Also did not find any code that
> reported ECC errors, other than one device.  Perhaps I missed it?

It's in there ...

> I've been running machines with ECC for 15-20 years and have never seen
> a report of an ECC error from either NetBSD or FreeBSD.  I have seen
> reports of ECC errors from Digital Unix.  And remember getting panics
> due to parity errors on machines before ECC.  So I'm thinking that
> the BSDs must ignore hardware reports of single bit ECC errors.  :-(

From daily mail to root about a month ago:

+MCA: Bank 4, Status 0x944a400096080a13
+MCA: Global Cap 0x0106, Status 0x
+MCA: Vendor "AuthenticAMD", ID 0x100f53, APIC ID 0
+MCA: CPU 0 COR BUSLG Responder RD Memory
+MCA: Address 0x213e98b10
+MCA: Bank 4, Status 0xd44a400096080a13
+MCA: Global Cap 0x0106, Status 0x
+MCA: Vendor "AuthenticAMD", ID 0x100f53, APIC ID 0
+MCA: CPU 0 COR OVER BUSLG Responder RD Memory
+MCA: Address 0x213e98b10
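
The two Status values above can be read by hand.  Below is a minimal sketch in Python, assuming the architectural MCi_STATUS flag layout as I understand it from the Intel and AMD manuals (VAL bit 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57).  The function name decode_mci_status is made up for illustration, and this is not how FreeBSD's own decoder is written.

```python
# Assumed architectural flag bits in the high end of MCi_STATUS.
MCI_STATUS_FLAGS = {
    63: "VAL",    # register holds valid error information
    62: "OVER",   # a previous error was still logged when this one arrived
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # MCi_MISC register is valid
    58: "ADDRV",  # MCi_ADDR register is valid
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flag names and the MCA error code."""
    flags = [name for bit, name in MCI_STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "corrected": "VAL" in flags and "UC" not in flags,
        "mca_error_code": status & 0xFFFF,  # low 16 bits
    }


for raw in (0x944A400096080A13, 0xD44A400096080A13):
    info = decode_mci_status(raw)
    print(hex(raw), info["flags"], "corrected" if info["corrected"] else "uncorrected")
```

Both records decode as valid, corrected errors with a valid address, which matches the "COR" in the log.  The second one also has OVER set, matching "COR OVER".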



Re: ECC support

2015-09-15 Thread Jim Thompson

ECC is implemented by a ‘hashing’ algorithm that works on eight (8) bytes (64 
bits) at a time, and places the result into an 8-bit ECC ‘word’.
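
To make the 64+8 layout concrete, here is a toy sketch in Python of a textbook extended Hamming (72,64) code: single-error correct, double-error detect.  This is an assumption for illustration only.  Real memory controllers use their own vendor-specific codes and do all of this in hardware, and the names encode/decode are made up.

```python
# Positions 1..71 form a Hamming code; position 0 holds an overall parity bit.
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]                        # 7 check bits
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 data bits


def _syndrome(code: int) -> int:
    """XOR of the positions (1..71) of all set bits; 0 for a clean codeword."""
    s = 0
    for pos in range(1, 72):
        if (code >> pos) & 1:
            s ^= pos
    return s


def encode(data: int) -> int:
    """Return a 72-bit codeword for a 64-bit data word."""
    assert 0 <= data < 1 << 64
    code = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            code |= 1 << pos
    # Set the check bits so that the syndrome of the codeword becomes 0.
    s = _syndrome(code)
    for p in PARITY_POS:
        if s & p:
            code |= 1 << p
    # Overall parity bit makes the total number of set bits even.
    if bin(code).count("1") % 2:
        code |= 1
    return code


def decode(code: int):
    """Return (data, status): status is 'ok', 'corrected' or 'uncorrectable'."""
    s = _syndrome(code)
    odd = bin(code).count("1") % 2 == 1
    if s == 0 and not odd:
        status = "ok"
    elif odd and s < 72:
        code ^= 1 << s          # single-bit error at position s (0 = parity bit)
        status = "corrected"
    else:
        return None, "uncorrectable"   # e.g. a double-bit error
    data = 0
    for i, pos in enumerate(DATA_POS):
        if (code >> pos) & 1:
            data |= 1 << i
    return data, status


word = 0x0123456789ABCDEF
cw = encode(word)
print(decode(cw ^ (1 << 37)))              # one flipped bit: corrected
print(decode(cw ^ (1 << 37) ^ (1 << 5)))   # two flipped bits: detected only
```

Note that decode() hands back corrected data without touching the stored codeword, which is the "on-the-fly" behaviour: nothing is written back unless something (a scrubber, or a later write) does it.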

Errors are corrected "on-the-fly," corrected data is almost never placed back 
in memory. If the same corrupt data is read again, the correction process is 
repeated. Replacing the data in memory would require processing overhead that 
could accumulate and significantly diminish system performance. If the error 
occurred because of random events and isn't a defect in the memory, the memory 
address will be cleaned of the error when the data is overwritten with other 
data.

In terms of expense, at a minimum, where you had 8 bytes to make up a memory 
system, you will now have 9 (to hold the extra 8 bits).  This means your 
memory, without the extra complexity of the controller, is 12.5% more 
expensive.   This isn’t a huge impact at 8GB, (you’ll need another 1GB of RAM), 
but at 1024GB you’ll need another 128GB, and that much ram still costs enough 
that your wallet won’t be happy.  
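
The arithmetic behind those numbers, as a small Python sketch.  The 64 data bits plus 8 check bits layout is the one described above; the function name is made up for illustration.

```python
def ecc_extra_gib(data_gib: float, data_bits: int = 64, check_bits: int = 8) -> float:
    """Extra DRAM capacity needed to store the check bits for data_gib of data."""
    return data_gib * check_bits / data_bits


overhead_pct = 100 * 8 / 64   # 12.5
for size in (8, 1024):
    print(f"{size} GB of data needs {ecc_extra_gib(size):g} GB extra ({overhead_pct}%)")
```

The percentage stays fixed at 12.5%; only the absolute amount of extra RAM (and so the bill) grows with the size of the system.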

The memory controller has to be able to run the ECC algorithm on every read,
*and* supply the corrected data as needed, within the cycle time of the read.
If you involve software in this path, the performance of your machine will be
glacial.

Yes, the firmware has to program the memory controller.  “Program a few
registers” is all you need, except that the MRC setup on Intel and AMD is both
complex and proprietary.  Good luck getting the details for this.  This is
“Intel Red Book” territory, and you’ll need to be an employee with a need to
know.  The MRC setup code is a binary blob for otherwise open source boot
firmware such as Coreboot.

Others have answered (in the positive) about the OS reporting ECC errors on 
FreeBSD.

Jim

> On Sep 15, 2015, at 3:53 PM, Dieter BSD  wrote:
> 
> Many of AMD's CPU/APU parts support ECC memory.  Not just the top of the
> line parts, but also many of the less expensive, less power hungry parts.
> However, many (most?) of the boards for these chips do not support ECC,
> or at least do not admit to it.  They specify "non-ECC memory".
> 
> Obviously there have to be connections between the memory controller and
> the memory for the extra bits.  Aside from a little extra time for the
> board designer to add a few traces to the wire list, this would not
> raise the cost of the board.  Despite this I have read that some boards
> lack the necessary traces.
> 
> Does the firmware have to do anything to support ECC?  Program a few
> registers in the memory controller perhaps?  A few boards have FLOSS
> firmware available, so this code could be added, but most boards do not
> have firmware sources available.
> 
> Assuming that a board does have the necessary connections but
> the firmware does not have ECC support, is there some reason that
> ECC support could not be added to the OS instead of the firmware?
> I grepped through FreeBSD 8.2 and 10.1 sources but couldn't find
> anything that looked relevant.  Also did not find any code that
> reported ECC errors, other than one device.  Perhaps I missed it?
> 
> I've been running machines with ECC for 15-20 years and have never seen
> a report of an ECC error from either NetBSD or FreeBSD.  I have seen
> reports of ECC errors from Digital Unix.  And remember getting panics
> due to parity errors on machines before ECC.  So I'm thinking that
> the BSDs must ignore hardware reports of single bit ECC errors.  :-(


Re: ECC support

2015-09-15 Thread Andriy Gapon
On 15/09/2015 23:53, Dieter BSD wrote:
> Assuming that a board does have the necessary connections but
> the firmware does not have ECC support, is there some reason that
> ECC support could not be added to the OS instead of the firmware?

Yes, there is.  The memory controller is programmed by the code that runs from
ROM and uses no RAM (or the CPU cache is used as the RAM).  Once the real RAM
gets used it's too late to reprogram the DRAM controller.  This is true at least
for most or all of the modern day x86 hardware.

-- 
Andriy Gapon


Re: ECC support

2015-09-15 Thread Xin Li
On 09/15/15 13:53, Dieter BSD wrote:
> I've been running machines with ECC for 15-20 years and have never seen
> a report of an ECC error from either NetBSD or FreeBSD.  I have seen
> reports of ECC errors from Digital Unix.  And remember getting panics
> due to parity errors on machines before ECC.  So I'm thinking that
> the BSDs must ignore hardware reports of single bit ECC errors.  :-(

I'm not sure about NetBSD, but FreeBSD supports reporting ECC errors (via
Machine Check Architecture, added in 2009), and yes, we have seen it in
the field.

Cheers,
-- 
Xin LI https://www.delphij.net/
FreeBSD - The Power to Serve!   Live free or die





ECC support

2015-09-15 Thread Dieter BSD
Many of AMD's CPU/APU parts support ECC memory.  Not just the top of the
line parts, but also many of the less expensive, less power hungry parts.
However, many (most?) of the boards for these chips do not support ECC,
or at least do not admit to it.  They specify "non-ECC memory".

Obviously there have to be connections between the memory controller and
the memory for the extra bits.  Aside from a little extra time for the
board designer to add a few traces to the wire list, this would not
raise the cost of the board.  Despite this I have read that some boards
lack the necessary traces.

Does the firmware have to do anything to support ECC?  Program a few
registers in the memory controller perhaps?  A few boards have FLOSS
firmware available, so this code could be added, but most boards do not
have firmware sources available.

Assuming that a board does have the necessary connections but
the firmware does not have ECC support, is there some reason that
ECC support could not be added to the OS instead of the firmware?
I grepped through FreeBSD 8.2 and 10.1 sources but couldn't find
anything that looked relevant.  Also did not find any code that
reported ECC errors, other than one device.  Perhaps I missed it?

I've been running machines with ECC for 15-20 years and have never seen
a report of an ECC error from either NetBSD or FreeBSD.  I have seen
reports of ECC errors from Digital Unix.  And remember getting panics
due to parity errors on machines before ECC.  So I'm thinking that
the BSDs must ignore hardware reports of single bit ECC errors.  :-(