Re: ECC support

2015-09-16 Thread Bob Bishop

> On 16 Sep 2015, at 12:52, Igor Mozolevsky  wrote:
> 
> On 16 September 2015 at 12:34, Bob Bishop  wrote:
> 
> 
> 
> 
>> "The best we can conclude therefore is that any chip size effect is
>> unlikely to dominate error rates given that the trends are not consistent
>> across various other confounders such as age and manufacturer."
>> 
>> I’ll admit to talking that point up a bit but it is counterintuitive.
>> Memory designers have always been scared of cosmic rays etc but the
>> suspected effects simply have not been noticeable. Most likely as they
>> shrink features ever smaller, other factors like material purity dominate.
>> 
> 
> I saw that after I posted, and had a long ponder as to why it would be so.
> The only thing I could think of is that the fab process was(/is?) large
> enough to not worry about "nonsense" like cosmic rays  (but then I've not
> had much exposure to semi-conductor electronics theory since late 90s).
> Perhaps we're at a point where the fab process can't really shrink much
> more with DRAM due to the underlying tech (effectively many tiny RC
> circuits), which is the reason the manufacturers just stack ranks to get
> more capacity per DIMM instead of packing more in a single chip?..

Dunno. I’ll ask my tame semiconductor expert when I see him tomorrow...

> -- 
> Igor M.
> ___
> freebsd-hardware@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-hardware
> To unsubscribe, send any mail to "freebsd-hardware-unsubscr...@freebsd.org"

--
Bob Bishop
r...@gid.co.uk




___
freebsd-hardware@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-hardware
To unsubscribe, send any mail to "freebsd-hardware-unsubscr...@freebsd.org"

Re: ECC support

2015-09-16 Thread Igor Mozolevsky
On 16 September 2015 at 08:51, Bob Bishop  wrote:



> - You might think that as memory density increases (ie bit cell size
> shrinks), error rates would increase. Apparently this wasn’t so up to 2009
> at least, see:
>
>  http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

subsection 5.1:

"… Figure 6 indicates a trend towards worse error behavior
for increased capacities, although this trend is not consistent.
While in some cases the doubling of capacity has a
clear negative effect (factors larger than 1 in the graph),
in others it has hardly any effect (factor close to 1 in the
graph). For example, for Platform A -Mfg1 and Platform F -
Mfg1 doubling the capacity increases uncorrectable errors,
but not correctable errors. Conversely, for Platform D -
Mfg6 doubling the capacity affects correctable errors, but
not uncorrectable error."


There are also other environmental factors which would be more apparent in
"lone-server" configuration vs well maintained and insulated data centres
with very good power conditioning ;-)


-- 
Igor M.

Re: ECC support

2015-09-16 Thread Igor Mozolevsky
On 16 September 2015 at 12:34, Bob Bishop  wrote:




> "The best we can conclude therefore is that any chip size effect is
> unlikely to dominate error rates given that the trends are not consistent
> across various other confounders such as age and manufacturer."
>
> I’ll admit to talking that point up a bit but it is counterintuitive.
> Memory designers have always been scared of cosmic rays etc but the
> suspected effects simply have not been noticeable. Most likely as they
> shrink features ever smaller, other factors like material purity dominate.
>

I saw that after I posted, and had a long ponder as to why it would be so.
The only thing I could think of is that the fab process was(/is?) large
enough to not worry about "nonsense" like cosmic rays  (but then I've not
had much exposure to semi-conductor electronics theory since late 90s).
Perhaps we're at a point where the fab process can't really shrink much
more with DRAM due to the underlying tech (effectively many tiny RC
circuits), which is the reason the manufacturers just stack ranks to get
more capacity per DIMM instead of packing more in a single chip?..


-- 
Igor M.

Re: ECC support

2015-09-16 Thread Bob Bishop
Hi,

> On 16 Sep 2015, at 11:48, Igor Mozolevsky  wrote:
> 
> On 16 September 2015 at 08:51, Bob Bishop  wrote:
> 
> 
> 
>> - You might think that as memory density increases (ie bit cell size
>> shrinks), error rates would increase. Apparently this wasn’t so up to 2009
>> at least, see:
>> 
>> http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
> 
> subsection 5.1:
> 
> "… Figure 6 indicates a trend towards worse error behavior
> for increased capacities, although this trend is not consistent. [etc]

That’s talking about DIMM capacity, not the capacity (density) of individual
chips, on which they say (at the end of the same subsection):

"The best we can conclude therefore is that any chip size effect is unlikely to 
dominate error rates given that the trends are not consistent across various 
other confounders such as age and manufacturer."

I’ll admit to talking that point up a bit but it is counterintuitive. Memory 
designers have always been scared of cosmic rays etc but the suspected effects 
simply have not been noticeable. Most likely as they shrink features ever 
smaller, other factors like material purity dominate.

> There are also other environmental factors which would be more apparent in
> "lone-server" configuration vs well maintained and insulated data centres
> with very good power conditioning ;-)

Indeed, and that’s a whole other PITA. We went to colo and never looked back, 
but low-power options for small servers are getting better.

> -- 
> Igor M.
> ___
> freebsd-hardware@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-hardware
> To unsubscribe, send any mail to "freebsd-hardware-unsubscr...@freebsd.org"

--
Bob Bishop
r...@gid.co.uk





Re: ECC support

2015-09-16 Thread Don Lewis
On 16 Sep, Dieter BSD wrote:
> Andriy:
>>> Assuming that a board does have the necessary connections but
>>> the firmware does not have ECC support, is there some reason that
>>> ECC support could not be added to the OS instead of the firmware?
>>
>> Yes, there is.  The memory controller is programmed by the code that
>> runs from ROM and uses no RAM (or the CPU cache is used as the RAM).
>> Once the real RAM gets used it's too late to reprogram the DRAM controller.
> 
> Perhaps one of the several bootloader stages could get itself into
> CPU cache, program the memory controller, then load and execute the
> next stage or the OS?
> 
> Jim:
>> Replacing the data in memory would require processing overhead
>> that could accumulate and significantly diminish system performance.
> 
> If it only replaces data when there is a correctable error,
> and the errors are occasional soft errors, the effect on
> performance should be minimal.  If there is a hard error,
> you would want to replace the defective memory before you get
> an additional error and it becomes uncorrectable.
> 
>> If the error occurred because of random events and isn't a defect in
>> the memory, the memory address will be cleaned of the error when the
>> data is overwritten with other data.
> 
> If and when new data gets written to that location.  If that location
> contains info that never changes, such as kernel text, the bad bit will
> never get fixed.
> 
>> memory, without the extra complexity of the controller, is 12.5% more
>> expensive.   This isn't a huge impact at 8GB, (you'll need
>> another 1GB of RAM), but at 1024GB you'll need another 128GB,
>> and that much ram still costs enough that your wallet won't be happy.
> 
> It is 12.5% in both cases.  How much does it cost to have undetected
> errors in your data?  How much does it cost when an Interstate
> bridge collapses?  How much does it cost when one of NASA's missions
> fails?  How much does it cost when your pharmacy receives a
> prescription with an error in the dose?
> 
>> the MRC setup on Intel and AMD is both complex and proprietary
> 
> One wonders why the secrecy.  AMD has been much more open than many
> (most?) chipmakers.  They even forced the ATI people to document
> how to program their chips.  I don't see a lot of companies popping up
> making competing chips.  #include standard joke: "How do you make a small
> fortune in chipmaking?  Start with a very large fortune."  I can't
> see what secret would be revealed by saying "set bit 7 of register 4
> to 1 to enable ECC".

AMD documents a lot of this stuff in the BIOS and Kernel Developer's
Guide (BKDG) for each CPU family.

>> Intel Red Book
> 
> So the secret books are red this week, yawn.  I remember the nightmare
> of the merced orange books and the brain damaged "features" the chips had.
> Not recommended.  I'm interested in chips that work correctly, hence the
> interest in ECC and AMD.  Looked for ARM boards with ECC but didn't find
> any.  Is the Sparc stuff any more reliable than it used to be?  Other
> arch choices?

Supermicro has some Atom motherboards with ECC support.

>> The MRC setup code is a binary blob for otherwise open source boot
>> firmware such as Coreboot.
> 
> So the libreboot people are forced to work on reverse engineering
> these blobs?  :-(
> 
> Don:
>> I don't think the current APU parts support ECC.
> 
> According to wikipedia, socket FM2+ does not support ECC. :-(
> Kabini has support for ECC.  And Berlin, (and I assume Toronto) but
> word is that Berlin and Toronto are basically dead. :-(
> I think Carrizo and Turion are supposed to support ECC?  There really
> ought to be a list of which CPUs/APUs/sockets/boards do or do not
> support ECC.

Socket AM1 (Kabini) is supposed to support ECC, but motherboards with
this socket that actually support ECC are another story.

>> My experience is that many ASUS motherboards support ECC RAM and
>> usually document that fact.  Also many Gigabyte mother boards also
>> support ECC RAM, but don't document it.
> 
> From what I've been reading, both Asus and Gigabyte make good boards.
> I've seen reviews that complained about Gigabyte's firmware.
> http://www.xbitlabs.com/articles/mainboards/display/gigabyte-ga-990fxa-ud5_8.html
> I've also seen claims that the firmware bricked boards.
> Reviewers like Asus' firmware.  I've seen complaints about Asus's support,
> and their website has significant problems.

I've got one of the Gigabyte GA_990FXA-UD5 boards.  I actually like the
BIOS.  I'm not trying to overclock, but it does have lots of ECC-related
knobs.  I think you can even tell it to gang the two memory controller
channels so that you can enable Chipkill.  The latter isn't as good as
it sounds because it really only works properly with DIMMs that use x4
DRAM chips, and there don't seem to be any unbuffered versions of those.
The only unbuffered DDR3 DIMMS I've found use x8 DRAM chips.  In that
case if a multiple bits coming out of the chip are 

Re: ECC support

2015-09-16 Thread Dieter BSD
Andriy:
>> Assuming that a board does have the necessary connections but
>> the firmware does not have ECC support, is there some reason that
>> ECC support could not be added to the OS instead of the firmware?
>
> Yes, there is.  The memory controller is programmed by the code that
> runs from ROM and uses no RAM (or the CPU cache is used as the RAM).
> Once the real RAM gets used it's too late to reprogram the DRAM controller.

Perhaps one of the several bootloader stages could get itself into
CPU cache, program the memory controller, then load and execute the
next stage or the OS?

Jim:
> Replacing the data in memory would require processing overhead
> that could accumulate and significantly diminish system performance.

If it only replaces data when there is a correctable error,
and the errors are occasional soft errors, the effect on
performance should be minimal.  If there is a hard error,
you would want to replace the defective memory before you get
an additional error and it becomes uncorrectable.

> If the error occurred because of random events and isn't a defect in
> the memory, the memory address will be cleaned of the error when the
> data is overwritten with other data.

If and when new data gets written to that location.  If that location
contains info that never changes, such as kernel text, the bad bit will
never get fixed.
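
A toy sketch of why write-back matters, assuming a Hamming(7,4) code plus
an overall parity bit as a small stand-in for the SECDED scheme that ECC
memory typically applies to 64-bit words. The names encode() and decode()
are made up for illustration; this is not how any particular memory
controller is programmed.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED word (list of bits).

    Positions 1, 2, 4 hold Hamming parity; 3, 5, 6, 7 hold data;
    position 0 holds overall parity across the whole word.
    """
    c = [0] * 8
    c[3] = (nibble >> 0) & 1
    c[5] = (nibble >> 1) & 1
    c[6] = (nibble >> 2) & 1
    c[7] = (nibble >> 3) & 1
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    return c


def decode(word):
    """Return (nibble, status, fixed_word) without modifying `word`."""
    syndrome = 0
    overall = 0
    for pos, bit in enumerate(word):
        overall ^= bit
        if bit and pos:
            syndrome ^= pos  # XOR of the positions of all set bits
    fixed = list(word)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detected only.
        status = "uncorrectable"
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return nibble, status, fixed


if __name__ == "__main__":
    memory = encode(0b1011)
    memory[5] ^= 1                      # first soft error
    value, status, fixed = decode(memory)
    print(value == 0b1011, status)      # read is corrected on the fly

    memory[6] ^= 1                      # second error, no write-back happened
    print(decode(memory)[1])

    fixed[6] ^= 1                       # same second error after write-back
    print(decode(fixed)[1])
```

If the corrected word is never written back, the first flipped bit stays
in memory and a second flip in the same word can only be detected, not
corrected. Writing the corrected word back (scrubbing) restores the margin,
which is the point about never-rewritten locations such as kernel text.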

> memory, without the extra complexity of the controller, is 12.5% more
> expensive.   This isn't a huge impact at 8GB, (you'll need
> another 1GB of RAM), but at 1024GB you'll need another 128GB,
> and that much ram still costs enough that your wallet won't be happy.

It is 12.5% in both cases.  How much does it cost to have undetected
errors in your data?  How much does it cost when an Interstate
bridge collapses?  How much does it cost when one of NASA's missions
fails?  How much does it cost when your pharmacy receives a
prescription with an error in the dose?
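
The arithmetic behind the 12.5% figure, as a small sketch. It assumes the
usual 72-bit-wide ECC DIMM layout (64 data bits plus 8 check bits per
word); the constant and function names are illustrative only.

```python
DATA_BITS = 64   # assumed data bits per ECC word
CHECK_BITS = 8   # assumed check bits per ECC word (72-bit-wide DIMM)


def overhead_fraction():
    """Extra storage needed for check bits, as a fraction of usable data."""
    return CHECK_BITS / DATA_BITS


def ecc_overhead_gb(usable_gb):
    """Extra GB of DRAM needed to protect `usable_gb` of data."""
    return usable_gb * overhead_fraction()


if __name__ == "__main__":
    for usable in (8, 1024):
        print(usable, "GB usable needs", ecc_overhead_gb(usable), "GB extra")
```

The ratio is the same at every capacity; only the absolute amount of extra
RAM grows with the size of the system.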

> the MRC setup on Intel and AMD is both complex and proprietary

One wonders why the secrecy.  AMD has been much more open than many
(most?) chipmakers.  They even forced the ATI people to document
how to program their chips.  I don't see a lot of companies popping up
making competing chips.  #include standard joke: "How do you make a small
fortune in chipmaking?  Start with a very large fortune."  I can't
see what secret would be revealed by saying "set bit 7 of register 4
to 1 to enable ECC".

> Intel Red Book

So the secret books are red this week, yawn.  I remember the nightmare
of the merced orange books and the brain damaged "features" the chips had.
Not recommended.  I'm interested in chips that work correctly, hence the
interest in ECC and AMD.  Looked for ARM boards with ECC but didn't find
any.  Is the Sparc stuff any more reliable than it used to be?  Other
arch choices?

> The MRC setup code is a binary blob for otherwise open source boot
> firmware such as Coreboot.

So the libreboot people are forced to work on reverse engineering
these blobs?  :-(

Don:
> I don't think the current APU parts support ECC.

According to wikipedia, socket FM2+ does not support ECC. :-(
Kabini has support for ECC.  And Berlin, (and I assume Toronto) but
word is that Berlin and Toronto are basically dead. :-(
I think Carrizo and Turion are supposed to support ECC?  There really
ought to be a list of which CPUs/APUs/sockets/boards do or do not
support ECC.

> My experience is that many ASUS motherboards support ECC RAM and
> usually document that fact.  Also many Gigabyte mother boards also
> support ECC RAM, but don't document it.

From what I've been reading, both Asus and Gigabyte make good boards.
I've seen reviews that complained about Gigabyte's firmware.
http://www.xbitlabs.com/articles/mainboards/display/gigabyte-ga-990fxa-ud5_8.html
I've also seen claims that the firmware bricked boards.
Reviewers like Asus' firmware.  I've seen complaints about Asus's support,
and their website has significant problems.

The firmware on my Tyan board is crap, and they refused to tell me
how much power it needs.  Which means I don't know how much other stuff
I can run from the same P/S.  It should have *way* more power than needed,
but experience says "not enough", so I added a 2nd p/s for the disk farm
and suddenly had fewer problems.  The 2 p/s setup does allow powercycling
the mainboard (because of the crappy firmware) without powercycling the disks.

Given my experience with the Tyan board, and the apparent lack of
FLOSS firmware for recent boards, I'm not real excited about the
Gigabyte boards.  Asus has a couple of AM3+ boards that I could
probably live with, if their website actually had things like
lists of exactly which CPUs and memory are approved, and firmware
updates, ... But there are also applications that could use a lower wattage
solution.

Anyone have opinions on other mainboard companies?  ECS?  Asrock?
MSI?  Zotac?  Others?

Don:
> +MCA: Bank 4, Status 0x944a400096080a13
> +MCA: Global Cap