Re: ECC support
> On 16 Sep 2015, at 12:52, Igor Mozolevsky wrote:
>
> On 16 September 2015 at 12:34, Bob Bishop wrote:
>
>> “The best we can conclude therefore is that any chip size effect is
>> unlikely to dominate error rates given that the trends are not consistent
>> across various other confounders such as age and manufacturer.”
>>
>> I’ll admit to talking that point up a bit but it is counterintuitive.
>> Memory designers have always been scared of cosmic rays etc but the
>> suspected effects simply have not been noticeable. Most likely as they
>> shrink features ever smaller, other factors like material purity dominate.
>
> I saw that after I posted, and had a long ponder as to why it would be so.
> The only thing I could think of is that the fab process was(/is?) large
> enough to not worry about "nonsense" like cosmic rays (but then I've not
> had much exposure to semi-conductor electronics theory since late 90s).
> Perhaps we're at a point where the fab process can't really shrink much
> more with DRAM due to the underlying tech (effectively many tiny RC
> circuits), which is the reason the manufacturers just stack ranks to get
> more capacity per DIMM instead of packing more in a single chip?..

Dunno. I’ll ask my tame semiconductor expert when I see him tomorrow...

--
Bob Bishop
r...@gid.co.uk
___
freebsd-hardware@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-hardware
To unsubscribe, send any mail to "freebsd-hardware-unsubscr...@freebsd.org"
Re: ECC support
On 16 September 2015 at 08:51, Bob Bishop wrote:

> - You might think that as memory density increases (ie bit cell size
> shrinks), error rates would increase. Apparently this wasn’t so up to 2009
> at least, see:
>
> http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

subsection 5.1:

"… Figure 6 indicates a trend towards worse error behavior
for increased capacities, although this trend is not consistent.
While in some cases the doubling of capacity has a clear negative
effect (factors larger than 1 in the graph), in others it has hardly
any effect (factor close to 1 in the graph). For example, for
Platform A - Mfg1 and Platform F - Mfg1 doubling the capacity
increases uncorrectable errors, but not correctable errors.
Conversely, for Platform D - Mfg6 doubling the capacity affects
correctable errors, but not uncorrectable error."

There are also other environmental factors which would be more apparent in
"lone-server" configuration vs well maintained and insulated data centres
with very good power conditioning ;-)

--
Igor M.
Re: ECC support
On 16 September 2015 at 12:34, Bob Bishop wrote:

> “The best we can conclude therefore is that any chip size effect is
> unlikely to dominate error rates given that the trends are not consistent
> across various other confounders such as age and manufacturer.”
>
> I’ll admit to talking that point up a bit but it is counterintuitive.
> Memory designers have always been scared of cosmic rays etc but the
> suspected effects simply have not been noticeable. Most likely as they
> shrink features ever smaller, other factors like material purity dominate.

I saw that after I posted, and had a long ponder as to why it would be so.
The only thing I could think of is that the fab process was(/is?) large
enough to not worry about "nonsense" like cosmic rays (but then I've not
had much exposure to semi-conductor electronics theory since late 90s).
Perhaps we're at a point where the fab process can't really shrink much
more with DRAM due to the underlying tech (effectively many tiny RC
circuits), which is the reason the manufacturers just stack ranks to get
more capacity per DIMM instead of packing more in a single chip?..

--
Igor M.
Re: ECC support
Hi,

> On 16 Sep 2015, at 11:48, Igor Mozolevsky wrote:
>
> On 16 September 2015 at 08:51, Bob Bishop wrote:
>
>> - You might think that as memory density increases (ie bit cell size
>> shrinks), error rates would increase. Apparently this wasn’t so up to 2009
>> at least, see:
>>
>> http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
>
> subsection 5.1:
>
> "… Figure 6 indicates a trend towards worse error behavior
> for increased capacities, although this trend is not consistent. [etc]

That’s talking about DIMM capacity, not the capacity (density) of
individual chips, on which they say (at the end of the same subsection):

“The best we can conclude therefore is that any chip size effect is
unlikely to dominate error rates given that the trends are not consistent
across various other confounders such as age and manufacturer.”

I’ll admit to talking that point up a bit but it is counterintuitive.
Memory designers have always been scared of cosmic rays etc but the
suspected effects simply have not been noticeable. Most likely as they
shrink features ever smaller, other factors like material purity dominate.

> There are also other environmental factors which would be more apparent in
> "lone-server" configuration vs well maintained and insulated data centres
> with very good power conditioning ;-)

Indeed, and that’s a whole other PITA. We went to colo and never looked
back, but low-power options for small servers are getting better.

--
Bob Bishop
r...@gid.co.uk
Re: ECC support
On 16 Sep, Dieter BSD wrote:
> Andriy:
>>> Assuming that a board does have the necessary connections but
>>> the firmware does not have ECC support, is there some reason that
>>> ECC support could not be added to the OS instead of the firmware?
>>
>> Yes, there is. The memory controller is programmed by the code that
>> runs from ROM and uses no RAM (or the CPU cache is used as the RAM).
>> Once the real RAM gets used it's too late to reprogram the DRAM controller.
>
> Perhaps one of the several bootloader stages could get itself into
> CPU cache, program the memory controller, then load and execute the
> next stage or the OS?
>
> Jim:
>> Replacing the data in memory would require processing overhead
>> that could accumulate and significantly diminish system performance.
>
> If it only replaces data when there is a correctable error,
> and the errors are occasional soft errors, the effect on
> performance should be minimal. If there is a hard error,
> you would want to replace the defective memory before you get
> an additional error and it becomes uncorrectable.
>
>> If the error occurred because of random events and isn't a defect in
>> the memory, the memory address will be cleaned of the error when the
>> data is overwritten with other data.
>
> If and when new data gets written to that location. If that location
> contains info that never changes, such as kernel text, the bad bit will
> never get fixed.
>
>> memory, without the extra complexity of the controller, is 12.5% more
>> expensive. This isn't a huge impact at 8GB (you'll need
>> another 1GB of RAM), but at 1024GB you'll need another 128GB,
>> and that much RAM still costs enough that your wallet won't be happy.
>
> It is 12.5% in both cases. How much does it cost to have undetected
> errors in your data? How much does it cost when an Interstate
> bridge collapses? How much does it cost when one of NASA's missions
> fails?
> How much does it cost when your pharmacy receives a
> prescription with an error in the dose?
>
>> the MRC setup on Intel and AMD is both complex and proprietary
>
> One wonders why the secrecy. AMD has been much more open than many
> (most?) chipmakers. They even forced the ATI people to document
> how to program their chips. I don't see a lot of companies popping up
> making competing chips. #include standard joke: "How do you make a small
> fortune in chipmaking? Start with a very large fortune." I can't
> see what secret would be revealed by saying "set bit 7 of register 4
> to 1 to enable ECC".

AMD documents a lot of this stuff in the BIOS and Kernel Developer's
Guide (BKDG) for each CPU family.

>> Intel Red Book
>
> So the secret books are red this week, yawn. I remember the nightmare
> of the merced orange books and the brain damaged "features" the chips had.
> Not recommended. I'm interested in chips that work correctly, hence the
> interest in ECC and AMD. Looked for ARM boards with ECC but didn't find
> any. Is the Sparc stuff any more reliable than it used to be? Other
> arch choices?

Supermicro has some Atom motherboards with ECC support.

>> The MRC setup code is a binary blob for otherwise open source boot
>> firmware such as Coreboot.
>
> So the libreboot people are forced to work on reverse engineering
> these blobs? :-(
>
> Don:
>> I don't think the current APU parts support ECC.
>
> According to wikipedia, socket FM2+ does not support ECC. :-(
> Kabini has support for ECC. And Berlin, (and I assume Toronto) but
> word is that Berlin and Toronto are basically dead. :-(
> I think Carrizo and Turion are supposed to support ECC? There really
> ought to be a list of which CPUs/APUs/sockets/boards do or do not
> support ECC.

Socket AM1 (Kabini) is supposed to support ECC, but motherboards with
this socket that support ECC are another story.

>> My experience is that many ASUS motherboards support ECC RAM and
>> usually document that fact.
>> Also many Gigabyte mother boards also
>> support ECC RAM, but don't document it.
>
> From what I've been reading, both Asus and Gigabyte make good boards.
> I've seen reviews that complained about Gigabyte's firmware.
> http://www.xbitlabs.com/articles/mainboards/display/gigabyte-ga-990fxa-ud5_8.html
> I've also seen claims that the firmware bricked boards.
> Reviewers like Asus' firmware. I've seen complaints about Asus's support,
> and their website has significant problems.

I've got one of the Gigabyte GA-990FXA-UD5 boards. I actually like the
BIOS. I'm not trying to overclock, but it does have lots of ECC-related
knobs. I think you can even tell it to gang the two memory controller
channels so that you can enable Chipkill. The latter isn't as good as
it sounds because it really only works properly with DIMMs that use x4
DRAM chips, and there don't seem to be any unbuffered versions of those.
The only unbuffered DDR3 DIMMs I've found use x8 DRAM chips. In that
case if multiple bits coming out of the chip are
Re: ECC support
Andriy:
>> Assuming that a board does have the necessary connections but
>> the firmware does not have ECC support, is there some reason that
>> ECC support could not be added to the OS instead of the firmware?
>
> Yes, there is. The memory controller is programmed by the code that
> runs from ROM and uses no RAM (or the CPU cache is used as the RAM).
> Once the real RAM gets used it's too late to reprogram the DRAM controller.

Perhaps one of the several bootloader stages could get itself into
CPU cache, program the memory controller, then load and execute the
next stage or the OS?

Jim:
> Replacing the data in memory would require processing overhead
> that could accumulate and significantly diminish system performance.

If it only replaces data when there is a correctable error,
and the errors are occasional soft errors, the effect on
performance should be minimal. If there is a hard error,
you would want to replace the defective memory before you get
an additional error and it becomes uncorrectable.

> If the error occurred because of random events and isn't a defect in
> the memory, the memory address will be cleaned of the error when the
> data is overwritten with other data.

If and when new data gets written to that location. If that location
contains info that never changes, such as kernel text, the bad bit will
never get fixed.

> memory, without the extra complexity of the controller, is 12.5% more
> expensive. This isn't a huge impact at 8GB (you'll need
> another 1GB of RAM), but at 1024GB you'll need another 128GB,
> and that much RAM still costs enough that your wallet won't be happy.

It is 12.5% in both cases. How much does it cost to have undetected
errors in your data? How much does it cost when an Interstate
bridge collapses? How much does it cost when one of NASA's missions
fails? How much does it cost when your pharmacy receives a
prescription with an error in the dose?
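
For anyone wondering where the 12.5% comes from: a minimal sketch, assuming the usual SECDED arrangement (a Hamming code plus one overall parity bit over a 64-bit data word, i.e. 72-bit-wide ECC DIMMs). The thread doesn't spell that out, so treat it as an assumption; the function names are made up for illustration.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming SECDED code over `data_bits` data bits."""
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1.
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1


def overhead_fraction(data_bits: int = 64) -> float:
    """Extra storage needed for check bits, as a fraction of the data."""
    return secded_check_bits(data_bits) / data_bits


def extra_capacity_gb(usable_gb: float, data_bits: int = 64) -> float:
    """Raw capacity spent on check bits for a given usable capacity."""
    return usable_gb * overhead_fraction(data_bits)


if __name__ == "__main__":
    for usable in (8, 1024):
        print(usable, "GB usable ->", extra_capacity_gb(usable), "GB of check bits")
```

With a 64-bit word this gives 8 check bits, so 8/64 = 12.5% regardless of size, which matches the 1GB and 128GB figures quoted above. Narrower words would cost proportionally more under the same scheme.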
> the MRC setup on Intel and AMD is both complex and proprietary

One wonders why the secrecy. AMD has been much more open than many
(most?) chipmakers. They even forced the ATI people to document
how to program their chips. I don't see a lot of companies popping up
making competing chips. #include standard joke: "How do you make a small
fortune in chipmaking? Start with a very large fortune." I can't
see what secret would be revealed by saying "set bit 7 of register 4
to 1 to enable ECC".

> Intel Red Book

So the secret books are red this week, yawn. I remember the nightmare
of the merced orange books and the brain damaged "features" the chips had.
Not recommended. I'm interested in chips that work correctly, hence the
interest in ECC and AMD. Looked for ARM boards with ECC but didn't find
any. Is the Sparc stuff any more reliable than it used to be? Other
arch choices?

> The MRC setup code is a binary blob for otherwise open source boot
> firmware such as Coreboot.

So the libreboot people are forced to work on reverse engineering
these blobs? :-(

Don:
> I don't think the current APU parts support ECC.

According to wikipedia, socket FM2+ does not support ECC. :-(
Kabini has support for ECC. And Berlin, (and I assume Toronto) but
word is that Berlin and Toronto are basically dead. :-(
I think Carrizo and Turion are supposed to support ECC? There really
ought to be a list of which CPUs/APUs/sockets/boards do or do not
support ECC.

> My experience is that many ASUS motherboards support ECC RAM and
> usually document that fact. Also many Gigabyte mother boards also
> support ECC RAM, but don't document it.

From what I've been reading, both Asus and Gigabyte make good boards.
I've seen reviews that complained about Gigabyte's firmware.
http://www.xbitlabs.com/articles/mainboards/display/gigabyte-ga-990fxa-ud5_8.html
I've also seen claims that the firmware bricked boards.
Reviewers like Asus' firmware.
I've seen complaints about Asus's support,
and their website has significant problems.

The firmware on my Tyan board is crap, and they refused to tell me how
much power it needs. Which means I don't know how much other stuff I
can run from the same P/S. It should have *way* more power than needed,
but experience says "not enough", so I added a 2nd p/s for the disk farm
and suddenly had fewer problems. The 2 p/s setup does allow powercycling
the mainboard (because of the crappy firmware) without powercycling the
disks.

Given my experience with the Tyan board, and the apparent lack of FLOSS
firmware for recent boards, I'm not real excited about the Gigabyte
boards. Asus has a couple of AM3+ boards that I could probably live
with, if their website actually had things like lists of exactly which
CPUs and memory are approved, and firmware updates, ... But there are
also applications that could use a lower wattage solution.

Anyone have opinions on other mainboard companies? ECS? Asrock? MSI?
Zotac? Others?

Don:
> +MCA: Bank 4, Status 0x944a400096080a13
> +MCA: Global Cap