On 20.02.2023 at 18:42, krys...@ibse.cz wrote:
> Dear Debian community,
> we recently started using AMD Ryzen CPUs, ASRock Rack motherboards and 
> Kingston unbuffered ECC DIMMs for our small business servers. All the
> servers are running on ZFS, for which ECC memory is recommended. So I naively
> tried to test that it actually works. I read EVERY discussion on EVERY forum I
> was able to find (and there are a lot of them, believe me), but I did not find
> a satisfying answer. According to the legendary tweet from AMD (which is
> linked in every discussion), the Ryzen CPUs should support ECC memory, but it
> is not a tested feature since they are consumer CPUs. The funny thing is that,
> according to their spec sheets, even EPYC class CPUs do not support it (the only
> CPUs with stated ECC support I found are the Ryzen Embedded ones - for example
> the V1605B in the UDOO Bolt). Nevertheless, the system reports that it works:
> dmidecode and lshw show it, the kernel loads the driver, an EDAC MC is present in
> /sys/devices/system/edac/mc, and even memtest86+ v6.0 and above reports ECC
> memory. In forum discussions, Intel guys are saying that correctable ECC
> errors are relatively common - stated counts vary, but I got the impression
> that at least one per week should appear. And our virtualization host, running
> for over half a year with more than 80% memory utilization, has not had a single
> one, neither in sysfs nor in the UEFI event log. I understand that the error count
> rises with height above mean sea level due to cosmic radiation and we are at
> 246 m altitude, but at least one error would be nice.
> The only thing I had success with was memory overclocking - I lowered the timings
> as far as possible for the system to still POST, and once Debian was running, it
> reported correctable errors from different memory regions (13 within 30
> minutes). Raising the memory frequency did not work. But all this was done on an
> Asus motherboard, though with the same memory and CPU. When I change any
> memory-related setting on the ASRock Rack motherboard, it will not POST.
> The kernel documentation describes that Intel CPUs have the ability to inject
> errors for driver testing, but I did not find anything like it for AMD. Does
> anyone know a way to test that ECC works without breaking the system
> first? Thank you for your answers.
> 
> PS: Some commercial memtests are allegedly able to inject ECC errors
> (for example the one from PassMark). Has anyone tried those?
> 
> Best regards,
> Kryštof
> 
just saying:

I am in the same boat ... (ZFS + AMD (2 EPYCs in my case) + ECC + unverified
behavior)
Previously, I was using Intel, where I got edac to work somehow, and it
even caught some correctable errors. But since I learned that edac had
been discontinued and that dmidecode is supposed to be used to get info
about the hardware interrupts caused by ECC memory, I have never seen one.
As a less than experienced Debian user, I got stuck on other problems and
forgot to pursue this issue. Now I am very much interested in the
hints/replies you may get, in order to finally test/straighten out my
infrastructure.
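
For anyone who wants to keep an eye on the sysfs counters mentioned above, here is a minimal sketch in Python. It assumes the EDAC driver is loaded and exposes the usual ce_count and ue_count files under /sys/devices/system/edac/mc/mc*/. The function name read_edac_counts is made up for illustration, and the script only reads counters; it does not inject or provoke any errors.

```python
from pathlib import Path

# Standard EDAC sysfs location; adjust if your system differs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller_name: (correctable, uncorrectable)}.

    Hypothetical helper: returns an empty dict when no EDAC
    memory controller directory is present.
    """
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found (driver not loaded?)")
    for name, (ce, ue) in counts.items():
        print(f"{name}: correctable={ce} uncorrectable={ue}")
```

A counter that stays at zero only shows that nothing was reported; by itself it does not prove that error reporting works end to end.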

Did you really read that EPYCs cannot support ECC?
At least I can say that my pools have not reported any faults in 3 years
either (which of course would be several layers above ECC), and that
does help with falling asleep. ;-)

Anyone experiencing some wind in his sails while sailing along similar
paths?

... Would be welcome ...
DdB
