On 20.02.2023 at 18:42, krys...@ibse.cz wrote:
> Dear Debian community,
> we recently started using AMD Ryzen CPUs, ASRock Rack motherboards and
> Kingston unbuffered ECC DIMMs for our small business servers. All the
> servers run on ZFS, for which ECC memory is recommended. So I naively
> tried to test that it actually works. I read EVERY discussion on EVERY
> forum I was able to find (and there are a lot of them, believe me), but
> I did not find a satisfying answer. According to the legendary tweet
> from AMD (which is linked in every discussion), the Ryzen CPUs should
> support ECC memory, but it is not a tested feature since they are
> consumer CPUs. The funny thing is that, according to their spec sheets,
> even EPYC-class CPUs do not support it (the only CPUs with stated ECC
> support I found are the Ryzen Embedded ones, for example the V1605B in
> the UDOO Bolt). Nevertheless, the system reports that it works:
> dmidecode and lshw show it, the kernel loads the driver and an EDAC MC
> is present in /sys/devices/system/edac/mc, and even memtest86+ v6.0 and
> above reports ECC memory. In forum discussions, Intel guys are saying
> that correctable ECC errors are relatively common. The stated counts
> vary, but I got the impression that at least one a week should appear.
> Yet our hypervisor, running for over half a year with more than 80%
> memory utilization, has not logged a single one, neither in sysfs nor
> in the UEFI event log. I understand that the error count rises with
> height above mean sea level due to cosmic radiation and we are at an
> altitude of 246 m, but at least one error would be nice.
> The only thing I had success with was memory overclocking: I lowered
> the timings as far as possible for the system to still POST, and while
> Debian was running, it reported correctable errors from different
> memory regions (13 within 30 minutes). Raising the memory frequency did
> not work. But all this was done on an Asus motherboard, though with the
> same memory and CPU.
> When I change any memory-related setting on the ASRock Rack
> motherboard, it will not POST.
> The kernel documentation describes that Intel CPUs have the ability to
> inject errors for driver testing, but I did not find anything like it
> for AMD. Does anyone know any way to test that ECC works without
> breaking the system first? Thank you for your answers.
>
> PS: Some commercial memtests should allegedly be able to inject ECC
> errors (for example the one from PassMark). Has anyone tried those?
>
> Best regards,
> Kryštof
>
Just saying:
I am in the same boat (ZFS + AMD (2 EPYCs in my case) + ECC + unverified behavior).

Previously I was using Intel, where I got edac to work somehow, and it even caught some correctable errors. But since I learned that edac went out of business and that dmidecode should be used to get information from the hardware interrupt caused by ECC memory, I have never seen one. As a less than experienced Debian user, I got stuck on other problems and so forgot to pursue this issue.

Now I am very much interested in the hints/replies you may get, so that I can finally test and straighten out my infrastructure.

Did you really read that EPYCs cannot support ECC?

At least I can say that my pools have not reported any faults in 3 years either (which of course would be several layers above ECC), and that did help me fall asleep. ;-)

Is anyone else catching some wind in their sails while sailing along similar paths? ... Would be welcome ...

DdB
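
For anyone who wants to keep an eye on the sysfs counters mentioned in this thread, here is a minimal sketch, not a verified ECC test. It assumes the kernel EDAC driver is loaded and exposes the usual per-controller ce_count (corrected errors) and ue_count (uncorrected errors) files under /sys/devices/system/edac/mc. The function name read_edac_counts is made up for this example.

```python
from pathlib import Path

# Location named in the original post; present only if an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce_count": n, "ue_count": n}} for each mc* directory.

    A value of None means the counter file was missing or unreadable.
    """
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC memory controller registered
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found under", EDAC_ROOT)
    for mc, entry in counts.items():
        print(f"{mc}: corrected={entry['ce_count']} uncorrected={entry['ue_count']}")
```

Run from cron or a monitoring agent, this only shows whether the counters ever move. A count of zero does not prove that ECC reporting works, which is exactly the open question here.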