krys...@ibse.cz wrote:
> Dear Debian community,
> we recently started using AMD Ryzen CPUs, ASRock Rack motherboards and
> Kingston unbuffered ECC DIMMs for our small business servers. All the
> servers run on ZFS, for which ECC memory is recommended, so I naively
> tried to test that it actually works. I read EVERY discussion on EVERY
> forum I was able to find (and there are a lot of them, believe me), but I
> did not find a satisfying answer. According to the legendary tweet from AMD
> (which is linked in every discussion), the Ryzen CPUs should support ECC
> memory, but it is not a tested feature since they are consumer CPUs. The
> funny thing is that, according to their spec sheets, even EPYC class CPUs
> do not support it (the only CPUs with stated ECC support I found are the
> Ryzen Embedded ones, for example the V1605B in the UDOO Bolt). Nevertheless
> the system reports that it works: dmidecode and lshw show ECC, the kernel
> loads the driver and an EDAC MC is present in /sys/devices/system/edac/mc,
> and even memtest86+ v6.0 and above reports ECC memory. In forum discussions
> Intel users say that correctable ECC errors are relatively common. The
> stated counts vary, but I got the impression that at least one a week
> should appear. Yet our hypervisor, which has been running for over half a
> year with more than 80% memory utilization, has not logged a single one,
> neither in sysfs nor in the UEFI event log. I understand that the error
> count rises with height above mean sea level due to solar radiation and we
> are at 246 m altitude, but at least one error would be nice.
> The only thing I had success with was memory overclocking: I tightened the
> timings as far as the system would still POST, and once Debian was running
> it reported correctable errors from different memory regions (13 in 30
> minutes). Raising the memory frequency did not work. All of this was done
> on an Asus motherboard, however, with the same memory and CPU. When I
> change any memory-related setting on the ASRock Rack motherboard, it will
> not POST.
> The kernel documentation describes that Intel CPUs have the ability to
> inject errors for driver testing, but I did not find anything like it for
> AMD. Does anyone know a way to test that ECC works without breaking the
> system first? Thank you for your answers.
>
> PS: Some commercial memory testers are allegedly able to inject ECC errors
> (for example the one from PassMark). Has anyone tried those?

We see ECC errors irregularly and infrequently on both Intel and AMD CPUs.
One a week would be very concerning if we're talking about one system, but
not too concerning if we are discussing a thousand systems.

-dsr-
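
For anyone who wants to watch the sysfs counters mentioned in the quoted message, here is a minimal sketch in Python. It assumes the EDAC driver exposes one mcN directory per memory controller under /sys/devices/system/edac/mc, each containing ce_count (correctable) and ue_count (uncorrectable) files, as described in the kernel's EDAC documentation. The function name read_edac_counts is made up for this example.

```python
from pathlib import Path

# Location named in the quoted message; the per-controller file names
# (ce_count, ue_count) are assumed from the kernel EDAC sysfs documentation.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {"mcN": {"ce": int|None, "ue": int|None}} for each controller.

    read_edac_counts is a hypothetical helper name, not part of any tool.
    A value of None means the counter file was missing or unreadable.
    """
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # EDAC driver not loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / filename).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("no EDAC memory controllers found")
    for name, c in result.items():
        print(f"{name}: correctable={c['ce']} uncorrectable={c['ue']}")
```

Note that zero in both counters does not by itself prove that ECC reporting works. It is equally consistent with healthy memory, which is exactly the ambiguity raised in the original question.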