On 2015-11-04 23:06, Duncan wrote:
I've actually seen similar issues with both ECC and non-ECC memory myself. Any time I'm getting RAM for a system I can afford to over-spec, I buy the next higher speed grade and under-clock it, which in turn means I can tighten the timing parameters and usually end up with a faster system than if I ran it at the rated speed.

FWIW, I also make a point of doing multiple memtest86+ runs on new RAM, at a minimum one single-core run and one with forced SMP. I even have a runlevel configured on my Gentoo-based home server that boots Xen and fires up twice as many VMs running memtest86+ as I have CPU cores, which is usually enough to fully saturate memory bandwidth and catch the kind of issues you mentioned having above (see the sketch below). The BOINC client I run usually does a good job of triggering those kinds of issues quickly as well; distributed-computing apps tend to be memory-bound and use a lot of memory bandwidth.

Though I should mention that, while not on ZFS, I've had my own problems with ECC RAM too. In my case, the RAM was certified for speeds faster than it could actually run reliably, in such a way that the stored data, which is what the ECC protects, was fine; the data was getting damaged in transit to and from the RAM. On a lightly loaded system, such as one running most memory tests or under normal desktop use, the RAM was generally fine, no problems. But under heavy load, such as during parallel builds (I run Gentoo, which builds from sources in order to get the level of option flexibility that only comes when you can toggle build-time options), I'd often hit memory faults and my builds would fail.

The most common failure, BTW, was on tarball decompression, bunzip2 or the like, since the tarballs contain checksums that are verified on decompression, and they'd often fail to verify. Once I updated the BIOS to one that let me set the memory speed instead of taking the speed the modules themselves reported, and declocked the memory just one notch (this was DDR1; IIRC I declocked from the PC3200 it was rated at to PC3000 speeds), not only was the memory then 100% reliable, I could and did reduce the number of wait-states for various operations, and it was STILL 100% reliable. It simply couldn't handle the raw speed it was certified for, is all, though it handled it well enough, enough of the time, to make the problem far harder to diagnose and confirm than it would have been had it appeared at low load as well.

As it happens, I was running reiserfs at the time, and it handled that hardware issue, and a number of others I've had, far better than I'd have expected of /any/ filesystem being fed by unreliable memory. Reiserfs metadata in particular seems incredibly resilient in the face of hardware issues, and I lost far less data than I might have expected. Without checksums and with bad memory, I imagine there was the occasional undetected bitflip in a file here or there, but nothing I ever noticed. I still use reiserfs on my spinning rust today, but it's not well suited to SSDs, which is where I run btrfs.
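To make the saturate-the-bus idea concrete, here's a minimal sketch in C, assuming a POSIX system with pthreads. It is not memtest86+, and the thread count (twice the core count, as in the Xen setup above), buffer size, and pass count are arbitrary illustrative choices. Each thread fills a buffer well beyond cache size with a pattern and verifies it immediately, in a loop, so the memory bus stays busy:

/* Rough sketch, not memtest86+: hammer the memory bus from many
 * threads at once.  The volatile qualifier keeps the compiler from
 * optimizing away the traffic we're trying to generate.
 * Build: cc -O2 -pthread memhammer.c -o memhammer  (name hypothetical)
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

enum { PASSES = 16 };
#define BUF_WORDS (64UL * 1024 * 1024 / sizeof(uint64_t)) /* 64 MiB/thread */

static void *hammer(void *arg)
{
    long id = (long)arg;
    volatile uint64_t *buf = malloc(BUF_WORDS * sizeof(uint64_t));
    if (!buf) { perror("malloc"); return NULL; }

    for (int pass = 0; pass < PASSES; pass++) {
        uint64_t base = 0x5555555555555555ULL ^ (uint64_t)pass;
        for (size_t i = 0; i < BUF_WORDS; i++)    /* store a pattern... */
            buf[i] = base ^ i;
        for (size_t i = 0; i < BUF_WORDS; i++)    /* ...verify it right away */
            if (buf[i] != (base ^ i))
                fprintf(stderr, "thread %ld: mismatch at word %zu\n", id, i);
    }
    free((void *)buf);
    return NULL;
}

int main(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    long nthreads = 2 * (ncpus > 0 ? ncpus : 1);  /* twice the core count */
    pthread_t *tid = malloc(nthreads * sizeof *tid);
    if (!tid) return 1;

    for (long i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, hammer, (void *)i);
    for (long i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);

    puts("done; silence above means no mismatches were seen");
    free(tid);
    return 0;
}

The point of the over-subscription is that with more runnable threads than cores, and per-thread buffers far larger than the caches, nearly every access goes out to DRAM, which is the load level where transit errors like the ones described above actually show up.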
But the point for this discussion is that just because it's ECC RAM doesn't mean you can't have memory-related errors; it just means that if you do, they're likely to be a different kind, "transit errors", which tend to go undetected by many memory checkers, at least the ones that never push full memory bandwidth because they're simply verifying that what was stored in a cell can be read back unchanged.
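For contrast, here's a sketch of the kind of low-bandwidth store-and-read-back check such RAM sails through (again just illustrative C, sizes arbitrary):

/* One thread, write a pattern, read it back: nowhere near full bus
 * load.  The stored bits are fine (that's what the ECC covers); the
 * damage happened in transit, under load this test never generates.
 * Build: cc -O2 celltest.c -o celltest  (name hypothetical)
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t words = 16UL * 1024 * 1024;               /* 128 MiB */
    volatile uint64_t *buf = malloc(words * sizeof(uint64_t));
    if (!buf) return 1;

    for (size_t i = 0; i < words; i++)               /* store... */
        buf[i] = i * 0x9E3779B97F4A7C15ULL;
    for (size_t i = 0; i < words; i++)               /* ...read back */
        if (buf[i] != i * 0x9E3779B97F4A7C15ULL)
            printf("mismatch at word %zu\n", i);

    puts("pass (which says little about behavior at full load)");
    free((void *)buf);
    return 0;
}

A single thread touching memory at this pace never stresses the bus the way a parallel build does, so a marginal-at-rated-speed module can pass this sort of check all day and still corrupt data under real load.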