On 2015-11-04 23:06, Duncan wrote:
(Tho I should mention, while not on zfs, I've actually had my own
problems with ECC RAM too.  In my case, the RAM was certified to run at
speeds faster than it was actually reliable at: the stored data, which
is what ECC protects, was fine, but data was getting damaged in transit
to/from the RAM.  On a lightly loaded system, such as one running many
memory tests or under normal desktop usage conditions, the RAM was
generally fine, no problems.  But on a heavily loaded system, such as
when doing parallel builds (I run gentoo, which builds from sources in
order to get the higher level of option flexibility that comes only
when you can toggle build-time options), I'd often have memory faults,
and my builds would fail.

The most common failure, BTW, was on tarball decompression, bunzip2 or
the like, since the compressed streams contain checksums that are
verified during decompression, and they'd often fail to verify.
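
A minimal sketch of why that failure mode is so visible, assuming a
local file named sample.tar.bz2 (the filename is illustrative, not from
the original report): the bzip2 format embeds a CRC per compressed
block plus a whole-stream CRC, and the decompressor checks them as it
goes, so a bitflip in transit surfaces as a hard error rather than as
silently corrupt output.

#!/usr/bin/env python3
# Minimal sketch: stream-decompress a .bz2 file and report whether its
# embedded CRCs verify.  The filename is an illustrative assumption.
import bz2

def verify_bz2(path: str) -> bool:
    try:
        with bz2.open(path, "rb") as f:
            # Decompress in 1 MiB chunks; CRCs are checked as we read.
            while f.read(1 << 20):
                pass
        return True
    except (OSError, EOFError) as exc:
        # OSError on CRC/stream errors, EOFError on truncation.
        print(f"integrity check failed: {exc}")
        return False

if __name__ == "__main__":
    print("OK" if verify_bz2("sample.tar.bz2") else "CORRUPT")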

Once I updated the BIOS to one that would let me set the memory speed
instead of using the speed the modules themselves reported, and I
declocked the memory just one notch (this was DDR1; IIRC I declocked
from the PC3200 it was rated at to PC3000 speeds), not only was the
memory then 100% reliable, but I could and did actually reduce the
number of wait-states for various operations, and it was STILL 100%
reliable.  It simply couldn't handle the raw speed it was certified
for, is all, tho it handled it well enough, enough of the time, to make
the problem far more difficult to diagnose and confirm than it would
have been had the problem appeared at low load as well.

As it happens, I was running reiserfs at the time, and it handled both
that hardware issue, and a number of others I've had, far better than
I'd have expected of /any/ filesystem when the memory feeding it is
simply not reliable.  Reiserfs metadata, in particular, seems
incredibly resilient in the face of hardware issues, and I lost far
less data than I might have expected, tho without checksums and with
bad memory, I imagine I had occasional bitflip corruption in files here
or there that simply went unnoticed.  I still use reiserfs on my
spinning rust today, but it's not well suited to SSD, which is where I
run btrfs.

But the point for this discussion is that just because it's ECC RAM
doesn't mean you can't have memory-related errors, just that if you do,
they're likely to be different errors: "transit errors" that tend to go
undetected by many memory checkers, at least the ones that don't run at
full memory bandwidth because they're simply checking that what was
stored in a cell can be read back unchanged.)
I've actually seen similar issues with both ECC and non-ECC memory
myself. Any time I'm getting RAM for a system where I can afford to
over-spec, I get the next higher speed grade and under-clock it (which
in turn means I can tighten the timing parameters, usually ending up
with a faster system than if I were running it at the rated speed).

FWIW, I also make a point of doing multiple memtest86+ runs when I get
new RAM (at a minimum, one running single-core and one with forced
SMP), and I even have a run-level configured on my Gentoo-based home
server where it boots Xen and fires up twice as many VMs running
memtest86+ as I have CPU cores. That's usually enough to fully saturate
memory bandwidth and check for the type of issues you mentioned above;
see the sketch below for the basic per-core verify idea. (Although the
BOINC client I run usually does a good job of triggering those kinds of
issues quickly, since distributed-computing apps tend to be
memory-bound and use a lot of memory bandwidth.)
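
For a rough feel for that verify-under-load approach without setting up
Xen, here's a sketch in Python. It assumes a system with enough free
RAM for one 64 MiB buffer per core; pure Python won't come close to
memtest86+ or a parallel build in raw bandwidth, but the structure is
the same: one readback-verify loop per core keeps the memory bus busy,
and comparing data after a round trip makes in-transit corruption
visible even when the cells themselves hold data correctly.

#!/usr/bin/env python3
# Rough sketch: one verify worker per CPU core, each streaming a fresh
# pseudo-random pattern through RAM and comparing it on readback.
# Buffer size and pass count are illustrative assumptions.
import multiprocessing as mp
import os

BUF_SIZE = 64 * 1024 * 1024  # 64 MiB per worker, enough to defeat caches
PASSES = 16

def worker(idx: int) -> None:
    for p in range(PASSES):
        pattern = os.urandom(BUF_SIZE)  # fresh pseudo-random pattern each pass
        copy = bytearray(pattern)       # forces a full read + write sweep
        if copy != pattern:             # readback compare catches corruption
            print(f"worker {idx}: mismatch on pass {p}, possible transit error")
            return
    print(f"worker {idx}: all {PASSES} passes verified")

if __name__ == "__main__":
    procs = [mp.Process(target=worker, args=(i,))
             for i in range(os.cpu_count() or 1)]
    for pr in procs:
        pr.start()
    for pr in procs:
        pr.join()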
