Yes, but not all of those computers "matter," in the sense that if
their data is corrupted, the results are not tragic.  If your
PlayStation crashes, or you lose your save-game data, you probably
don't care.  I'm willing to take the risk of my phone's memory being
corrupted, as I assume it isn't ECC-protected throughout.

Even my Mac (which, I confess, I'm not sure uses ECC memory, though I
doubt it) holds only data I don't care about.  I can reproduce all of
it at need.  (Admittedly, the Mac also runs HFS+, which I have good
reason not to trust much further than I could throw a stack of 100
Macs, but that's beside the point.  Or maybe not -- a software bug is
far more likely to cause data corruption than the sort of memory error
ECC guards against, at least in my own experience.)

Anyway, this is what *BACKUPS* are for.

Memory corruption causing problems is an extremely rare event --
sufficiently rare that most people can forget about it MOST OF THE
TIME, and deal with failures, when they occur, by restoring from a
backup.  ECC is about preventing the need for most of those restores,
and therefore avoiding the downtime -- it is NOT a substitute for data
backups.

Putting your most treasured memories on a single system -- even with ECC --
without any backup is foolhardy.  Doing so on a system without ECC just
greatly increases your exposure, since RAM parts do indeed fail.

Anyway, unless you can cover a substantial portion of the other 90+% of
the ways that a bit error in memory can corrupt your data, there is
very little point in trying to cover this one.  (Compare that with
actual media errors, which are a far more frequent -- and thus more
serious -- problem.)

The idea that someone would blame ZFS for losing data that they didn't
back up, on a machine with corrupted memory, where they couldn't be
bothered to provide even ECC protection, is nothing short of
ridiculous.  It certainly should not be used as justification for
adding additional complexity to the code base.
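
For what it's worth, the failure mode being debated in the quoted
thread below is easy to sketch: a checksum is computed while the block
sits in RAM, a bit flips before the write, and verification fails on
reread.  A minimal illustration in Python -- with SHA-256 standing in
for the ZFS block checksum, and the buffer contents and names being
mine for illustration, not ZFS internals:

    import hashlib

    # A block is checksummed while it sits in RAM...
    block = bytearray(b"treasured family photos" * 8)
    checksum = hashlib.sha256(block).digest()

    # ...then a single bit flips in memory before the write completes.
    block[3] ^= 0x01

    # The flipped block lands on disk alongside the stale checksum.
    # On reread, verification fails even though the media is fine:
    assert hashlib.sha256(block).digest() != checksum
    print("checksum mismatch -- the block is reported as corrupt")

In this scenario ext4 would silently hand back the flipped bytes,
while ZFS refuses -- which is exactly the fail-fast trade-off Neal
describes below.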


On Mon, Jul 7, 2014 at 4:17 PM, Neal H. Walfield <n...@walfield.org> wrote:

> At Mon, 7 Jul 2014 16:08:15 -0700,
> Richard Elling wrote:
> >
> > On Jul 7, 2014, at 8:49 AM, Neal H. Walfield <n...@walfield.org> wrote:
> >
> > > At Mon, 07 Jul 2014 17:08:49 +0200,
> > > Saso Kiselkov wrote:
> > >>> 1. Correcting bit errors that occur in memory after the data has been
> > >>> hashed but before the data has been sent to disk.  When the data is
> > >>> reread, the checksum won't match and ZFS will choke.
> > >>
> > >> Can you quantify or even just cite confirmed examples of this taking
> > >> place? Blocks spend relatively little time in-flight (where random bit
> > >> flips can happen) and comparatively large amounts of time in storage
> on
> > >> platters or in flash cells (where they are already protected by the
> > >> native ECC of the drive).
> > >
> > > No, I know of no confirmed examples.
> > >
> > > However, this case (or something similar) is often cited on the
> > > zfs-disc...@zfsonlinux.org list as an example of why ECC is essential
> > > and **ZFS shouldn't be used on non-ECC systems**: whereas ext4 will
> > > give you back your corrupted data, ZFS will choke.  Now, there is an
> > > argument to be made for fail fast (i.e., at the ZFS layer, in this
> > > case), but there is another argument to be made for not choking.  And,
> > > given that a bit flip is often not actually bad (e.g., multimedia data
> > > is tolerant to some errors), it's hard to discount this argument.
> >
> > Armchair reliability engineering :-(
> >
> > If you can't trust the memory, then you can't trust the OS running on
> that memory.
> > Period. End of discussion.
>
> There are billions of computers that don't use ECC memory.