> On 9/15/06, can you guess? <[EMAIL PROTECTED]> wrote:

...

> file-level, however, is really pushing it.  You might end up with an
> administrative nightmare deciphering which files have how many copies.

I'm not sure what you mean:  the level of redundancy would be a per-file
attribute that could be examined, and would normally just be defaulted to a
common value.
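
Just to illustrate the administrative model I have in mind (a throwaway sketch,
not anything in ZFS - the names and the default of 2 are made up): a per-file
setting that falls back to a filesystem-wide default means a single trivial
lookup answers "how many copies does this file have?", so I don't see where the
nightmare comes from:

    # Toy model of a per-file redundancy attribute with a common default.
    FS_DEFAULT_COPIES = 2          # hypothetical filesystem-wide default
    explicit_copies = {}           # only files someone explicitly overrode

    def set_copies(path, n):
        explicit_copies[path] = n

    def get_copies(path):
        # One lookup tells you how many copies any given file has.
        return explicit_copies.get(path, FS_DEFAULT_COPIES)

    set_copies("/tank/docs/thesis.tex", 3)
    print(get_copies("/tank/docs/thesis.tex"))   # 3 (explicitly requested)
    print(get_copies("/tank/tmp/scratch.dat"))   # 2 (just the default)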

...

> > It would be interesting to know whether that would still be your
> > experience in environments that regularly scrub active data as ZFS
> > does (assuming that said experience was accumulated in environments
> > that don't).  The theory behind scrubbing is that all data areas
> > will be hit often enough that they won't have time to deteriorate
> > (gradually) to the point where they can't be read at all, and early
> > deterioration encountered during the scrub pass (or other access)
> > in which they have only begun to become difficult to read will
> > result in immediate revectoring (by the disk or, if not, by the
> > file system) to healthier locations.
> 
> Scrubbing exercises the disk area to prevent bit-rot.  I do not think
> FS's scrubbing changes the failure mode of the raw devices.

It doesn't change the failure rate (if anything, it might accelerate it 
marginally due to the extra disk activity), but it *does* change, potentially 
radically, the frequency with which sectors containing user data become 
unreadable - because it allows them to be detected *before* that happens such 
that the data can be moved to a good sector (often by the disk itself, else by 
higher-level software) and the failing sector marked bad.
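
To make that concrete, here's a toy sketch of the idea (purely illustrative - it
is not ZFS's scrub code, and the 'weak sector' model is invented): the scrub
touches every allocated block, and any block that is merely getting hard to
read is rewritten while its contents are still recoverable:

    import random

    GOOD, WEAK = "good", "weak"    # a real drive would also have dead sectors

    # block id -> (data, health); WEAK blocks still read, but not for long.
    disk = {i: (("data-%d" % i).encode(),
                random.choice([GOOD, GOOD, GOOD, WEAK]))
            for i in range(8)}

    def scrub():
        for blk in disk:
            data, health = disk[blk]       # the scrub touches *every* block
            if health == WEAK:
                # Rewriting refreshes the sector (and gives the drive the
                # chance to revector it to a spare); the data never gets
                # the opportunity to become unreadable.
                disk[blk] = (data, GOOD)
                print("block %d rewritten before it could fail" % blk)

    scrub()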

> OTOH, I really have no such experience to speak of *fingers crossed*.
> I failed to locate the code where the relocation of files happens but
> assume that copies would make this process more reliable.

Sort of:  while they don't make any difference when you catch a failing sector 
while it's still readable, they certainly help if you only catch it after it's 
become unreadable (or has been 'silently' corrupted).
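
Again purely as an illustration (my sketch, not ZFS's actual read path; the
checksum choice is arbitrary): with an extra copy on hand, a read that fails
its checksum just falls back to the other copy, and can repair the bad one in
passing:

    import zlib

    payload = b"important bytes"
    stored_cksum = zlib.crc32(payload)

    # Two copies of the same block; copy 0 has 'silently' rotted.
    copies = [b"importunt bytes", payload]

    def read_block():
        for n, data in enumerate(copies):
            if zlib.crc32(data) == stored_cksum:
                if n > 0:
                    copies[0] = data       # repair the bad copy in passing
                return data
        raise IOError("every copy failed its checksum - data loss")

    print(read_block())                    # returns the intact second copy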

> > Since ZFS-style scrubbing detects even otherwise-undetectable
> > 'silent corruption' missed by the disk's own ECC mechanisms, that
> > lower-probability event is also covered (though my impression is
> > that the probability of even a single such sector may be
> > significantly lower than that of whole-disk failure, especially in
> > laptop environments).
> 
> I do not have any data to support or dismiss that.

Quite a few years ago Seagate still published such data, but of course I didn't 
copy it down (because it was 'always available' when I wanted it - as I said, 
it was quite a while ago and I was not nearly as well-acquainted with the 
volatility of Internet data as I would subsequently become).  But to the best 
of my recollection their enterprise disks at that time were specced to have no 
worse than 1 uncorrectable error for every petabit read and no worse than 1 
undetected error for every exabit read.

A fairly recent paper by people who still have access to such data suggests 
that the frequency of uncorrectable errors in enterprise drives is still about 
the same, but that the frequency of undetected errors may have increased 
markedly (to perhaps once in every 10 petabits read) - possibly a result of 
ever-increasing on-disk bit densities and the more aggressive error correction 
required to handle them (perhaps this is part of the reason they don't make 
error rates public any more...).  They claim that SATA drives have error rates 
around 10x those of enterprise drives (or an undetected error rate of around 
once per petabit).

Divide that error rate by a laptop drive's average data rate and you get a mean 
time to encountering undetected corruption.  Compare that to the drive's in-use 
MTBF rating and there you go!  If I haven't dropped a decimal place or three 
doing this in my head, then even if laptop drives have nominal MTBFs equal to 
desktop SATA drives it looks as if it would take an average data rate of 60 - 
70 KB/sec (24/7, year-in, year-out) for an undetected error to become about as 
likely as a whole-disk failure:  that's certainly nothing much for a fairly 
well-loaded server in constant (or even just 40-hour/week) use, but for a 
laptop?
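
For anyone who'd rather not do that in their head either, here's the
back-of-the-envelope version (the 600,000-hour MTBF is my assumption for a
desktop-class SATA drive, not a quoted spec):

    # 1 undetected error per petabit (the SATA figure above) = 125 TB read
    # per undetected error, on average.
    bytes_per_undetected_error = 1e15 / 8

    # Assumed whole-disk MTBF: 600,000 hours (adjust to taste).
    mtbf_seconds = 600000 * 3600

    # Sustained data rate at which the mean time to an undetected error
    # equals the MTBF:
    rate = bytes_per_undetected_error / mtbf_seconds
    print("%.0f KB/sec" % (rate / 1024))   # ~57 KB/sec; ~68 at 500,000 hours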

- bill
 
 