Avi Kivity wrote:
Stephan von Krawczynski wrote:
- filesystem autodetects, isolates, and (possibly) repairs errors
- online "scan, check, repair filesystem" tool initiated by admin
- Reliability so high that they never run that check-and-fix tool
That is _wrong_ (to a certain extent). You _want to run_ diagnostic
tools to make sure that there is no problem. And you don't want some
software (not even HAL) to repair errors without prior admin
knowledge/permission.
I think there's a place for a scrubber to continuously verify
filesystem data and metadata, at low io priority, and correct any
correctable errors. The admin can read the error correction report at
their leisure, and then take any action that's outside the
filesystem's capabilities (like ordering and installing new disks).
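As a rough illustration of such a scrubber (hypothetical, not btrfs
code; the device path and chunk size are arbitrary), a minimal
userspace sketch could just read a device end to end at low priority
and build the error report for the admin:

/* Minimal scrub-loop sketch: read a block device end to end and
 * report unreadable regions for the admin to review later. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (1 << 20)  /* 1 MiB per read */

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sdb"; /* made-up default */
    char *buf = malloc(CHUNK);
    off_t off = 0;
    long bad = 0;
    int fd = open(dev, O_RDONLY);

    if (fd < 0 || !buf) {
        perror("setup");
        return 1;
    }
    for (;;) {
        ssize_t n = pread(fd, buf, CHUNK, off);
        if (n == 0)             /* end of device */
            break;
        if (n < 0) {
            /* Log the bad region for the report and skip past it. */
            fprintf(stderr, "unreadable chunk at offset %lld: %s\n",
                    (long long)off, strerror(errno));
            bad++;
            off += CHUNK;
            continue;
        }
        off += n;
    }
    printf("scrub done: %lld bytes scanned, %ld bad chunks\n",
           (long long)off, bad);
    close(fd);
    free(buf);
    return bad ? 2 : 0;
}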
Scrubbing is key for many scenarios, since errors can "grow" even in
places where earlier IO completed without flagging an error.
Some neat tricks are:
(1) use block level scrubbing to detect any media errors. If you can
map that sector-level error back to a file system object (metadata, file
data or unallocated space), tools can recover (fsck, get another copy of
the file, or just ignore it!). There is a special command called
"READ_VERIFY" that can be used to validate sectors without actually
moving data from the target to the host, so you can scrub without
consuming page cache, etc. (see the SG_IO sketch after this list).
(2) sign and validate objects at the file level, say by checking a
digital signature. This can catch high-level errors (say, the app
messed up); see the digest-check sketch after this list.
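A minimal sketch of trick (1), assuming a SCSI/SATA disk reachable
through Linux's SG_IO ioctl (the device path and block range are
hypothetical). SCSI VERIFY(10) asks the drive to check the medium
itself, so no data crosses the bus into the page cache:

/* Ask the drive to verify sectors in place via SCSI VERIFY(10). */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

static int verify_blocks(int fd, unsigned lba, unsigned short nblocks)
{
    unsigned char cdb[10] = { 0x2f };       /* VERIFY(10), BYTCHK=0 */
    unsigned char sense[32];
    struct sg_io_hdr io;

    cdb[2] = lba >> 24; cdb[3] = lba >> 16; /* big-endian LBA */
    cdb[4] = lba >> 8;  cdb[5] = lba;
    cdb[7] = nblocks >> 8; cdb[8] = nblocks;

    memset(&io, 0, sizeof(io));
    io.interface_id = 'S';
    io.cmd_len = sizeof(cdb);
    io.cmdp = cdb;
    io.dxfer_direction = SG_DXFER_NONE;     /* no data moved to the host */
    io.sbp = sense;
    io.mx_sb_len = sizeof(sense);
    io.timeout = 20000;                     /* ms */

    if (ioctl(fd, SG_IO, &io) < 0)
        return -1;
    return io.status ? -1 : 0;  /* nonzero SCSI status => check sense data */
}

int main(void)
{
    int fd = open("/dev/sg0", O_RDWR);      /* hypothetical device */
    if (fd < 0) { perror("open"); return 1; }
    if (verify_blocks(fd, 0, 2048))
        fprintf(stderr, "verify failed; map the LBA back to an fs object\n");
    close(fd);
    return 0;
}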
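And a minimal sketch of trick (2), hedged the same way: here the
"signature" is just a SHA-256 digest the application stored earlier in
a user xattr (the name "user.sha256" is made up for illustration); a
real deployment would verify a proper cryptographic signature instead:

/* Hash the file and compare against the digest stored in a user xattr. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/xattr.h>
#include <openssl/evp.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "data.bin";
    unsigned char want[EVP_MAX_MD_SIZE], got[EVP_MAX_MD_SIZE];
    unsigned int len = 0;
    char buf[65536];
    ssize_t n, wn;
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    int fd = open(path, O_RDONLY);

    if (fd < 0 || !ctx) { perror("setup"); return 1; }

    /* Digest stored earlier by the application. */
    wn = fgetxattr(fd, "user.sha256", want, sizeof(want));
    if (wn < 0) { perror("fgetxattr"); return 1; }

    EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        EVP_DigestUpdate(ctx, buf, n);
    EVP_DigestFinal_ex(ctx, got, &len);

    if (wn != (ssize_t)len || memcmp(want, got, len)) {
        fprintf(stderr, "%s: content does not match stored digest\n", path);
        return 2;
    }
    printf("%s: OK\n", path);
    return 0;
}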
Note that this scrubbing needs to be carefully tuned so it does not
interfere with the foreground workload; using something like ionice, or
one of the IO controllers being kicked about, might help :-) (a sketch
of switching to the idle IO class follows).
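For that tuning point, a sketch of what "ionice -c 3" does under the
hood: glibc has no wrapper for ioprio_set, so the raw syscall is used,
and the constants mirror linux/ioprio.h:

/* Drop the scrubber to the idle IO class before it starts issuing
 * reads, so it only runs when the disk is otherwise quiet. */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_CLASS_IDLE   3
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
    /* who == 0 means the calling process */
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) < 0) {
        perror("ioprio_set");
        return 1;
    }
    /* ... scrub loop goes here, now at idle IO priority ... */
    return 0;
}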
ric