On Wed, Dec 24, 2014 at 5:16 AM, Holger Hoffstätte
<holger.hoffstae...@googlemail.com> wrote:
>
> There's light and deep scrub; the former does what you described,
> while deep does checksumming. In case of mismatch it should create
> a quorum. Whether that actually happens and/or works is another
> matter. ;)
>

If you have 10 copies of a file and 9 are identical and 1 differs,
then there is little risk in this approach.  The problem is that if
you have two copies of the file and they are different, all it can do
is pick one, which is what I believe it does.  So not only is this
less efficient than n+2 raid or 2*n raid, you end up needing 3-4*n
redundancy.  That is a LOT of wasted space simply to avoid having a
checksum.
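
To make that concrete, here's a rough Python sketch (purely
illustrative, not Ceph code) of what picking a "good" copy by majority
looks like, and why it breaks down with only two copies:

# Hypothetical sketch: resolving a mismatch by majority vote among
# full replicas, which is the only option when no checksum is stored.
from collections import Counter

def pick_good_copy(replicas):
    """Return the most common replica content, or None on a tie."""
    counts = Counter(replicas).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # 2 replicas that disagree: no way to break the tie
    return counts[0][0]

# 9 of 10 copies agree -> the odd one out is almost certainly bad.
print(pick_good_copy([b"good"] * 9 + [b"rotten"]))   # b'good'
# 2 copies that disagree -> all you can do is pick one arbitrarily.
print(pick_good_copy([b"copy-a", b"copy-b"]))        # None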

> Unfortunately a full point-in-time deep scrub and the resulting creation
> of checksums is more or less economically unviable with growing amounts
> of data; this really should be incremental.

Since checksums aren't stored anywhere, you end up having to scan
every node, compute checksums on the fly, and compare them across
nodes.  Depending on how that works it is likely to be a fairly
synchronous operation, which makes it much harder to deal with file
access while it runs.  If each node instead just sequentially scanned
its disk, created an index, sorted the index, and then passed it on to
some central node to do all the comparisons, that would be better than
doing it completely synchronously.
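
Something like this hypothetical Python sketch (the paths and helper
names are made up, and a real implementation would hash in chunks
rather than reading whole files):

# Sketch of the "scan locally, compare centrally" approach described
# above; object naming and layout are illustrative, not Ceph's.
import hashlib
import os

def scan_node(root):
    """Sequentially scan one node's store; return a sorted
    (name, checksum) index."""
    index = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            index.append((os.path.relpath(path, root), digest))
    return sorted(index)

def compare_indexes(indexes):
    """Central step: flag any object whose checksum differs across nodes."""
    seen = {}
    mismatches = set()
    for index in indexes:
        for name, digest in index:
            if name in seen and seen[name] != digest:
                mismatches.add(name)
            seen.setdefault(name, digest)
    return mismatches

# Each node runs scan_node() on its own disks; only the small sorted
# indexes travel to the comparing node, not the data itself.
# bad = compare_indexes([scan_node("/srv/node-a"), scan_node("/srv/node-b")])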

>
> I know how btrfs scrub works, but it too (and in fact every storage system)
> suffers from the problem of having to decide which copy is "good"; they
> all have different points in their timeline where they need to make a
> decision at which a checksum is considered valid. When we're talking
> about preventing bitrot, just having another copy is usually enough.
>
> On top of that btrfs will at least tell you which file is suspected,
> thanks to its wonderful backreferences.

btrfs maintains checksums for every block on the disk, stored apart
from the blocks they cover.  Sure, if your metadata and data all get
corrupted at once you could have problems, but you'll at least know
that you have problems.  A btrfs scrub is asynchronous - each disk can
be checked independently of the others, as there is no need to compare
checksums for files across disks; the checksums are pre-calculated.
If a bad extent is found, it is re-copied from one of the good disks
(which of course is synchronous).
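
Roughly, the per-device scrub loop looks something like this sketch
(not actual btrfs code - the block I/O callbacks and the crc32
stand-in are assumptions):

# Hypothetical checksum-driven scrub of a single device.
import zlib

def scrub_device(read_block, stored_checksums, read_mirror_block, write_block):
    """Scrub one device independently: verify each block against its
    stored checksum, and only touch the mirror to repair a bad block."""
    for block_no, expected in stored_checksums.items():
        data = read_block(block_no)
        if zlib.crc32(data) == expected:
            continue  # block matches its pre-calculated checksum
        # Mismatch: fetch the copy from a good mirror (the synchronous part)
        good = read_mirror_block(block_no)
        assert zlib.crc32(good) == expected, "mirror copy is also bad"
        write_block(block_no, good)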

Because the scans are asynchronous, this performs a lot better than a
RAID scrub: a read against a mirror only needs to disrupt one of the
device scrubs while the other proceeds.  Indeed, you could just scrub
the devices one at a time, and then only writes or parallel reads take
a hit (for mirrored mode).

Btrfs is of course immature and can't yet recover errors for its
raid5/6 modes, and those raid modes would not perform as well while
being scrubbed, since a read requires access to n disks and a write
requires access to n+1 or n+2 disks (note though that the use of
checksums makes it safe to do a read without reading full parity - I
have no idea if the btrfs implementation takes advantage of this).

For a single-node system btrfs (and of course zfs) has a much more
robust design IMHO.  Now, of course the node itself becomes the
bottleneck, and that is what ceph is intended to handle.  The problem
is that, like pre-zfs RAID, it handles total failure well and data
corruption less well.  Indeed, unless it always checks multiple nodes
on every read, a silent corruption is probably not going to be
detected without a scrub (while btrfs and zfs compare checksums on
EVERY read, since that is much less expensive than reading multiple
devices).
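
To illustrate the difference between the two read paths (hypothetical
sketch - neither function is real btrfs/zfs or ceph code):

import zlib

def read_with_checksum(read_block, checksums, block_no):
    """btrfs/zfs style: one local read plus a cheap checksum compare."""
    data = read_block(block_no)
    if zlib.crc32(data) != checksums[block_no]:
        raise IOError("silent corruption detected on read")
    return data

def read_with_replica_compare(read_local, read_remote, block_no):
    """Checksum-less style: detecting corruption means also reading
    (at least) one other node's copy on every read."""
    local = read_local(block_no)
    if local != read_remote(block_no):
        raise IOError("replicas disagree - no checksum to say which is good")
    return local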

I'm sure this could be fixed in ceph, but it doesn't seem like anybody
is prioritizing that.

--
Rich
