On Wed, Dec 24, 2014 at 5:16 AM, Holger Hoffstätte
<holger.hoffstae...@googlemail.com> wrote:
>
> There's light and deep scrub; the former does what you described,
> while deep does checksumming. In case of mismatch it should create
> a quorum. Whether that actually happens and/or works is another
> matter. ;)
>
If you have 10 copies of a file and 9 are identical and 1 differs, then
there is little risk in this approach. The problem is that if you have
two copies of the file and they differ, all it can do is pick one, which
is what I believe it does. So not only is this less efficient than n+2
or 2*n RAID, you end up needing 3-4*n redundancy. That is a LOT of
wasted space simply to avoid storing a checksum.

> Unfortunately a full point-in-time deep scrub and the resulting creation
> of checksums is more or less economically unviable with growing amounts
> of data; this really should be incremental.

Since checksums aren't stored anywhere, you end up having to scan every
node and compare all the checksums across them. Depending on how that
works it is likely to be a fairly synchronous operation, which makes it
much harder to deal with file access during the operation. It would be
better to have each node sequentially scan its disk, build an index of
checksums, sort the index, and then pass it to some central node to do
all the comparisons, rather than doing everything synchronously.

> I know how btrfs scrub works, but it too (and in fact every storage
> system) suffers from the problem of having to decide which copy is
> "good"; they all have different points in their timeline where they
> need to make a decision at which a checksum is considered valid. When
> we're talking about preventing bitrot, just having another copy is
> usually enough.
>
> On top of that btrfs will at least tell you which file is suspected,
> thanks to its wonderful backreferences.

btrfs maintains checksums for every block on the disk, stored apart from
those blocks. Sure, if your metadata and data all get corrupted at once
you could have problems, but you'll at least know that you have
problems.

A btrfs scrub is asynchronous - each disk can be checked independently
of the others, since there is no need to compare checksums for files
across disks: the checksums are pre-calculated.
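To make the scan-index-compare idea concrete, here is a minimal,
hypothetical Python sketch (all names are mine, not Ceph's): each node
builds a sorted (object, checksum) index on its own, and a central step
merges the indexes and majority-votes on each object. It also shows the
quorum problem above - with only two disagreeing copies there is no
majority to pick.

```python
import hashlib
from collections import Counter

def build_index(objects):
    """Scan one node's objects into a sorted (object_id, checksum) index.

    Each node can do this pass independently and asynchronously; only
    the small index needs to be shipped to a central node afterwards.
    """
    return sorted((oid, hashlib.sha256(data).hexdigest())
                  for oid, data in objects.items())

def find_winners(indexes):
    """Merge per-node indexes and majority-vote on each object's checksum.

    With 3+ replicas a single rotted copy is outvoted; with 2 replicas
    that disagree there is no majority, so the object can only be
    flagged (verdict None), which is the weakness discussed above.
    """
    by_object = {}
    for index in indexes:
        for oid, csum in index:
            by_object.setdefault(oid, []).append(csum)
    verdicts = {}
    for oid, csums in by_object.items():
        (winner, votes), = Counter(csums).most_common(1)
        # Require a strict majority before trusting any copy.
        verdicts[oid] = winner if votes > len(csums) // 2 else None
    return verdicts

# Three replicas; one replica of "b" has silently rotted.
node1 = {"a": b"hello", "b": b"world"}
node2 = {"a": b"hello", "b": b"world"}
node3 = {"a": b"hello", "b": b"w0rld"}  # bit-rotted copy of "b"
verdicts = find_winners([build_index(n) for n in (node1, node2, node3)])
# With 3 replicas, "b" still gets a 2-of-3 majority verdict.

# Same scenario with only 2 replicas: no majority, so no verdict.
two_way = find_winners([build_index(n) for n in (node1, node3)])
```

This is only a sketch of the approach, not what Ceph actually does, but
it shows why the comparison step can be cheap once the per-node scans
are done asynchronously.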
If a bad extent is found, it is re-copied from one of the good disks
(which of course is synchronous). Since the scans are asynchronous, a
btrfs scrub performs much better than a RAID scrub: a read against a
mirror disrupts only one of the device scrubs while the other proceeds.
Indeed, you could scrub the devices one at a time, so that only writes
and parallel reads take a hit (in mirrored mode).

btrfs is of course immature and can't yet recover errors in its raid5/6
modes, and those modes would not perform as well while being scrubbed,
since a read requires access to n disks and a write requires access to
n+1 (raid5) or n+2 (raid6) disks. (Note, though, that the checksums
make it safe to service a read without reading full parity - I have no
idea whether the btrfs implementation takes advantage of this.)

For a single-node system, btrfs (and of course zfs) have a much more
robust design IMHO. The node itself then becomes the bottleneck, and
that is what Ceph is intended to handle. The problem is that, like
pre-zfs RAID, Ceph handles total failure well and silent data
corruption less well. Indeed, unless it checks multiple nodes on every
read, a silent corruption is probably not going to be detected without
a scrub (while btrfs and zfs verify checksums on EVERY read, since that
is much less expensive than reading multiple copies). I'm sure this
could be fixed in Ceph, but it doesn't seem like anybody is
prioritizing it.

--
Rich
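For anyone who wants the checksum-on-read idea spelled out, here is a
toy Python sketch in the spirit of btrfs/zfs (the classes and names are
mine, purely illustrative): checksums are stored apart from the data
blocks, every read is verified against them, and a mismatch triggers a
repair from the other mirror - no second copy needs to be read on the
happy path.

```python
import hashlib

class Mirror:
    """One toy device: a mapping of block number -> data."""
    def __init__(self):
        self.blocks = {}
    def write(self, n, data):
        self.blocks[n] = data
    def read(self, n):
        return self.blocks[n]

class ChecksummedMirrorSet:
    """Mirrored storage that verifies a checksum on EVERY read."""
    def __init__(self, mirrors):
        self.mirrors = mirrors
        self.checksums = {}   # kept apart from the data blocks themselves

    def write(self, n, data):
        self.checksums[n] = hashlib.sha256(data).hexdigest()
        for m in self.mirrors:
            m.write(n, data)

    def read(self, n):
        # Only one device is touched per read; the stored checksum,
        # not a second copy, tells us whether the data is good.
        for i, m in enumerate(self.mirrors):
            data = m.read(n)
            if hashlib.sha256(data).hexdigest() == self.checksums[n]:
                # Any earlier mirror that failed the check is repaired
                # from this known-good copy (the synchronous part).
                for bad in self.mirrors[:i]:
                    bad.write(n, data)
                return data
        raise IOError("all copies of block %d are corrupt" % n)

fs = ChecksummedMirrorSet([Mirror(), Mirror()])
fs.write(0, b"important data")
fs.mirrors[0].blocks[0] = b"important d4ta"   # silent corruption
data = fs.read(0)   # mismatch detected, good copy returned, bad copy fixed
```

Without the stored checksum, detecting that corruption would require
reading and comparing both mirrors on every read - which is exactly the
cost difference argued above.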