On Thu, Apr 11, 2013 at 02:10:37AM +0000, James Harper wrote:
>
> > with disks (and raid arrays) of that size, you also have to be concerned
> > about data errors as well as disk failures - you're pretty much
> > guaranteed to get some, either unrecoverable errors or, worse, silent
> > corruption of the data.
> 
> Guaranteed over what time period? 

any time period.  it's a function of the quantity of data, not of time.

> It's easy to fault your logic as I just did a full scan of my array
> and it came up clean.

no, it's not.  your array scan checks for DISK errors.  It does not check for
data corruption - THAT is the huge advantage of filesystems like ZFS and
btrfs: they can detect and correct data errors.

> If you say you are "guaranteed to get some" over, say, a 10 year
> period, then I guess that's fair enough. But as you don't specify a
> timeframe I can't really contest the point.

you seem to be confusing data corruption with MTBF or similar, it's
not like that at all. it's not about disk hardware faults, it's about
the sheer size of storage arrays these days making it a mathematical
certainty that some corruption will occur - write errors due to, e.g.,
random bit-flips, controller brain-farts, firmware bugs, cosmic rays,
and so on.

e.g. a typical quoted rating of 1 unrecoverable error per 10^14 bits
read works out to one error per roughly 12.5 terabytes - i.e. reading
your four x 3TB array end to end, you should statistically expect about
one error in the data.

one error in 10^14 bits is nothing to worry about with 500GB drives.
it's starting to get worrisome with 1 and 2TB drives.  with 10+TB arrays
you should expect roughly one error per full read....and even a single
full read of a 3 or 4TB drive has roughly a 20-30% chance of hitting at
least one data error.
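the arithmetic is simple enough to do on the command line (this assumes
the quoted 1-per-10^14-bits rate and a single end-to-end read of the
whole array):

```shell
# expected read errors for a 4 x 3TB array at 1 error per 10^14 bits read
awk 'BEGIN {
    bits = 4 * 3e12 * 8          # 4 drives x 3 TB x 8 bits per byte
    rate = 1e-14                 # quoted unrecoverable read error rate
    printf "expected errors per full read: %.2f\n", bits * rate
}'
# prints: expected errors per full read: 0.96
```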


> I can say though that I do monitor the SMART values which do track
> corrected and uncorrected error rates, and by extrapolating those
> figures I can say with confidence that there is not a guarantee of
> unrecoverable errors.

SMART values really only tell you about errors the drive itself has
detected. they don't tell you *anything* about data corruption problems -
for that, you actually need to check the data...and to check the data
you need a redundant copy or copies AND a hash of what it's supposed to
be.

with mdadm, a check/repair pass can detect a mismatch between mirrors
(or between data and parity), but it has no checksum to tell it which
copy is the correct one. with zfs, every block is checksummed, so it
knows which copy is good, and because it's a COW filesystem all that
needs to be done is to rewrite the data.
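with ZFS, that whole detect-and-repair cycle is a single scrub (pool
name "tank" here is just a placeholder):

```shell
# walk every in-use block, verify it against its checksum, and
# repair anything bad from the redundant copy or parity
zpool scrub tank

# report progress, how much was repaired, and any files that
# could NOT be repaired (i.e. errors in all copies)
zpool status -v tank
```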

> > http://en.wikipedia.org/wiki/ZFS#Error_rates_in_harddisks
> 
> The part that says "not visible to the host software" kind of bothers me. 

yes, that's why it's a problem, and that's why a filesystem that keeps
both multiple copies (mirroring or raid5/6-like) AND a hash of each
block is essential for detecting and correcting errors in the data.


> AFAICS these are reported via SMART and are entirely visible, with
> some exceptions of poor SMART implementations.

no.  SMART detects disk faults, not data corruption.


> > personally, i wouldn't use raid-5 (or raid-6) any more.  I'd use ZFS
> > RAID-Z (raid5 equiv) or RAID-Z2 (raid6 equiv. with 2 parity disks)
> > instead.
>
> Putting the error correction/detection in the filesystem bothers
> me.  Putting it at the block device level would benefit a lot more
> infrastructure - LVM volumes for VM's, swap partitions, etc.

having used ZFS for quite some time now, it makes perfect sense to me
for it to be in the filesystem layer rather than at the block level - it's
the file system that knows about the data, what/where it is, and whether
it's in use or not (so, faster scrubs - only need to check blocks in use
rather than all blocks).

but that's partly because ZFS blends the block layer and the fs layer in
a way that seems unusual if you're used to ext4 or xfs or pretty much
anything else except btrfs.  see below for more on this topic.

> I understand you can run those things on top of a filesystem also, but
> if you are doing this just to get the benefit of error correction then
> I think you might be doing it wrong.

Error correction is a big benefit, but it's not the only one. the
second major benefit is snapshots - fast and lightweight because ZFS
is a copy-on-write (COW) fs, so a snapshot is little more than a copy
of the block list at the time of the snapshot, with those blocks not
deleted/re-used while any snapshot still references them.
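e.g. (assuming a pool named "export"; snapshot names are made up):

```shell
zfs snapshot export@before-upgrade   # instant, takes near-zero space
zfs list -t snapshot                 # space grows only as data diverges
zfs rollback export@before-upgrade   # revert to the snapshot...
ls /export/.zfs/snapshot/            # ...or just browse old versions read-only
```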


> Actually when I was checking over this email before hitting send
> it occurred to me that maybe I'm wrong about this, knowing next to
> nothing about ZFS as I do. Is a zpool virtual device like an LVM lv,
> and I can use it for things other than running ZFS filesystems on?

yes, ZFS is like a combination of mdadm, lvm, and a filesystem.

a zpool is like an lvm volume group. e.g. you might allocate 4
drives to a raid-z array and call that pool "export". unlike LVM you
don't have to pre-allocate fixed chunks of the volume to particular uses
(e.g. filesystems or logical drives/partitions), you can dynamically
change the "allocation" as needed.

it's also like a filesystem in that you can mount that pool directly
as, say, /export (or wherever you want) and read and write data to it.
you can also create subvolumes (e.g. export/www) and mount them too.

each subvolume inherits attributes (quota, compression, de-duping,
and lots more) from the parent or can have individual attributes
different from the parent. each subvolume can also have subvolumes (e.g.
export/www/site1, export/www/site2).

each of these subvolumes is like a separate filesystem that shares in
the total pool size, and each can be snapshotted individually.

you can create new subvolumes aka filesystems on the fly as needed, or
change them (e.g. change the quota from 10G to 20G or enable compression
etc) or delete them.
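for concreteness, the whole flow looks something like this (pool and
subvolume names are the examples from above, device names are made up,
and everything needs root and ZFS installed):

```shell
zpool create export raidz sdb sdc sdd sde  # 4-drive raid-z pool, mounted at /export
zfs create export/www                      # subvolume, mounted at /export/www
zfs create export/www/site1                # subvolumes can have subvolumes
zfs set quota=20G export/www               # change attributes on the fly
zfs set compression=on export/www/site1    # children can override the parent
zfs destroy export/www/site1               # and be removed just as easily
```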

you can also create a ZVOL, which is just like a zfs subvolume except
that it appears to the system as a virtual disk - i.e. with a device
node under /dev. typical use is for xen or kvm VM images. or even swap
devices. as with subvolumes, they can have individual attributes like
compression or de-duping, and they can also be resized if needed (resize
the zvol on the zfs host, and then inside the VM run xfs_growfs or
resize2fs so that the filesystem recognises the extra capacity).  ZVOLs
can also be snapshotted just like subvolumes.
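a sketch of the zvol lifecycle (names are made up; the guest-side
resize tool depends on which filesystem the VM is running):

```shell
zfs create -V 20G export/vm1-disk    # zvol appears as /dev/zvol/export/vm1-disk
zfs set volsize=30G export/vm1-disk  # grow it later...
# ...then, inside the VM, grow the filesystem to match, e.g.
#   resize2fs /dev/xvda   (ext4)   or   xfs_growfs /mountpoint   (xfs)
zfs snapshot export/vm1-disk@clean   # snapshots work just like subvolumes
```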

they can also be exported as iscsi targets, so you can, e.g., easily
serve disk images to your VM compute nodes.


in short: a subvolume is like a subdirectory or mount-point, while a
ZVOL is like a disk image or partition (incl. an LVM partition)


BTW, some of what i've written above isn't strictly accurate....i've
tried to translate ZFS concepts into terms that should be familiar to
someone who has worked with mdadm and LVM. as an analogy, i've done
reasonably well i think. a technological pedant would probably find
much to complain about. i'm more interested in having what i write be
understood than in having it perfectly correct.


> Despite my reservations mentioned above, ZFS is still on my (long)
> list of things to look into and learn about, more so given that you
> say it is now considered stable :)

it's definitely worth experimenting with on some spare hardware - but be
warned, you will almost certainly want to convert appropriate production
systems from mdadm+lvm to ZFS asap once you start playing with it.

i got hooked on the idea of what ZFS is doing by experimenting with
btrfs. btrfs has a lot of similar ideas, but the implementation (aside
from having different goals) is many years behind ZFS. I persevered
with btrfs for a while because it was in the mainline kernel and didn't
require any stuffing around installing third-party code (zfs) that would
never get in the mainline kernel.

i lost my btrfs array (fortunately only a /backup mount, so not
irreplaceable) one too many times and switched to ZFS. it is everything i
ever wanted in a filesystem and volume management - it replaces mdadm,
lvm2, and the XFS and/or ext4 i was previously using.

With the dkms module packages, it isn't even hard to install or use
these days (add the debian wheezy zfs repo and apt-get install it).


craig

-- 
craig sanders <[email protected]>
_______________________________________________
luv-main mailing list
[email protected]
http://lists.luv.asn.au/listinfo/luv-main