On 2/19/2014 4:52 PM, Andrew Hume wrote:
> an almost counter-intuitive finding.
> http://cloud.media.seagate.com/2014/02/18/when-is-my-data-too-big-for-a-raid-storage-solution/?utm_source=linkedin&utm_medium=social&utm_content=Oktopost-LinkedIn-Group&utm_campaign=%28Oktopost%29Feb+2014
This article ignores (purposefully? unintentionally?) some current
trends that are highly relevant.
1) The trend towards decoupled (declustered) RAID. There is nothing that
says you have to RAID across a full drive. Recent systems (Isilon, GPFS
GNR) divide the disks into chunks and 'protect' (Reed-Solomon, XOR,
erasure, or other encoding) across them. When a disk fails, those chunks
get rebuilt in parallel across <n> other disks. Thus, the rebuild time
after a single disk failure is not proportional to the size of the disk,
and the size of the disk is irrelevant to recovery: 5-10 minutes to
rebuild all of the chunks that happened to be on that disk is typical.
There's no increase in overhead, because the overall amount of data
protection is the same.
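Here's a rough back-of-envelope sketch, in Python, of why that parallel
rebuild is fast. The pool size, chunk size, and stripe width below are
invented for illustration; real Isilon and GPFS GNR layouts are far more
sophisticated than random placement.

    # Toy model of declustered RAID rebuild (all numbers invented).
    import random
    from collections import Counter

    N_DISKS      = 60        # disks in the pool
    DISK_TB      = 4         # raw capacity per disk
    CHUNK_MB     = 1024      # chunk ("strip") size
    STRIPE_WIDTH = 10        # e.g. 8 data + 2 parity per protection group

    chunks_per_disk = DISK_TB * 1024 * 1024 // CHUNK_MB
    n_groups = N_DISKS * chunks_per_disk // STRIPE_WIDTH

    # Spread each protection group across a random set of STRIPE_WIDTH disks.
    random.seed(1)
    placement = [random.sample(range(N_DISKS), STRIPE_WIDTH)
                 for _ in range(n_groups)]

    # Disk 0 dies: every group that had a chunk there is rebuilt by reading
    # the group's surviving chunks, which live on many different disks.
    failed = 0
    affected = [g for g in placement if failed in g]
    source_load = Counter(d for g in affected for d in g if d != failed)

    print("protection groups to rebuild:", len(affected))
    print("surviving disks sharing the rebuild:", len(source_load),
          "of", N_DISKS - 1)

Each surviving disk sources only a small slice of the reads, and the
rebuilt chunks are written to spare space spread across the pool, so the
rebuild is bounded by the pool's aggregate bandwidth rather than by one
spare disk.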
2) The trend towards data verification that doesn't wait until a drive
fails to discover that data blocks elsewhere are also unreadable and
would turn into a simultaneous second failure during rebuild. This
includes ZFS and Btrfs integrity checksums plus scrubbing, erasure coding
for GPFS et al., and N+M RAID. By doing periodic scrubs, you don't wait
until the single failure to experience the double; you proactively read
back all of the blocks and remap suspect ones early. (You also get the
benefit of finding on-disk bit flips if you have that capability. It does
happen: we see 1-2 per year over 1PB of disk.)
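For concreteness, here is a minimal sketch of what a scrub pass amounts
to. Real scrubs in ZFS, Btrfs, or GPFS work against their own per-block
checksum metadata and repair bad blocks from parity or replicas; the
device path, block size, and repair hook below are made up.

    # Minimal checksum-verifying scrub pass (illustrative only).
    import hashlib

    BLOCK_SIZE = 128 * 1024   # 128 KiB blocks, for illustration

    def scrub(device_path, expected_checksums):
        """Read every block back, compare it to the checksum recorded at
        write time, and return the block numbers that no longer verify so
        they can be rebuilt from redundancy and remapped."""
        suspect = []
        with open(device_path, "rb", buffering=0) as dev:
            block_no = 0
            while True:
                data = dev.read(BLOCK_SIZE)
                if not data:
                    break
                digest = hashlib.sha256(data).hexdigest()
                if expected_checksums.get(block_no, digest) != digest:
                    suspect.append(block_no)   # latent error or bit flip
                block_no += 1
        return suspect

    # Run periodically (e.g. weekly), long before any whole-disk failure:
    #   bad = scrub("/dev/sdX", stored_checksums)
    #   for block in bad:
    #       repair_from_redundancy(block)      # hypothetical repair hook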
So, to some extent the article sets up a straw man to knock down, but it
also seems suspiciously unaware of the state of the art.
Replication, to me, solves a slightly different problem. Its big
application is protection against infrastructure loss (datacenter power
loss, network loss, fire, etc.). It doesn't save you from bit flips
without something else, and it DEFINITELY doesn't protect you from the
GIGO problem (garbage replicates just as effectively as data). Also, it
doesn't weed out disks that haven't been read for a while, as scrubbing
would.
What happens in replication when your primary datacenter goes offline?
All of a sudden there's a big uptick in reads from your secondary, and
"oh no", you have a disk failure from the load spike (the classic
single-disk RAID scenario). You still want RAID on your replicated data,
because you still have the same problems (or lack of problems, with
decoupled RAID, erasure coding, scrubbing, etc.).
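To put rough (entirely invented) numbers on that failover spike:

    # Back-of-envelope failover arithmetic; every figure here is made up.
    primary_reads_per_sec   = 40_000   # normal load on the primary site
    secondary_reads_per_sec = 5_000    # normal load on the secondary site
    secondary_capacity      = 50_000   # what the secondary's disks can sustain

    after_failover = primary_reads_per_sec + secondary_reads_per_sec
    print(f"secondary utilization after failover: "
          f"{after_failover / secondary_capacity:.0%}")   # ~90%, up from ~10%

A sustained jump like that is exactly the kind of stress that surfaces a
marginal disk, which is why the replica still needs RAID (ideally
decoupled, scrubbed RAID) underneath it.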
Yes, on the surface, RAID is not enough. But to paraphrase the old joke:
patient: "Hey doc, it hurts when I close the door on my foot."
doc: "Well, don't do that."