On 2/19/2014 4:52 PM, Andrew Hume wrote:
an almost counter-intuitive finding.

http://cloud.media.seagate.com/2014/02/18/when-is-my-data-too-big-for-a-raid-storage-solution/?utm_source=linkedin&utm_medium=social&utm_content=Oktopost-LinkedIn-Group&utm_campaign=%28Oktopost%29Feb+2014



This article ignores (purposefully? unintentionally?) some current trends that are highly relevant.

1) The trend towards decoupled (declustered) RAID. There is nothing that says you have to RAID across a full drive. Recent systems (Isilon, GPFS GNR) divide the disks into chunks and 'protect' (Reed-Solomon, XOR, erasure coding, or other encoding) across those chunks. When a disk fails, its chunks get rebuilt in parallel across <n> other disks, so rebuild time is no longer proportional to the size of the failed drive; the drive's capacity becomes largely irrelevant. Rebuilding all of the chunks that happened to live on that disk typically takes 5-10 minutes. There's no added overhead, because the overall amount of data protection is the same.
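
To get a feel for why the parallel rebuild matters, here's a rough back-of-envelope comparison (a Python sketch; the capacity, per-disk rebuild rate, and peer-disk count are made-up illustrative numbers, not measurements from any of these systems):

  # Back-of-envelope rebuild-time comparison; all numbers are illustrative.
  disk_size_tb = 4.0            # capacity of the failed disk
  rebuild_mb_per_s = 100.0      # sustained rebuild rate one disk can absorb
  peer_disks = 100              # disks sharing the rebuild in the decoupled case

  disk_size_mb = disk_size_tb * 1e6

  # Traditional RAID: one spare drive absorbs the entire rebuild.
  traditional_hours = disk_size_mb / rebuild_mb_per_s / 3600

  # Decoupled RAID: the failed disk's chunks rebuild in parallel across peers.
  decoupled_minutes = disk_size_mb / (rebuild_mb_per_s * peer_disks) / 60

  print(f"whole-drive rebuild:      {traditional_hours:.1f} hours")
  print(f"chunked parallel rebuild: {decoupled_minutes:.1f} minutes")

With those made-up numbers the whole-drive rebuild takes on the order of 11 hours, while the chunked parallel rebuild finishes in a handful of minutes, the same order of magnitude as the 5-10 minutes quoted above.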

2) The trend towards data verification that doesn't wait until a drive fails to discover that data blocks elsewhere have also gone bad (the simultaneous-failure case). This includes ZFS and Btrfs integrity checksums plus scrubbing, erasure coding in GPFS et al., and N+M RAID. By doing periodic scrubs, you don't wait for the single failure to discover the double: you proactively read back all of the blocks and remap suspect ones early. (You also get the benefit of finding on-disk bit flips if you have that capability. It does happen; we see 1-2 per year over 1PB of disk.)
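
For concreteness, here is roughly what a scrub pass amounts to, as a minimal Python sketch. This is not ZFS's or Btrfs's actual implementation; read_block, stored_checksums, and repair_block are hypothetical hooks standing in for the filesystem's own block I/O, checksum store, and redundancy-based repair:

  import hashlib

  def scrub(read_block, stored_checksums, repair_block):
      """Read every block, verify its stored checksum, and repair mismatches
      from redundant data (parity or a good copy) before a second failure
      can make them unrecoverable."""
      suspect = []
      for i, expected in enumerate(stored_checksums):
          data = read_block(i)
          if hashlib.sha256(data).hexdigest() != expected:
              suspect.append(i)      # latent (silent) error found early
              repair_block(i)        # rewrite/remap from parity or a good copy
      return suspect

Run periodically across the whole pool, this is what turns a would-be double failure during a rebuild into an ordinary, repairable single error.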

So, to some extent the article sets up a straw man just to knock it down, but it seems suspiciously unaware of the state of the art.

Replication, to me, solves a slightly different problem. Its big application is protection against infrastructure loss (datacenter power loss, network loss, fire, etc.). It doesn't save you from bit flips without something else on top, and it DEFINITELY doesn't protect you from the GIGO problem (garbage replicates just as effectively as data). Also, it doesn't weed out failing disks that haven't been read in a while the way scrubbing would.

What happens in replication when your primary datacenter goes offline? All of a sudden there's a big uptick in reads from your secondary, and "oh no", you have a disk failure from the load spike (the classic single-disk RAID scenario). You still want RAID on your replicated data because you still have the same problems there (or the same lack of problems, with respect to decoupled RAID, erasure coding, scrubbing, etc.).

Yes, on the surface, RAID is not enough. But to paraphrase the old joke:
patient: "hey doc, it hurts when I close the door on my foot"
doc: "well, don't do that".


