Hi Nathan,

We are using HDFS-RAID on our 30 PB cluster. Most datasets have an effective replication factor of 2.2, and a few datasets have an effective replication factor of 1.4.
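Roughly, those factors fall out of the erasure-coding layout. Here is a back-of-the-envelope sketch (the stripe lengths and parity replication values below are illustrative assumptions, not the exact cluster settings):

    # Effective replication = source replicas + parity replicas amortized
    # over the stripe. Assumed layouts (illustrative only): XOR parity over
    # 10-block stripes with the parity block kept at replication 2, and
    # Reed-Solomon (10,4) parity kept at replication 1.
    def effective_replication(data_rep, stripe_len, parity_len, parity_rep):
        return data_rep + parity_rep * parity_len / float(stripe_len)

    print(effective_replication(2, 10, 1, 2))  # XOR raid          -> 2.2
    print(effective_replication(1, 10, 4, 1))  # Reed-Solomon raid -> 1.4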
Some details here:
http://wiki.apache.org/hadoop/HDFS-RAID
http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html

thanks,
dhruba

On Tue, Jan 25, 2011 at 7:58 PM, <stu24m...@yahoo.com> wrote:
> My point was it's not RAID or whatever versus HDFS. HDFS is a distributed file system that solves different problems.
>
> HDFS is a file system. It's like asking "NTFS or RAID?"
>
> > but can be generally dealt with using hardware and software failover techniques.
>
> Like hdfs.
>
> Best,
> -stu
>
> -----Original Message-----
> From: Nathan Rutman <nrut...@gmail.com>
> Date: Tue, 25 Jan 2011 17:31:25
> To: <hdfs-user@hadoop.apache.org>
> Reply-To: hdfs-user@hadoop.apache.org
> Subject: Re: HDFS without Hadoop: Why?
>
> On Jan 25, 2011, at 5:08 PM, stu24m...@yahoo.com wrote:
>
> > I don't think, as a recovery strategy, RAID scales to large amounts of data. Even as some kind of attached storage device (e.g. Vtrack), you're only talking about a few terabytes of data, and it doesn't tolerate node failure.
>
> When talking about large amounts of data, 3x redundancy absolutely doesn't scale. Nobody is going to pay for 3 petabytes worth of disk if they only need 1 PB worth of data. This is where dedicated high-end RAID systems come in (this is in fact what my company, Xyratex, builds): redundant controllers, battery backup, etc. The incremental cost for an additional drive in such systems is negligible.
>
> > A key part of hdfs is the distributed part.
>
> Granted, single-point-of-failure arguments are valid when concentrating all the storage together, but can be generally dealt with using hardware and software failover techniques.
>
> The scale argument in my mind is exactly reversed -- HDFS works fine for smaller installations that can't afford RAID hardware overhead and access redundancy, and where buying 30 drives instead of 10 is an acceptable cost for the simplicity of HDFS setup.
>
> > Best,
> > -stu
> >
> > -----Original Message-----
> > From: Nathan Rutman <nrut...@gmail.com>
> > Date: Tue, 25 Jan 2011 16:32:07
> > To: <hdfs-user@hadoop.apache.org>
> > Reply-To: hdfs-user@hadoop.apache.org
> > Subject: Re: HDFS without Hadoop: Why?
> >
> > On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
> >
> >> Hi,
> >>
> >> Why would 3x data seem wasteful?
> >> This is exactly what you want. I would never store any serious business data without some form of replication.
> >
> > I agree that you want data backup, but 3x replication is the least efficient / most expensive (space-wise) way to do it. This is what RAID was invented for: RAID 6 gives you fault tolerance against loss of any two drives, for only 20% disk space overhead. (Sorry, I see I forgot to note this in my original email, but that's what I had in mind.) RAID is also not necessarily expensive in dollars, either; Linux MD RAID is free and effective.
> >
> >> What happens if you store a single file on a single server without replicas and that server goes, or just the disk that the file is on goes? HDFS and any decent distributed file system uses replication to prevent data loss. As a side effect, having the same replica of a data piece on separate servers means that more than one task can work on it in parallel.
> >
> > Indeed, replicated data does mean Hadoop could work on the same block on separate nodes. But outside of Hadoop compute jobs, I don't think this is useful in general. And in any case, a distributed filesystem would let you work on the same block of data from however many nodes you wanted.

--
Connect to me at http://www.facebook.com/dhruba