Hi Nathan,

We are using HDFS-RAID on our 30 PB cluster. Most datasets have an effective replication factor of 2.2, and a few datasets have an effective replication factor of 1.4.
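Roughly, those factors fall out of the erasure-coding layout. Here is a back-of-the-envelope sketch (the stripe lengths and parity replication values below are illustrative assumptions, not the exact cluster settings):

    # Effective replication = source replicas + parity replicas amortized
    # over the stripe. Assumed layouts (illustrative only): XOR parity over
    # 10-block stripes with the parity block kept at replication 2, and
    # Reed-Solomon (10,4) parity kept at replication 1.
    def effective_replication(data_rep, stripe_len, parity_len, parity_rep):
        return data_rep + parity_rep * parity_len / float(stripe_len)

    print(effective_replication(2, 10, 1, 2))  # XOR raid          -> 2.2
    print(effective_replication(1, 10, 4, 1))  # Reed-Solomon raid -> 1.4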
Some details here:
http://wiki.apache.org/hadoop/HDFS-RAID
http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html

thanks,
dhruba

On Tue, Jan 25, 2011 at 7:58 PM, <stu24m...@yahoo.com> wrote:
> My point was it's not RAID or whatever versus HDFS. HDFS is a distributed file system that solves different problems.
>
> HDFS is a file system. It's like asking "NTFS or RAID?"
>
> > but can be generally dealt with using hardware and software failover techniques.
>
> Like hdfs.
>
> Best,
> -stu
>
> -----Original Message-----
> From: Nathan Rutman <nrut...@gmail.com>
> Date: Tue, 25 Jan 2011 17:31:25
> To: <hdfs-user@hadoop.apache.org>
> Reply-To: hdfs-user@hadoop.apache.org
> Subject: Re: HDFS without Hadoop: Why?
>
> On Jan 25, 2011, at 5:08 PM, stu24m...@yahoo.com wrote:
>
> > I don't think, as a recovery strategy, RAID scales to large amounts of data. Even as some kind of attached storage device (e.g. Vtrack), you're only talking about a few terabytes of data, and it doesn't tolerate node failure.
>
> When talking about large amounts of data, 3x redundancy absolutely doesn't scale. Nobody is going to pay for 3 petabytes worth of disk if they only need 1 PB worth of data. This is where dedicated high-end RAID systems come in (this is in fact what my company, Xyratex, builds): redundant controllers, battery backup, etc. The incremental cost for an additional drive in such systems is negligible.
>
> > A key part of hdfs is the distributed part.
>
> Granted, single-point-of-failure arguments are valid when concentrating all the storage together, but can be generally dealt with using hardware and software failover techniques.
>
> The scale argument in my mind is exactly reversed -- HDFS works fine for smaller installations that can't afford RAID hardware overhead and access redundancy, and where buying 30 drives instead of 10 is an acceptable cost for the simplicity of HDFS setup.
>
> > Best,
> > -stu
> >
> > -----Original Message-----
> > From: Nathan Rutman <nrut...@gmail.com>
> > Date: Tue, 25 Jan 2011 16:32:07
> > To: <hdfs-user@hadoop.apache.org>
> > Reply-To: hdfs-user@hadoop.apache.org
> > Subject: Re: HDFS without Hadoop: Why?
> >
> > On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
> >
> >> Hi,
> >>
> >> Why would 3x data seem wasteful?
> >> This is exactly what you want. I would never store any serious business data without some form of replication.
> >
> > I agree that you want data backup, but 3x replication is the least efficient / most expensive (space-wise) way to do it. This is what RAID was invented for: RAID 6 gives you fault tolerance against loss of any two drives, for only 20% disk space overhead. (Sorry, I see I forgot to note this in my original email, but that's what I had in mind.) RAID is also not necessarily expensive in dollars, either; Linux MD RAID is free and effective.
> >
> >> What happens if you store a single file on a single server without replicas and that server goes, or just the disk that the file is on goes? HDFS and any decent distributed file system uses replication to prevent data loss. As a side effect, having the same replica of a data piece on separate servers means that more than one task can work on it in parallel.
> >
> > Indeed, replicated data does mean Hadoop could work on the same block on separate nodes. But outside of Hadoop compute jobs, I don't think this is useful in general. And in any case, a distributed filesystem would let you work on the same block of data from however many nodes you wanted.

--
Connect to me at http://www.facebook.com/dhruba