On Jan 25, 2011, at 5:08 PM, stu24m...@yahoo.com wrote:

> I don't think, as a recovery strategy, RAID scales to large amounts of data. 
> Even as some kind of attached storage device (e.g. Vtrack), you're only 
> talking about a few terabytes of data, and it doesn't tolerate node failure.

When talking about large amounts of data, 3x redundancy absolutely doesn't 
scale: nobody is going to pay for 3 PB of disk when they only need 1 PB of 
data.  This is where dedicated high-end RAID systems come in (this is in fact 
what my company, Xyratex, builds): redundant controllers, battery backup, and 
so on.  The incremental cost of an additional drive in such a system is 
negligible.
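
To put rough numbers on that (back-of-the-envelope only; the 8-data + 2-parity 
RAID 6 geometry below is just an illustrative assumption):

    # Raw disk needed to hold 1 PB of user data
    usable_pb = 1.0

    # HDFS-style 3x replication: every block stored three times
    hdfs_raw = usable_pb * 3               # 3.00 PB raw

    # RAID 6 in 10-drive sets: 8 data + 2 parity per set
    raid6_raw = usable_pb * 10 / 8         # 1.25 PB raw

    print(f"3x replication: {hdfs_raw:.2f} PB raw")
    print(f"RAID 6 (8+2):   {raid6_raw:.2f} PB raw")

At petabyte scale, that difference is measured in racks of drives.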

> 
> A key part of hdfs is the distributed part.

Granted, single-point-of-failure arguments are valid when concentrating all the 
storage together, but they can generally be dealt with using hardware and 
software failover techniques.

The scale argument in my mind is exactly reversed -- HDFS works fine for 
smaller installations that can't afford dedicated RAID hardware and redundant 
access paths, and where buying 30 drives instead of 10 is an acceptable price 
for the simplicity of an HDFS setup.
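
Worth noting on that cost point: the 3x factor isn't hardwired.  It's the 
standard dfs.replication setting in hdfs-site.xml (HDFS ships with a default 
of 3), so a small site willing to trade some safety for disk can dial it down:

    <configuration>
      <!-- Default replication factor for new files; the HDFS default is 3 -->
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
    </configuration>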

> 
> Best,
> -stu
> -----Original Message-----
> From: Nathan Rutman <nrut...@gmail.com>
> Date: Tue, 25 Jan 2011 16:32:07 
> To: <hdfs-user@hadoop.apache.org>
> Reply-To: hdfs-user@hadoop.apache.org
> Subject: Re: HDFS without Hadoop: Why?
> 
> 
> On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
> 
>> Hi,
>> 
>> Why would 3x data seem wasteful? 
>> This is exactly what you want.  I would never store any serious business 
>> data without some form of replication.
> 
> I agree that you want data backup, but 3x replication is the least efficient 
> / most expensive (space-wise) way to do it.  This is what RAID was invented 
> for: RAID 6 gives you fault tolerance against the loss of any two drives, for 
> only 20% disk space overhead.  (Sorry, I see I forgot to note this in my 
> original email, but that's what I had in mind.)  RAID is not necessarily 
> expensive in dollars either; Linux MD RAID is free and effective.
> 
>> What happens if you store a single file on a single server without replicas 
>> and that server goes down, or just the disk that the file is on fails?  HDFS, 
>> like any decent distributed file system, uses replication to prevent data 
>> loss.  As a side effect, having replicas of the same piece of data on 
>> separate servers means that more than one task can work on that data in 
>> parallel.
> 
> Indeed, replicated data does mean Hadoop could work on the same block on 
> separate nodes.  But outside of Hadoop compute jobs, I don't think this is 
> useful in general.  And in any case, a distributed filesystem would let you 
> work on the same block of data from however many nodes you wanted.
> 
> 
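
(Footnote on the Linux MD RAID comment above: a software RAID 6 set really is 
cheap to stand up.  A sketch, assuming ten spare drives at /dev/sdb through 
/dev/sdk -- the device names are placeholders:

    # Create a 10-drive RAID 6 array: 8 drives of capacity, 2 of parity
    mdadm --create /dev/md0 --level=6 --raid-devices=10 /dev/sd[b-k]

    # Format and mount it like any other block device
    mkfs.ext4 /dev/md0

No battery-backed controller, obviously, but it covers the same 
two-drive-failure case.)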
