Hi,

We at Yahoo ran some Hadoop benchmarking experiments on clusters configured
with JBOD and with RAID0. We found that under heavy loads (such as gridmix),
the JBOD cluster performed better.

Gridmix tests:

Load: gridmix2
Cluster size: 190 nodes
Test results:

RAID0: 75 minutes
JBOD:  67 minutes
Difference: about 10% (JBOD finished 8 minutes sooner)

Tests on HDFS write performance:

We ran map-only jobs that wrote data to DFS concurrently on the different
clusters. The overall DFS write throughput on the JBOD cluster was 30% better
(with a 58-node cluster) and 50% better (with an 18-node cluster) than on the
RAID0 cluster.

To understand why, we did some file-level benchmarking on both clusters.
We found that file write throughput on a JBOD machine is 30% higher than
on a comparable machine with RAID0. This difference may be explained by the
fact that throughput can vary by 30% to 50% from one disk to another. With
such variation, the overall throughput of a RAID0 array may be bottlenecked
by the slowest disk.
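
A rough illustration of the bottleneck argument, as a minimal Python sketch.
It assumes a deliberately simplified model: a RAID0 array advances at the pace
of its slowest member, while JBOD lets each disk serve its own stream at its
own speed. The function names and the per-disk speeds are made up for
illustration; they are not measurements from these clusters.

```python
def raid0_throughput(disk_mbps):
    """Simplified model (assumption): every stripe spans all disks, so the
    array moves at the pace of the slowest disk times the number of disks."""
    return len(disk_mbps) * min(disk_mbps)


def jbod_throughput(disk_mbps):
    """Simplified model (assumption): each disk serves an independent stream
    at its own speed, so the aggregate is the sum of the per-disk speeds."""
    return sum(disk_mbps)


# Hypothetical per-disk sequential write speeds in MB/s (not measured data).
disks = [100.0, 80.0, 60.0]

raid0 = raid0_throughput(disks)
jbod = jbod_throughput(disks)
advantage = (jbod - raid0) / raid0

print(f"RAID0: {raid0:.0f} MB/s, JBOD: {jbod:.0f} MB/s, "
      f"JBOD advantage: {advantage:.0%}")
```

In this model the gap grows with the spread between the fastest and slowest
disk and disappears when all disks run at the same speed.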
 

-- Runping





On 1/11/09 1:23 PM, "David B. Ritch" <david.ri...@gmail.com> wrote:

> How well does Hadoop handle multiple independent disks per node?
> 
> I have a cluster with 4 identical disks per node.  I plan to use one
> disk for OS and temporary storage, and dedicate the other three to
> HDFS.  Our IT folks have some disagreement as to whether the three disks
> should be striped, or treated by HDFS as three independent disks.  Could
> someone with more HDFS experience comment on the relative advantages and
> disadvantages to each approach?
> 
> Here are some of my thoughts.  It's a bit easier to manage a 3-disk
> striped partition, and we wouldn't have to worry about balancing files
> between them.  Single-file I/O should be considerably faster.  On the
> other hand, I would expect typical use to require multiple file reads
> or writes simultaneously.  I would expect Hadoop to be able to manage
> read/write to/from the disks independently.  Managing 3 streams to 3
> independent devices would likely result in less disk head movement, and
> therefore better performance.  I would expect Hadoop to be able to
> balance load between the disks fairly well.  Availability doesn't really
> differentiate between the two approaches - if a single disk dies, the
> striped array would go down, but all its data should be replicated on
> another datanode, anyway.  And besides, I understand that datanode will
> shut down a node, even if only one of 3 independent disks crashes.
> 
> So - any one want to agree or disagree with these thoughts?  Anyone have
> any other ideas, or - better - benchmarks and experience with layouts
> like these two?
> 
> Thanks!
> 
> David
