David,

As I understand it, you will theoretically get better performance from a
JBOD configuration than a RAID configuration.  In a RAID configuration,
you have to wait for the slowest disk in the array to complete before
the entire IO operation can complete, making the average IO time
equivalent to that of the slowest disk.  In a JBOD configuration,
operations on the faster disks complete independently of the slowest
disk, making the average IO time for the node faster than that of the
slowest disk (unless all disks are equally slow).
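As a toy illustration of that averaging argument (the per-disk times below are made-up numbers, not benchmarks):

```python
# Hypothetical average IO times (ms) for three disks; made-up numbers.
disk_times_ms = [8.0, 10.0, 12.0]

# RAID 0: each IO is striped across all disks, so it completes only
# when the slowest disk does.
raid0_io_ms = max(disk_times_ms)

# JBOD: each IO hits a single disk, so over many independent IOs the
# node-wide average is the mean of the per-disk times.
jbod_avg_io_ms = sum(disk_times_ms) / len(disk_times_ms)

print(raid0_io_ms)     # 12.0
print(jbod_avg_io_ms)  # 10.0
```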

Whether it would be a noticeable gain is questionable, though.  I doubt
it would make enough difference to provide a good reason to depart from
whichever you feel is easiest to manage.

And you don't need the redundancy of RAID, since HDFS does that using
replication between nodes, so there's no loss there.
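For what it's worth, the JBOD layout Jason describes below would look something like this in your site config, assuming the three HDFS disks are mounted at /data/1 through /data/3 (the mount points are just an example, not a recommendation):

```xml
<property>
  <name>dfs.data.dir</name>
  <!-- Comma-delimited list; the datanode spreads blocks across all of
       these directories.  Mount points below are examples only. -->
  <value>/data/1/dfs/data,/data/2/dfs/data,/data/3/dfs/data</value>
</property>
```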

Brian

David B. Ritch wrote:
> Thank you - yes, I'm fairly confident that it will work either way.  I'm
> trying to find out whether there is an established best practice, and
> the performance impact of the decision between RAID 0 and JBOD.
> I'll check out the noatime and nodiratime for their effect on our
> performance - thanks for that suggestion, as well.
> 
> David
> Jason Venner wrote:
>> If you set your dfs.data.dir to a comma-separated list of
>> directories, you will do fine.
>>
>> <property>
>>  <name>dfs.data.dir</name>
>>  <value>${hadoop.tmp.dir}/dfs/data</value>
>>  <description>Determines where on the local filesystem a DFS data node
>>  should store its blocks.  If this is a comma-delimited
>>  list of directories, then data will be stored in all named
>>  directories, typically on different devices.
>>  Directories that do not exist are ignored.
>>  </description>
>> </property>
>>
>> The namenode does a lot of small writes, so RAID 1 or RAID 10 is
>> better there.
>>
>> Also, mounting the file systems holding the dfs.data.dir directories
>> with noatime and nodiratime makes a significant performance difference.
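(For example, the corresponding /etc/fstab entries might look like the following; the device names and mount points here are placeholders, not a recommendation:)

```
/dev/sdb1  /data/1  ext3  defaults,noatime,nodiratime  0  2
/dev/sdc1  /data/2  ext3  defaults,noatime,nodiratime  0  2
/dev/sdd1  /data/3  ext3  defaults,noatime,nodiratime  0  2
```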
>>
>> David B. Ritch wrote:
>>> How well does Hadoop handle multiple independent disks per node?
>>>
>>> I have a cluster with 4 identical disks per node.  I plan to use one
>>> disk for OS and temporary storage, and dedicate the other three to
>>> HDFS.  Our IT folks have some disagreement as to whether the three disks
>>> should be striped, or treated by HDFS as three independent disks.  Could
>>> someone with more HDFS experience comment on the relative advantages and
>>> disadvantages to each approach?
>>>
>>> Here are some of my thoughts.  It's a bit easier to manage a 3-disk
>>> striped partition, and we wouldn't have to worry about balancing files
>>> between them.  Single-file I/O should be considerably faster.  On the
>>> other hand, I would expect typical use to require multiple file reads
>>> or writes simultaneously.  I would expect Hadoop to be able to manage
>>> read/write to/from the disks independently.  Managing 3 streams to 3
>>> independent devices would likely result in less disk head movement, and
>>> therefore better performance.  I would expect Hadoop to be able to
>>> balance load between the disks fairly well.  Availability doesn't really
>>> differentiate between the two approaches - if a single disk dies, the
>>> striped array would go down, but all its data should be replicated on
>>> another datanode, anyway.  And besides, I understand that the datanode
>>> will shut down, even if only one of its 3 independent disks crashes.
>>>
>>> So - any one want to agree or disagree with these thoughts?  Anyone have
>>> any other ideas, or - better - benchmarks and experience with layouts
>>> like these two?
>>>
>>> Thanks!
>>>
>>> David
>>>   
> 
