I would consider a JBOD with a 16-64 MB stride. This would be my choice where one or more steps (e.g. MR) will be IO-bound. Otherwise one or more tasks will be hit with the poor read/write throughput of having large amounts of data behind a single spindle.

On Nov 12, 2014 8:37 AM, "Brian C. Huffman" <[email protected]> wrote:
> All,
>
> I'm setting up a 4-node Hadoop 2.5.1 cluster. Each node has the following
> drives:
> 1 - 500GB drive (OS disk)
> 1 - 500GB drive
> 1 - 2TB drive
> 1 - 3TB drive
>
> In past experience I've had lots of issues with non-uniform drive sizes
> for HDFS, but unfortunately it wasn't an option to get all 3TB or 2TB
> drives for this cluster.
>
> My thought is to set up the 2TB and 3TB drives as HDFS and the 500GB drive
> as intermediate data. Most of our jobs don't make heavy use of
> intermediate data, but at least this way I get a good amount of space
> (2TB) per node before I run into issues. Then I may end up using the
> AvailableSpaceVolumeChoosingPolicy to help with balancing the blocks.
>
> If necessary I could put intermediate data on one of the OS partitions
> (/home), but this doesn't seem ideal.
>
> Anybody have any recommendations regarding the optimal use of storage in
> this scenario?
>
> Thanks,
> Brian
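For reference, the AvailableSpaceVolumeChoosingPolicy Brian mentions is enabled per DataNode in hdfs-site.xml. A minimal sketch follows, assuming the 2TB and 3TB drives are mounted at /data/disk2tb and /data/disk3tb (the mount points are illustrative, not from the thread):

    <!-- hdfs-site.xml: sketch only; mount points below are assumptions -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/data/disk2tb/hdfs,/data/disk3tb/hdfs</value>
    </property>
    <property>
      <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
      <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
    </property>
    <!-- Volumes within this many bytes of free space (10 GB here) are
         treated as equal; beyond that, new replicas favor the emptier
         volume with the given probability. Both values shown are the
         Hadoop 2.x defaults. -->
    <property>
      <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
      <value>10737418240</value>
    </property>
    <property>
      <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
      <value>0.75</value>
    </property>

And, assuming the spare 500GB drive is mounted at /data/disk500gb (again an assumption), the intermediate/shuffle data can be directed there in yarn-site.xml:

    <!-- yarn-site.xml: put NodeManager local (intermediate) data on the
         spare 500GB drive; mount point is an assumption -->
    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/data/disk500gb/yarn/local</value>
    </property>

Raising the preference fraction above 0.75 pushes more new blocks onto the emptier (here, larger) volumes, at the cost of concentrating write load on fewer spindles.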
