Brian Bockelman wrote:
Hey Alex,

In order to lower cost, you'll probably want to order the worker nodes without hard drives and then buy the drives separately. HDFS provides a software-level RAID, so most of the reasons for buying hard drives from Dell/HP are irrelevant; you are just paying an extra $400 per hard drive. I know Dell sells the R410, which has 4 SATA bays; I'm sure Steve knows an HP model that has something similar.

I will start with the official disclaimer "I make no recommendations about hardware" here, so as not to get into trouble. Talk to your reseller or account team.

You can get servers with lots of drives in them; DL180 and SL170z are model names that spring to mind.
Big issues to consider:
 * server:CPU ratio
 * power budget
 * rack weight
 * do you ever plan to stick in more CPUs? Some systems take this, others don't.
 * Intel vs AMD
 * How much ECC RAM can you afford? And yes, it must be ECC.

Server disks are higher RPM and specced for more hours than "consumer" disks. I don't know what that means in terms of lifespan, but the RPM translates into bandwidth off the disk.


However, BE VERY CAREFUL when you do this. From experience, a certain large manufacturer (I don't know about Dell/HP) will refuse to ship (or sell separately) hard drive trays if you order their machine without hard drives. When this happened to us, we were not able to return the machines because they were custom orders. Eventually, we had to get someone to go to the machine shop and build 72 hard drive trays for us.

That is what physics PhD students are for; at least they didn't get a lifetime's radiation dose for this job.


Worst. Experience. Ever.

So, ALWAYS ASK and make sure that you can buy empty hard drive trays for that specific model (or at least that it ships with them).

Brian


On Oct 15, 2009, at 10:48 AM, Alex Newman wrote:

So my company is looking at only using Dell or HP for our
Hadoop cluster and a Sun Thumper to back up the data. The prices are
OK after a 40% discount, but realistically I am paying twice as much
as if I went to Silicon Mechanics, and for a much slower machine. It
seems as though the big expense is the disks. Even with a 40%
discount, $550 per 1 TB disk seems crazy expensive. Also, they are
pushing me to build a smaller cluster (6 nodes) and I am pushing back
for nodes half the size but twice as many of them. So how much of a
performance difference can I expect between 12 nodes, each with 1 Xeon
5 series at 2.26 GHz, 8 GB of RAM and 4 1 TB disks, and a 6-node
cluster with 2 Xeon 5 series at 2.26 GHz, 16 GB of RAM and 8 1 TB
disks per node? Both setups will also have 2 very small SATA drives in
RAID 1 for the OS. I will be doing some stuff with Hadoop and a lot of
stuff with HBase. What are the considerations for HDFS performance
with a low number of nodes, etc.?
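
To put rough numbers on the disk pricing discussed in this thread, here is a minimal sketch of the arithmetic. The $550 per disk and the roughly $400 per-disk vendor premium are the figures quoted in the messages above; the function name and the idea of applying the premium to every data disk are assumptions made purely for illustration, and the small OS drives are ignored.

```python
# Figures quoted in this thread; treat them as estimates, not price quotes.
DISCOUNTED_PRICE_PER_DISK = 550  # USD per 1 TB disk after the 40% discount
VENDOR_PREMIUM_PER_DISK = 400    # USD extra paid per vendor-supplied disk


def disk_costs(nodes, disks_per_node):
    """Return data-disk count and totals for one cluster layout (hypothetical helper)."""
    disks = nodes * disks_per_node
    return {
        "disks": disks,
        "vendor_total": disks * DISCOUNTED_PRICE_PER_DISK,
        "premium_total": disks * VENDOR_PREMIUM_PER_DISK,
    }


# The two layouts under discussion: 12 nodes x 4 disks vs 6 nodes x 8 disks.
for name, nodes, disks_per_node in [("12 small nodes", 12, 4), ("6 big nodes", 6, 8)]:
    print(name, disk_costs(nodes, disks_per_node))
```

Both layouts use the same number of data disks, so under these assumptions the disk bill is identical either way; the choice between them comes down to the per-node overheads and the operational trade-offs.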



It's an interesting Q as to what is better, fewer nodes with more storage/CPU or more, smaller nodes.

Bigger servers
 * more chance of running code near the data
 * less data moved over the LAN at shuffle time
 * RAM consumption can be more agile across tasks.
 * increased chance of disk failure on a node; Hadoop handles that very badly right now (pre-0.20, the datanode goes offline)

Smaller servers
 * easier to place data redundantly across machines
 * less RAM taken up by other people's jobs
 * more nodes stay up when a disk fails (less important on 0.20 onwards)
 * when a node goes down, less data to re-replicate across the other machines
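
The re-replication point can be made concrete with a back-of-the-envelope sketch for the two layouts in the question. This is only an illustration: the replication factor of 3 (the HDFS default), the 50% disk fill level, the even spread of re-replication work across surviving nodes, and the function name are all assumptions, not measurements from any real cluster.

```python
REPLICATION = 3  # assumed: HDFS default replication factor


def cluster_summary(nodes, disks_per_node, tb_per_disk=1.0, fill=0.5):
    """Rough capacity and node-loss figures for one layout (hypothetical helper).

    fill is the assumed fraction of each data disk holding HDFS blocks.
    """
    raw_tb = nodes * disks_per_node * tb_per_disk
    stored_per_node_tb = disks_per_node * tb_per_disk * fill
    return {
        "raw_tb": raw_tb,
        # Each block is stored REPLICATION times, so unique data is raw / REPLICATION.
        "usable_tb": raw_tb / REPLICATION,
        # Losing a node loses one replica of every block it held.
        "rereplicate_tb_on_node_loss": stored_per_node_tb,
        # Assumes the copying is spread evenly over the surviving nodes.
        "tb_per_survivor": stored_per_node_tb / (nodes - 1),
        "capacity_lost_fraction": 1 / nodes,
    }


small = cluster_summary(nodes=12, disks_per_node=4)
big = cluster_summary(nodes=6, disks_per_node=8)
print("12 small nodes:", small)
print("6 big nodes:   ", big)
```

Under these assumptions both layouts have the same raw capacity, but a node failure in the 6-node cluster leaves twice as much data to re-replicate and fewer than half as many machines to share the work.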

1. I would like to hear other people's opinions,

2. The gridmix 2 benchmarking stuff tries to create synthetic benchmarks from your real data runs. Try that, collect some data, then go to your suppliers.

-Steve

COI disclaimer signature:
-----------------------
Hewlett-Packard Limited
Registered Office: Cain Road, Bracknell, Berks RG12 1HN
Registered No: 690597 England
