Brian Bockelman wrote:
Hey Alex,
In order to lower cost, you'll probably want to order the worker nodes
without hard drives, then buy the drives separately. HDFS provides a
software-level RAID, so most of the reasons for buying hard drives
from Dell/HP are irrelevant - you are just paying an extra $400 per hard
drive. I know Dell sells the R410 which has 4 SATA bays; I'm sure Steve
knows an HP model that has something similar.
I will start the official disclaimer "I make no recommendations about
hardware" here, so as not to get into trouble. Talk to your reseller or
account team.
You can get servers with lots of drives in them; DL180 and SL170z are
acronyms that spring to mind.
Big issues to consider:
* server:CPU ratio
* power budget
* rack weight
* do you ever plan to stick in more CPUs? Some systems take this,
others don't.
* Intel vs AMD
* How much ECC RAM can you afford? And yes, it must be ECC.
Server disks are higher RPM and specced for more hours than "consumer"
disks. I don't know what that means in terms of lifespan, but the RPM
translates into bandwidth off the disk.
However, BE VERY CAREFUL when you do this. From experience, a certain
large manufacturer (I don't know about Dell/HP) will refuse to ship (or
sell separately) hard drive trays if you order their machine without
hard drives. When this happened to us, we were not able to return the
machines because they were custom orders. Eventually, we had to get
someone to go to the machine shop and build 72 hard drive trays for us.
That is what physics PhD students are for; at least they didn't get a
lifetime's radiation dose for this job.
Worst. Experience. Ever.
So, ALWAYS ASK and make sure that you can buy empty hard drive trays for
that specific model (or at least that it ships with them).
Brian
On Oct 15, 2009, at 10:48 AM, Alex Newman wrote:
So my company is looking at only using Dell or HP for our
Hadoop cluster and a Sun Thumper to back up the data. The prices are
OK, after a 40% discount, but realistically I am paying twice as much
as if I went to Silicon Mechanics, and with a much slower machine. It
seems as though the big expense is the disks. Even with a 40%
discount, $550 per 1 TB disk seems crazy expensive. Also, they are
pushing me to build a smaller cluster (6 nodes) and I am pushing back
for nodes half the size but having twice as many. So how much of a
performance difference can I expect between 12 nodes, each with 1 Xeon 5
series running at 2.26 GHz, 8 GB of RAM and 4 1 TB disks, and a 6 node
cluster with 2 Xeon 5 series running at 2.26 GHz, 16 GB of RAM and 8 1
TB disks per node? Both setups will also have 2 very small SATA drives
in RAID 1 for the OS. I will be doing some stuff with Hadoop and a lot of
stuff with HBase. What are the considerations with HDFS performance
with a low number of nodes, etc.?
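For reference, the two proposals can be totalled up with a few lines of Python. This is only a sketch of the arithmetic in the question above: the class and field names are made up for illustration, the disk price is the $550 figure quoted above, and it says nothing about actual performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterOption:
    """One of the two proposed layouts (names are illustrative only)."""
    name: str
    nodes: int
    cpus_per_node: int
    ram_gb_per_node: int
    data_disks_per_node: int
    disk_tb: float = 1.0  # 1 TB data disks, as in the question

    def totals(self) -> dict:
        disks = self.nodes * self.data_disks_per_node
        return {
            "cpu_sockets": self.nodes * self.cpus_per_node,
            "ram_gb": self.nodes * self.ram_gb_per_node,
            "data_disks": disks,
            "raw_tb": disks * self.disk_tb,
        }


SMALL = ClusterOption("12 small nodes", 12, 1, 8, 4)
BIG = ClusterOption("6 big nodes", 6, 2, 16, 8)

DISK_PRICE_USD = 550  # quoted price per 1 TB disk after the discount

for option in (SMALL, BIG):
    t = option.totals()
    disk_cost = t["data_disks"] * DISK_PRICE_USD
    print(option.name, t, "disk cost USD:", disk_cost)
```

Both layouts come out with the same aggregate CPU sockets, RAM, disk count and raw capacity, so the choice is about how those resources are divided between machines, not about the totals.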
It's an interesting Q as to what is better, fewer nodes with more
storage/CPU or more, smaller nodes.
Bigger servers
* more chance of running code near the data
* less data moved over the LAN at shuffle time
* RAM consumption can be more agile across tasks.
* increased chance of disk failure on a node; Hadoop handles that very
badly right now (pre 0.20, the datanode goes offline)
Smaller servers
* easier to place data redundantly across machines
* less RAM taken up by other people's jobs
* more nodes stay up when a disk fails (less important on 0.20 onwards)
* when a node goes down, less data to re-replicate across the other
machines
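To put a rough number on the re-replication point, here is a back-of-envelope sketch in Python for the two layouts in the original question. It assumes the worst case (every data disk on the dead node was full) and that the re-replication work is spread evenly over the surviving nodes; neither is guaranteed in practice, and the function name is made up for illustration.

```python
def rereplication_load(nodes, data_tb_per_node, fill_fraction=1.0):
    """Return (TB lost with one node, TB each surviving node must absorb).

    Assumes the lost blocks are re-replicated evenly across the survivors;
    fill_fraction=1.0 is the worst case of completely full data disks.
    """
    lost_tb = data_tb_per_node * fill_fraction
    survivors = nodes - 1
    return lost_tb, lost_tb / survivors


for label, nodes, tb in (("12 x 4 TB", 12, 4.0), ("6 x 8 TB", 6, 8.0)):
    lost, per_survivor = rereplication_load(nodes, tb)
    print(f"{label}: {lost:.1f} TB to re-replicate, "
          f"{per_survivor:.2f} TB per surviving node")
```

Under those assumptions a dead node in the 12-node layout leaves each survivor with roughly a third of a terabyte to absorb, against 1.6 TB each in the 6-node layout.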
1. I would like to hear other people's opinions.
2. The gridmix 2 benchmarking stuff tries to create synthetic benchmarks
from your real data runs. Try that, collect some data, then go to your
suppliers.
-Steve
COI disclaimer signature:
-----------------------
Hewlett-Packard Limited
Registered Office: Cain Road, Bracknell, Berks RG12 1HN
Registered No: 690597 England