Disclaimer: I am pretty useless when it comes to hardware

I had a lot of issues with non-ECC memory when running hundreds of
millions of inserts from MapReduce into HBase on a dev cluster.  The
errors were checksum errors; the consensus was that the memory was
causing them, and all the advice was to use ECC memory.  The same
cluster ran without (any apparent) error for simple counting
operations on tab-delimited files.

Cheers,
Tim

On Thu, Oct 1, 2009 at 11:49 AM, Steve Loughran <ste...@apache.org> wrote:
> Kevin Sweeney wrote:
>>
>> I really appreciate everyone's input. We've been going back and forth on
>> the
>> server size issue here. There are a few reasons we shot for the $1k price,
>> one because we wanted to be able to compare our datacenter costs vs. the
>> cloud costs. Another is that we have spec'd out a fast Intel node with
>> over-the-counter parts. We have a hard time justifying the dual-processor
>> costs and really don't see the need for the big server extras like
>> out-of-band management and redundancy. This is our proposed config, feel
>> free to criticize :)
>> Supermicro 512L-260 Chassis    $90
>> Supermicro X8SIL               $160
>> Heatsink                       $22
>> Intel 3460 Xeon                $350
>> Samsung 7200 RPM SATA2         2 x $85
>> 2GB non-ECC DIMM               4 x $65
>>
>> This totals $1052. Doesn't this seem like a reasonable setup? Isn't the
>> purpose of a Hadoop cluster to build from cheap, fast, replaceable nodes?
>
> Disclaimer 1: I work for a server vendor so may be biased. I will attempt to
> avoid this by not pointing you at HP DL180 or SL170z servers.
>
> Disclaimer 2: I probably don't know what I'm talking about. As far as Hadoop
> is concerned, I'm not sure anyone knows what "the right" configuration is.
>
> * I'd consider ECC RAM. On a large cluster, over time, memory errors will
> occur; you either notice them or you silently propagate their effects.
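Steve's "over time, errors occur" point can be put in rough numbers. A back-of-envelope sketch; the soft-error rate below is an assumed illustrative figure (published field studies vary by orders of magnitude), as are the cluster size and DIMM capacity:

```python
# Expected memory-error arithmetic for a small cluster.
# ERRORS_PER_GB_HOUR is an assumed soft-error rate for illustration only,
# not a measured figure for any particular DIMM.

ERRORS_PER_GB_HOUR = 1e-6   # assumed rate of bit errors per GB per hour
NODES = 20                  # assumed cluster size
GB_PER_NODE = 8             # assumed RAM per node
HOURS = 24 * 30             # one month of wall-clock time

# Expected errors scale linearly with total GB-hours of exposure.
expected_errors = ERRORS_PER_GB_HOUR * NODES * GB_PER_NODE * HOURS
print(f"Expected bit errors across the cluster per month: {expected_errors:.2f}")
```

Even with generous assumptions, the expectation is non-negligible once you multiply by nodes, RAM, and months of uptime, which is the argument for ECC: without it, a flipped bit surfaces later as a checksum failure or silent corruption.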
>
> * Worry about power, cooling and rack weight.
>
> * Include network costs, power budget. That's your own switch costs, plus
> bandwidth in and out.
>
> * There are some good arguments in favour of fewer, higher-end machines over
> many smaller ones: less network traffic, and often higher density.
>
> The cloud-hosted vs. owned question is an interesting one; I suspect the
> spreadsheet there is pretty complex.
>
> * Estimate how much data you will want to store over time. On S3, those
> costs ramp up fast; in your own rack you can maybe plan to stick an extra
> 2TB HDD in a year from now (space, power, cooling and weight permitting),
> paying next year's prices for next year's capacity.
>
> * Virtual machine management costs are different from physical management
> costs, especially if you don't invest time upfront on automating your
> datacentre software provisioning (custom RPMs, PXE preboot, kickstart, etc.).
> With VMs you can almost hand-manage an image (naughty, but possible), as long
> as you have a single image or two to push out. Even then, I'd automate, but
> at a higher level, creating images on demand as load/availability demands.
>
> -Steve
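The S3-vs.-owned-storage comparison Steve mentions can also be sketched as back-of-envelope arithmetic. All prices here are assumptions for illustration (2009-era ballpark figures for S3 storage, drive cost, and electricity), not quotes:

```python
# Back-of-envelope: cumulative S3 storage cost vs. adding a disk to your
# own rack.  All prices are assumed illustrative figures, not quotes.

S3_PER_GB_MONTH = 0.15   # assumed S3 storage price, $/GB/month
DISK_TB = 2              # assumed extra drive capacity, TB
DISK_COST = 200          # assumed drive price, $
MONTHS = 12

def s3_cost(tb, months, per_gb_month=S3_PER_GB_MONTH):
    """Cumulative S3 storage cost for tb terabytes held for `months` months."""
    return tb * 1024 * per_gb_month * months

def owned_cost(disk_cost=DISK_COST, power_watts=10, kwh_price=0.10,
               months=MONTHS):
    """Drive purchase plus drive electricity; deliberately ignores cooling,
    rack space, and admin time, which Steve's 'complex spreadsheet' covers."""
    hours = months * 30 * 24
    return disk_cost + (power_watts / 1000.0) * hours * kwh_price

print(f"S3, 2 TB held for a year:   ${s3_cost(DISK_TB, MONTHS):,.0f}")
print(f"Owned 2 TB drive, one year: ${owned_cost():,.0f}")
```

The gap widens every month you keep the data, which is why storage growth estimates matter so much to the hosted-vs.-owned decision; the owned figure here leaves out the real overheads (cooling, space, people), so treat it as a lower bound.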
