This is actually a very complex question.  Without trying to answer it
completely, the high points, as I see it, are:
a) [Most important] Different kinds of nodes require different Hadoop
configurations.  In particular, the number of simultaneous tasks per node
should presumably be set higher for a many-core node than for a few-core
node.
b) More nodes (potentially) give you more disk controllers, and more
aggregate memory-bus bandwidth to share among the disk controllers, RAM,
and CPUs.
c) More nodes give you (potentially, in a flat network fabric) more network
bandwidth between cores.
d) You can't always assume the cores are equivalent.

Details:

a) If all other issues were indeed equal, you'd configure
"mapred.tasktracker.map.tasks.maximum" and
"mapred.tasktracker.reduce.tasks.maximum" four times larger on an 8-core
system than on a 2-core system.  In the real world, you'll want to experiment
to optimize the settings for the actual hardware and actual job streams
you're running.
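
As a rough illustration of the scaling logic (the reserved core and the
2:1 map:reduce split below are just common starting heuristics of mine,
not Hadoop defaults), here's a quick Python sketch:

    # Sketch only: scale task slots with core count, reserving a core for
    # the DataNode/TaskTracker daemons themselves.  The 2:1 map:reduce
    # split is a starting heuristic, not a Hadoop default.
    def suggested_task_slots(cores, reserved_for_daemons=1):
        usable = max(1, cores - reserved_for_daemons)
        map_slots = max(1, round(usable * 2 / 3))
        reduce_slots = max(1, usable - map_slots)
        return map_slots, reduce_slots

    for cores in (2, 8):
        m, r = suggested_task_slots(cores)
        print(f"{cores} cores: mapred.tasktracker.map.tasks.maximum={m}, "
              f"mapred.tasktracker.reduce.tasks.maximum={r}")

(Note it doesn't come out to exactly 4x once you reserve a core for the
daemons, which is one more reason to benchmark rather than extrapolate.)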

b) If you're running modern server hardware, you've got DMA disk
controllers and a multi-GByte/sec memory bus, as well as bus controllers
that do a great job of multiplexing all the demands that share the FSB.
 However, as the disk count goes up and the budget goes down, you need to
look at whether you're going to saturate either the controller(s) or the
bus, given the local i/o access patterns of your particular workload.
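
To make that concrete, here's a back-of-envelope check in Python; both
throughput figures are assumptions for illustration only, so substitute
rates measured on your own hardware:

    # Saturation check with assumed (not measured) throughput figures.
    SPINDLE_MB_S = 100     # assumed sequential throughput per SATA disk
    CONTROLLER_MB_S = 300  # assumed shared controller/bus budget

    for spindles in (1, 4):
        aggregate = spindles * SPINDLE_MB_S
        verdict = "exceeds" if aggregate > CONTROLLER_MB_S else "fits within"
        print(f"{spindles} spindle(s): ~{aggregate} MB/s aggregate, "
              f"{verdict} a {CONTROLLER_MB_S} MB/s budget")

On those assumed numbers a 1-spindle node never stresses its controller
while a 4-spindle node can, which is exactly the trade-off between the two
setups in your mail.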

c) Similarly, given the NICs in your servers and your rack/switch
topology, you need to ask whether your network i/o access patterns,
especially during shuffle/sort, will risk saturating your network bandwidth.
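
Some illustrative shuffle arithmetic for the two setups in your mail (the
10 GbE figure is yours; the rest is a simplification that ignores overlap,
compression, and protocol overhead):

    # Per-task share of a 10 GbE link during shuffle, assuming every
    # configured task slot on a node is moving map output at once.
    NIC_MB_S = 10 * 1000 / 8  # 10 GbE is roughly 1250 MB/s on the wire

    for nodes, tasks_per_node in ((28, 2), (7, 8)):
        per_task = NIC_MB_S / tasks_per_node
        print(f"{nodes} nodes x {tasks_per_node} tasks/node: "
              f"~{per_task:.0f} MB/s of NIC bandwidth per task")

The aggregate picture matters too: 28 NICs give four times the total wire
capacity of 7 NICs, assuming a flat, non-blocking switch fabric.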

d) Make sure that the particular CPUs you are comparing actually have
comparable cores, because there's a world of difference among the cores
in the dozens of different CPU models on the market!

Hope this helps.  Cheers,
--Matt

On Sun, Jul 1, 2012 at 4:13 AM, Safdar Kureishy
<safdar.kurei...@gmail.com> wrote:

> Hi,
>
> I have a reasonably simple question that I thought I'd post to this list
> because I don't have enough experience with hardware to figure this out
> myself.
>
> Let's assume that I have 2 separate cluster setups for slave nodes. The
> master node is a separate machine *outside* these clusters:
> *Setup A*: 28 nodes, each with a 2-core CPU, 8 GB RAM and 1 SATA drive (1
> TB)
> *Setup B*: 7 nodes, each with an 8-core CPU, 32 GB RAM and 4 SATA drives
> (1 TB each)
>
> Note that I have maintained the same *core:memory:spindle* ratio above. In
> essence, setup B has the same overall processing + memory + spindle
> capacity, but achieved with 4 times fewer nodes.
>
> Ignoring the *cost* of each node above, and assuming 10 Gb Ethernet
> connectivity and the same speed per core across nodes in both the scenarios
> above, are Setup A and Setup B equivalent to each other in the context of
> setting up a Hadoop cluster? Or will the relative performance be different?
> Excluding the network connectivity between the nodes, what would be some
> other criteria that might give one setup an edge over the other, for
> regular Hadoop jobs?
>
> Also, assuming the same type of Hadoop jobs on both clusters, how different
> would the load experienced by the master node be for each setup above?
>
> Thanks in advance,
> Safdar
>
