More cores Vs More Nodes ?

2011-12-12 Thread praveenesh kumar
Hey Guys,

I have a fairly naive question about Hadoop cluster nodes:

More cores or more nodes? Should I spend money on going from 2-core to 4-core
machines, or spend it on buying more nodes with fewer cores, e.g. two 2-core
machines instead?

Thanks,
Praveenesh


RE: More cores Vs More Nodes ?

2011-12-13 Thread Brad Sarsfield
Praveenesh,

Your question is not naïve; in fact, optimal hardware design can ultimately be 
a very difficult question, and what would be "better" is hard to answer. If you 
made me pick one without much information, I'd go for more machines.  But...

It all depends; and there is no right answer :)

More machines 
+May run your workload faster
+Will give you a higher degree of reliability protection from node / 
hardware / hard drive failure.
+More aggregate IO capabilities
- capex / opex may be higher than allocating more cores
More cores 
+May run your workload faster
+More cores may allow for more tasks to run on the same machine
+More cores/tasks may reduce network contention, increasing task-to-task 
data flow performance.

Notice that "May run your workload faster" appears in both lists, as it can be 
very workload dependent.

My Experience:
I did a recent experiment and found that given the same number of cores (64) 
with the exact same network / machine configuration; 
A: I had 8 machines with 8 cores 
B: I had 28 machines with 2 cores (and 1x8 core head node)

B was able to outperform A by 2x on teragen and terasort. These machines 
were running in a virtualized environment, where some of the IO capabilities 
behind the scenes were being throttled to 400 Mbps per node in the 
2-core configuration vs. 1 Gbps in the 8-core one.  So I would expect the 
non-throttled scenario to do even better.
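As a back-of-envelope check on these numbers (an illustrative sketch, not from the thread; it assumes all 28 worker nodes in B ran at the throttled rate and ignores the head node), the aggregate network bandwidth available to B already exceeds A's:

```python
# Aggregate network bandwidth of the two configurations, using the
# per-node throttling figures quoted above (assumed uniform across nodes).
GBPS = 1000  # Mbps per Gbps

a_nodes, a_mbps_per_node = 8, 1 * GBPS   # A: 8 machines at 1 Gbps each
b_nodes, b_mbps_per_node = 28, 400       # B: 28 workers throttled to 400 Mbps

a_aggregate = a_nodes * a_mbps_per_node  # 8000 Mbps
b_aggregate = b_nodes * b_mbps_per_node  # 11200 Mbps

print(f"A: {a_aggregate} Mbps, B: {b_aggregate} Mbps")
print(f"B/A = {b_aggregate / a_aggregate:.2f}x")  # 1.40x
```

That is consistent with B winning on IO-bound jobs like teragen/terasort, even before counting disk spindles.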

~Brad


-Original Message-
From: praveenesh kumar [mailto:praveen...@gmail.com] 
Sent: Monday, December 12, 2011 8:51 PM
To: common-user@hadoop.apache.org
Subject: More cores Vs More Nodes ?

Hey Guys,

So I have a very naive question in my mind regarding Hadoop cluster nodes ?

more cores or more nodes - Shall I spend money on going from 2-4 core machines, 
or spend money on buying more nodes less core eg. say 2 machines of 2 cores for 
example?

Thanks,
Praveenesh



Re: More cores Vs More Nodes ?

2011-12-13 Thread Prashant Kommireddi
Hi Brad, how many tasktrackers did you have on each node in both cases?

Thanks,
Prashant

Sent from my iPhone



RE: More cores Vs More Nodes ?

2011-12-13 Thread Tom Deutsch
It also helps to know the profile of your jobs when you spec the 
machines. So in addition to Brad's response, you should consider whether 
your jobs will be more storage- or compute-oriented. 


Tom Deutsch
Program Director
Information Management
Big Data Technologies
IBM
3565 Harbor Blvd
Costa Mesa, CA 92626-1420
tdeut...@us.ibm.com








Re: More cores Vs More Nodes ?

2011-12-13 Thread real great..
More cores might help in Hadoop environments, as there would be more data
locality.
Your thoughts?



-- 
Regards,
R.V.


Re: More cores Vs More Nodes ?

2011-12-13 Thread Alexander Pivovarov
More nodes means more aggregate IO on reads during the map step.
If you use combiners, you might need to send only a small amount of data over
the network to the reducers.
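To illustrate the combiner point with a minimal sketch (a hypothetical word-count job, not from the thread): map-side pre-aggregation collapses repeated keys before they cross the network.

```python
from collections import Counter

# Hypothetical map output for a word-count job: one (word, 1) per token.
map_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1), ("cat", 1)]

# Without a combiner, every intermediate pair is shuffled to the reducers.
records_without_combiner = len(map_output)  # 5 records

# With a combiner, each mapper ships one (word, partial_sum) per distinct key.
combined = Counter()
for word, count in map_output:
    combined[word] += count
records_with_combiner = len(combined)  # 2 records: ("the", 3), ("cat", 2)

print(records_without_combiner, records_with_combiner)  # 5 2
```

The savings grow with the number of repeated keys per mapper, which is why combiners matter most for highly skewed or low-cardinality keys.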

Alexander




Re: More cores Vs More Nodes ?

2011-12-13 Thread bharath vissapragada
Hey there,

I agree with Tom's response: one can decide based on the type of jobs
you run. I have been working on Hive, and I found that increasing the number of
cores gave a very good performance boost, because joins and similar operations
are compute-oriented and consume a lot of CPU on the reduce side. This may not
be the case with other applications (like HBase?)

Thanks



-- 
Regards,
Bharath .V
w:http://researchweb.iiit.ac.in/~bharath.v


RE: More cores Vs More Nodes ?

2011-12-13 Thread Brad Sarsfield
Hi Prashant,

In each case I had a single tasktracker per node. I oversubscribed the total 
tasks per tasktracker/node by 1.5x the number of cores.

So for the 64-core allocation comparison:
In A (8 cores): each machine had a single tasktracker with 8 map / 4 
reduce slots, for 12 task slots total per machine, x 8 machines (including the 
head node)
In B (2 cores): each machine had a single tasktracker with 2 map / 1 
reduce slots, for 3 task slots total per machine, x 29 machines (including the 
head node, which was running 8 cores)
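The slot arithmetic above can be sketched as follows (a reconstruction; the roughly 2:1 map-to-reduce split is inferred from the 8/4 and 2/1 figures, not stated as a rule):

```python
def slots_per_node(cores, oversubscription=1.5):
    """Task slots per node, oversubscribed at 1.5x the core count."""
    total = int(cores * oversubscription)
    map_slots = (total * 2) // 3        # ~2:1 map:reduce split (inferred)
    reduce_slots = total - map_slots
    return map_slots, reduce_slots, total

print(slots_per_node(8))  # (8, 4, 12) -> configuration A
print(slots_per_node(2))  # (2, 1, 3)  -> configuration B
```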

The experiment was done in a cloud-hosted environment running a set of VMs.

~Brad




Re: More cores Vs More Nodes ?

2011-12-13 Thread He Chen
Hi Brad

This is a really interesting experiment. I am curious why you did not use 32
nodes of 2 cores each; that would make the number of CPU cores in the two
groups equal.

Chen



RE: More cores Vs More Nodes ?

2011-12-14 Thread Michael Segel

Sorry,
But having read the thread, I am going to have to say that this is definitely a 
silly question.
NOTE THE FOLLOWING: Silly questions are not a bad thing. I happen to ask them 
all the time. ;-)

Here's why I say it's a silly question...

Hadoop is a cost-effective solution when you build out 'commodity' servers. 
Now here's the rub. 
'Commodity servers' means something different to each person, and I don't want 
to get into a debate over the definition.

When building out a cluster, too many people gloss over the complexity: 1U vs 
2U box size; half-size or full-size motherboard; how many disks per node; how 
much memory; physical plant limitations (available rack space, costs if this 
is going into a colo...); power consumption; budget...

At a client, back in 2009, our first cluster was built on whatever hardware we 
could get. It was 5 blade servers with SCSI/SAS 2.5" disks, where we split each 
blade so we could have 10 nodes. Yeah, it was a mistake and a royal pain, but 
we got the cluster up and could do some simple PoCs. We then came up with 
our reference architecture for further PoCs and development. 
We built out the DNs with 8 cores, 32GB, and 4 x 2TB 3.5" drives. Why? Because 
based on our constraints, this gave us the optimal combination of price and 
performance. Note: we knew we would leave some performance on the table. It was 
a conscious decision to leave some performance on the table so that we could 
maximize the number of nodes and fit within our budget.

We chose 2TB drives because at the time they offered the best price/performance 
ratio. Today, that may be different.
We chose 32GB because at the time it was the sweet spot in memory prices. Today, 
with 3-channel memory, it looks like 36GB is the sweet spot. Of course YMMV. (It 
could be 48GB...)

Moving forward, I would reconsider the design because the price points on 
hardware have changed. 

That's going to be your driving factor. 

If you want to look at 64-core boxes, then you need 256GB of memory. Think of 
how many disks you have to add (64-128 disks).
Now ask yourself: is this a commodity box?

Now price that box out.
Then price out how many 8 core 1U boxes you can buy.
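With purely hypothetical prices (these figures are illustrative assumptions, not from the thread), the arithmetic looks something like:

```python
# Hypothetical prices for illustration only.
big_box_cost = 40_000    # assumed: one 64-core, 256GB, many-disk server
small_box_cost = 3_000   # assumed: one 8-core 1U commodity server

n_small = big_box_cost // small_box_cost
print(n_small)        # 13 commodity boxes for the same money
print(n_small * 8)    # 104 cores, plus far more spindles and network ports
```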

Kind of puts it in to perspective, doesn't it? ;-)

The reason I call this a 'silly question' is that you're attempting to look 
at your cluster by focusing on only one variable. 
This is not to say that it's a bad question, because it forces you to realize 
that there are definitely lots of other options that you have to consider.

HTH

-Mike
 


RE: More cores Vs More Nodes ?

2011-12-14 Thread Michael Segel

Aw Tommy, 
Actually no. You really don't want to do this.

If you actually ran a cluster and worked in the real world, you would find that 
if you purpose-build a cluster for one job, there will soon be a mandate that 
some other group use the cluster too, and their jobs will have different 
performance characteristics, so your cluster is now suboptimal for them...

Perhaps you meant that you need to think about the purpose of the cluster? 
That is, do you want to minimize the number of nodes but maximize the disk 
space per node, and use the cluster as your backup cluster? (Assuming that you 
are considering your DR and BCP in your design.)

The problem with your answer is that a 'job' has a specific meaning within the 
Hadoop world. You should have asked about the purpose of the cluster. 

I agree with Brad that it depends ... 

But the factors which will impact your cluster design are more along the lines 
of the purpose of the cluster and then the budget along with your IT 
constraints.

IMHO it's better to avoid purpose-built clusters. You end up not being able to 
easily recycle the hardware into new clusters. 

But hey what do I know? ;-)

  

Re: More cores Vs More Nodes ?

2011-12-14 Thread Brian Bockelman
Actually, there are varying degrees here.

If you have a successful project, you will find other groups at your door 
wanting to use the cluster too.  Their jobs might be different from the 
original use case.

However, if you don't understand the original use case ("CPU heavy or storage 
heavy?" is a great beginning question), your original project won't be 
successful.  Then there will be no follow-up users because you failed.

So, you want a reasonably general-purpose cluster, but make sure it 
matches well with the type of jobs you expect.  As an example, we had one group 
whose workload required an estimated CPU-millennia per byte of data… they 
needed a "general purpose cluster" for a certain value of "general purpose".

Brian

On Dec 14, 2011, at 7:29 AM, Michael Segel wrote:

> 
> Aw Tommy, 
> Actually no. You really don't want to do this.
> 
> If you actually ran a cluster and worked in the real world, you would find 
> that if you purposely build a cluster for one job, there will be a mandate 
> that some other group needs to use the cluster and that their job has 
> different performance issues and your cluster is now suboptimal for their 
> jobs...
> 
> Perhaps you meant that you needed to think about the purpose of the cluster? 
> That is do you want to minimize the nodes but maximize the disk space per 
> node and use the cluster as your backup cluster? (Assuming that you are 
> considering your DR and BCP in your design.)
> 
> The problem with your answer, is that a job has a specific meaning within the 
> Hadoop world.  You should have asked what is the purpose of the cluster. 
> 
> I agree w Brad, that it depends ... 
> 
> But the factors which will impact your cluster design are more along the 
> lines of the purpose of the cluster and then the budget along with your IT 
> constraints.
> 
> IMHO its better to avoid building purpose built clusters. You end up not 
> being able to easily recycle the hardware in to new clusters easily. 
> 
> But hey what do I know? ;-)
> 
>> To: common-user@hadoop.apache.org
>> Subject: RE: More cores Vs More Nodes ?
>> From: tdeut...@us.ibm.com
>> Date: Tue, 13 Dec 2011 09:46:49 -0800
>> 
>> It also helps to know the profile of your job in how you spec the 
>> machines. So in addition to Brad's response you should consider if you 
>> think your jobs will be more storage or compute oriented. 
>> 
>> 
>> Tom Deutsch
>> Program Director
>> Information Management
>> Big Data Technologies
>> IBM
>> 3565 Harbor Blvd
>> Costa Mesa, CA 92626-1420
>> tdeut...@us.ibm.com
>> 
>> 
>> 
>> 
>> Brad Sarsfield  
>> 12/13/2011 09:41 AM
>> Please respond to
>> common-user@hadoop.apache.org
>> 
>> 
>> To
>> "common-user@hadoop.apache.org" 
>> cc
>> 
>> Subject
>> RE: More cores Vs More Nodes ?
>> 
>> 
>> 
>> 
>> 
>> 
>> Praveenesh,
>> 
>> Your question is not naïve; in fact, optimal hardware design can 
>> ultimately be a very difficult question to answer on what would be 
>> "better". If you made me pick one without much information I'd go for more 
>> machines.  But...
>> 
>> It all depends; and there is no right answer :) 
>> 
>> More machines 
>> +May run your workload faster
>> +Will give you a higher degree of reliability protection 
>> from node / hardware / hard drive failure.
>> +More aggregate IO capabilities
>> - capex / opex may be higher than allocating more cores
>> More cores 
>> +May run your workload faster
>> +More cores may allow for more tasks to run on the same machine
>> +More cores/tasks may reduce network contention and increase 
>> task-to-task data flow performance.
>> 
>> Notice "May run your workload faster" is in both; as it can be very 
>> workload dependent.
>> 
>> My Experience:
>> I did a recent experiment and found that given the same number of cores 
>> (64) with the exact same network / machine configuration; 
>> A: I had 8 machines with 8 cores 
>> B: I had 28 machines with 2 cores (and 1x8 core head 
>> node)
>> 
>> B was able to outperform A by 2x using teragen and terasort. These 
>> machines were running in a virtualized environment; where some of the IO 
>> capabilities behind the scenes were being regulated to 400Mbps per node 
>> when running in the 2 core configuration vs 1Gbps on the 8 core.  So I 
>> would expect the non-throttled scenario to work even better.
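A back-of-envelope calculation suggests one reason B could win despite the per-node throttling: its aggregate IO capability is higher. This sketch only uses the node counts and bandwidth figures from Brad's post; it is not a full explanation of the 2x result (more spindles and less task contention per node likely contribute too).

```python
# Aggregate bandwidth from Brad's numbers (head nodes ignored for simplicity):
# A: 8 machines x 8 cores at 1 Gbps each
# B: 28 machines x 2 cores, throttled to 400 Mbps each
nodes_a, mbps_a = 8, 1000
nodes_b, mbps_b = 28, 400

agg_a = nodes_a * mbps_a   # aggregate Mbps for A
agg_b = nodes_b * mbps_b   # aggregate Mbps for B

assert agg_a == 8000
assert agg_b == 11200
# B has ~40% more aggregate IO even while throttled per node
assert round(agg_b / agg_a, 1) == 1.4
```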

RE: More cores Vs More Nodes ?

2011-12-14 Thread Tom Deutsch
Putting aside any smarmy responses for a moment - sorry that "job(s)" 
wasn't understood as equating to "purpose".

If you are building a general purpose sandbox then I think we all agree on 
building a "balanced" general purpose cluster. But if you have production 
use cases in mind then you darn well better try to understand how the 
cluster will be used/stressed so you don't end up with a hardware spec 
that doesn't match how the cluster is actually used.

If you can't profile a production use case as to how it will stress the 
cluster, that is a huge warning sign of project risk. If you are tearing 
down and re-purposing a cluster that was implemented to support a 
production use case then the planning failed. 


Tom Deutsch
Program Director
Information Management
Big Data Technologies
IBM
3565 Harbor Blvd
Costa Mesa, CA 92626-1420
tdeut...@us.ibm.com



RE: More cores Vs More Nodes ?

2011-12-14 Thread Michael Segel


Brian,

I think you missed my point.

The moment you go and design a cluster for a specific job, you end up getting 
fscked because there's another group who wants to use the shared resource for 
their job which could be orthogonal to the original purpose. It happens 
everyday.

This is why you have to ask if the cluster is being built for a specific 
purpose. Meaning answering the question 'Which of the following best describes 
your cluster: 
a) PoC
b) Development
c) Pre-prod
d) Production
e) Secondary/Backup
'

Note that sizing the cluster is a different matter. 
Meaning if you know you need a PB of storage, you're going to design the 
cluster differently because once you get to a certain size, you have to 
recognize that your clusters are going to have lots of disk and require 10GbE just 
for the storage. Number of cores would be less of an issue; however, again look 
at pricing. 2-socket 8-core Xeon motherboards are currently at an optimal price point. 
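The sizing arithmetic behind "a PB of storage drives the design" can be made concrete. The disk counts, disk size, and headroom fraction below are illustrative assumptions, not figures from this thread:

```python
import math

def nodes_for_usable_pb(usable_pb, disks_per_node, tb_per_disk,
                        replication=3, usable_fraction=0.75):
    """Rough node count for a usable HDFS capacity target.
    usable_fraction: share of raw disk left after OS/temp/log overhead."""
    per_node_tb = disks_per_node * tb_per_disk * usable_fraction
    raw_needed_tb = usable_pb * 1000 * replication  # 3 copies of every block
    return math.ceil(raw_needed_tb / per_node_tb)

# 1 PB usable with 12 x 2 TB disks per node and 3x replication:
assert nodes_for_usable_pb(1, 12, 2) == 167
```

At that node count, per-node disk density and network uplink bandwidth dominate the design long before core counts do, which is the point being made here.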

And again this goes back to the point I was trying to make.
You need to look beyond the number of cores as a determining factor. 
You go too small, you're going to take a hit because of the price/performance 
curve. 
(Remember that you have to consider Machine Room real estate. 100 2 core boxes 
take up much more space than 25 8 core boxes)

If you go to the other extreme... a 64-core giant SMP box: dollar for dollar, you 
could build out an 8-node cluster for less money. 

Beyond that, you really, really don't want to build a custom cluster for a 
specific job unless you know that you're going to be running that specific job 
or set of jobs (24x7X365) [And yes, I came across such a use case...]

HTH

-Mike
> From: bbock...@cse.unl.edu
> Subject: Re: More cores Vs More Nodes ?
> Date: Wed, 14 Dec 2011 07:41:25 -0600
> To: common-user@hadoop.apache.org
> 
> Actually, there are varying degrees here.
> 
> If you have a successful project, you will find other groups at your door 
> wanting to use the cluster too.  Their jobs might be different from the 
> original use case.
> 
> However, if you don't understand the original use case ("CPU heavy or storage 
> heavy?" is a great beginning question), your original project won't be 
> successful.  Then there will be no follow-up users because you failed.
> 
> So, you want to have a reasonably general-purpose cluster, but make sure it 
> matches well with the type of jobs.  As an example, we had one group who 
> required an estimated CPU-millenia per byte of data… they needed a "general 
> purpose cluster" for a certain value of "general purpose".
> 
> Brian
> 
> On Dec 14, 2011, at 7:29 AM, Michael Segel wrote:
> 
> > 
> > Aw Tommy, 
> > Actually no. You really don't want to do this.
> > 
> > If you actually ran a cluster and worked in the real world, you would find 
> > that if you purposely build a cluster for one job, there will be a mandate 
> > that some other group needs to use the cluster and that their job has 
> > different performance issues and your cluster is now suboptimal for their 
> > jobs...
> > 
> > Perhaps you meant that you needed to think about the purpose of the 
> > cluster? That is do you want to minimize the nodes but maximize the disk 
> > space per node and use the cluster as your backup cluster? (Assuming that 
> > you are considering your DR and BCP in your design.)
> > 
> > The problem with your answer, is that a job has a specific meaning within 
> > the Hadoop world.  You should have asked what is the purpose of the 
> > cluster. 
> > 
> > I agree w Brad, that it depends ... 
> > 
> > But the factors which will impact your cluster design are more along the 
> > lines of the purpose of the cluster and then the budget along with your IT 
> > constraints.
> > 
> > IMHO its better to avoid building purpose built clusters. You end up not 
> > being able to easily recycle the hardware in to new clusters easily. 
> > 
> > But hey what do I know? ;-)
> > 
> >> To: common-user@hadoop.apache.org
> >> Subject: RE: More cores Vs More Nodes ?
> >> From: tdeut...@us.ibm.com
> >> Date: Tue, 13 Dec 2011 09:46:49 -0800
> >> 
> >> It also helps to know the profile of your job in how you spec the 
> >> machines. So in addition to Brad's response you should consider if you 
> >> think your jobs will be more storage or compute oriented. 
> >> 
> >> ------------
> >> Tom Deutsch
> >> Program Director
> >> Information Management
> >> Big Data Technologies
> >> IBM
> >> 3565 Harbor Blvd
> >> Costa Mesa, 

RE: More cores Vs More Nodes ?

2011-12-14 Thread Michael Segel

Tommy,

Again, I think you need to really have some real world experience before you 
make generalizations like that.

Sorry, but at a client, we put 6 different groups' applications in production. 
Without going into detail, the jobs in production were orthogonal to one 
another. The point is that had we built our cluster optimized for one job, we 
would have been screwed. Oh wait, I forgot that you worked for IBM and they 
would love to sell you more hardware and consulting to improve the situation... 
(I kee-id, I kee-id) 

Now Seriously, 
The point of this discussion is that you really, really don't want to build the 
cluster optimized for a single job.
The only time you want to do that is if you have a job or set of jobs that you 
plan on running every day 24x7 and the job takes the entire cluster. 
Yes, such jobs do exist. However they are highly irregular and definitely not 
the norm.

One of the other pain points is that developers have to get used to the cluster 
as a shared resource to be used between different teams. This helps to defray 
the costs, including maintenance. So as a shared resource for both development and 
production, you need to build out a box that handles everything equally.

Had you attended our session at Hadoop World, not only would you have learned 
this... (Don't tune the cluster to the application, but tune the application to 
the cluster), I would have also poked fun at you in person. ;-)

We also talked about avoiding the internet myths and 'truisms'. 

Unless you've gotten your hands dirty at customers' sites, you're going to find 
the real world is a different place. ;-)
But hey! What do I know?


> To: common-user@hadoop.apache.org
> Subject: RE: More cores Vs More Nodes ?
> From: tdeut...@us.ibm.com
> Date: Wed, 14 Dec 2011 07:56:30 -0800
> 
> Putting aside any smarmy responses for a moment - sorry that "job(s)" 
> wasn't understood as equating to "purpose".
> 
> If you are building a general purpose sandbox then I think we all agree on 
> building a "balanced" general purpose cluster. But if you have production 
> use cases in mind then you darn well better try to understand how the 
> cluster will be used/stressed so you don't end up with a hardware spec 
> that doesn't match how the cluster is actually used.
> 
> If you can't profile a production use case as to how it will stress the 
> cluster that is a huge warning sign as to project risk. If you are tearing 
> down and re-purposing a cluster that was implemented to support a 
> production use case then the planning failed. 
> 
> 
> Tom Deutsch
> Program Director
> Information Management
> Big Data Technologies
> IBM
> 3565 Harbor Blvd
> Costa Mesa, CA 92626-1420
> tdeut...@us.ibm.com
> 
  

Re: More cores Vs More Nodes ?

2011-12-14 Thread Scott Carey


On 12/14/11 9:05 AM, "Michael Segel"  wrote:

>
>
>Brian,
>
>I think you missed my point.
>
>The moment you go and design a cluster for a specific job, you end up
>getting fscked because there's another group who wants to use the shared
>resource for their job which could be orthogonal to the original purpose.
>It happens everyday.
>
>This is why you have to ask if the cluster is being built for a specific
>purpose. Meaning answering the question 'Which of the following best
>describes your cluster:
>a) PoC
>b) Development
>c) Pre-prod
>d) Production
>e) Secondary/Backup
>"
>
>Note that sizing the cluster is a different matter.
>Meaning if you know you need a PB of storage, you're going to design the
>cluster differently because once you get to a certain size, you have to
>recognize that your clusters are going to have lots of disk, require
>10GBe just for the storage. Number of cores would be less of an issue,
>however again look at pricing. 2 socket 8 core Xeon MBs are currently at
>an optimal price point.

Recently, single socket servers have been 9 to 12 months ahead of the
curve on next generation processor availability.

I found 1 socket quad core Xeon a better value because a single socket 4
core system performs at the CPU level of ~5.5 cores of a dual socket
system due to faster Ghz and newer generation processors on the single
socket system -- At least earlier this year.   Sandy Bridge is finally
moving to dual socket.   Single socket quad core Xeon at 3.4Ghz is much
more than half as capable as dual socket quad @2.66Ghz.

1 socket versus 2 is a moving target.
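Scott's single-socket vs dual-socket comparison can be checked with simple arithmetic. The per-clock (IPC) uplift figure below is an assumed illustrative number for the newer-generation core, not something stated in the post:

```python
# 1-socket quad core at 3.4 GHz vs 2-socket quad (8 cores) at 2.66 GHz
single_cores, single_ghz = 4, 3.4
dual_cores, dual_ghz = 8, 2.66

# Clock alone: how many 2.66 GHz cores the single-socket box equals
equiv_cores = single_cores * single_ghz / dual_ghz
assert round(equiv_cores, 2) == 5.11

# With an assumed ~8% per-clock uplift for the newer core: ~5.5 cores
assert round(equiv_cores * 1.08, 1) == 5.5

# "Much more than half as capable" as the dual-socket 8-core box
assert equiv_cores / dual_cores > 0.5
```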

In our case, we had a $ budget and a low power/rack capacity.   We
compared what we could get for various designs in terms of:

aggregate CPU  (CPU core count * Ghz)
aggregate Memory bandwidth
aggregate RAM
aggregate Disk capacity
aggregate network throughput

And chose the single socket, 1U system based on our constraints and what
we could get with a variety of designs (all single socket or dual socket,
1U and 2U nodes, 4 to 12 drives / node).   We had a range of acceptable
Storage to CPU ratio, CPU to RAM ratio, and network to storage ratio.
With fewer CPUs we had fewer disks and less RAM per machine, but more total
servers.  This was also influenced by availability concerns -- the more
disk per node, the faster your network per node needs to be in order to
replicate on a failure.  Smaller servers meant significantly cheaper
network since bonded 1Gb link pairs were good enough.

Given various constraints and needs different organizations will find
different sweet spots.  And given the hardware available at the time, the
sweet spot moves as well.
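The comparison process described above can be sketched as scoring candidate node designs by the listed aggregate metrics. Every number here is a made-up placeholder for illustration, not a figure from Scott's cluster:

```python
# Score node designs by aggregate capability under a fixed budget.
def aggregates(nodes, cores, ghz, ram_gb, disks, tb_per_disk, net_gbps):
    return {
        "cpu_core_ghz": nodes * cores * ghz,          # aggregate CPU
        "ram_gb":       nodes * ram_gb,               # aggregate RAM
        "disk_tb":      nodes * disks * tb_per_disk,  # aggregate storage
        "net_gbps":     nodes * net_gbps,             # aggregate network
    }

# Hypothetical: 40 single-socket 1U nodes (bonded 1Gb pair) vs
# 20 dual-socket 2U nodes at roughly the same total cost.
single_1u = aggregates(40, 4, 3.4, 16, 4, 1, 2.0)
dual_2u   = aggregates(20, 8, 2.66, 48, 12, 1, 2.0)

# More, smaller nodes buy aggregate CPU and network; fewer, bigger
# nodes buy RAM and disk density. The sweet spot is constraint-driven.
assert single_1u["cpu_core_ghz"] > dual_2u["cpu_core_ghz"]  # 544 vs 425.6
assert single_1u["net_gbps"]     > dual_2u["net_gbps"]      # 80 vs 40
assert dual_2u["ram_gb"]         > single_1u["ram_gb"]      # 960 vs 640
assert dual_2u["disk_tb"]        > single_1u["disk_tb"]     # 240 vs 160
```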

> 
>
>And again this goes back to the point I was trying to make.
>You need to look beyond the number of cores as a determining factor.
>You go too small, you're going to take a hit because of the
>price/performance curve.
>(Remember that you have to consider Machine Room real estate. 100 2 core
>boxes take up much more space than 25 8 core boxes)
>
>If you go to the other extreme... 64 core giant SMP box $ for $$$
>(less money) build out an 8 node cluster.
>
>Beyond that, you really, really don't want to build a custom cluster for
>a specific job unless you know that you're going to be running that
>specific job or set of jobs (24x7X365) [And yes, I came across such a use
>case...]
>
>HTH
>
>-Mike
>> From: bbock...@cse.unl.edu
>> Subject: Re: More cores Vs More Nodes ?
>> Date: Wed, 14 Dec 2011 07:41:25 -0600
>> To: common-user@hadoop.apache.org
>> 
>> Actually, there are varying degrees here.
>> 
>> If you have a successful project, you will find other groups at your
>>door wanting to use the cluster too.  Their jobs might be different from
>>the original use case.
>> 
>> However, if you don't understand the original use case ("CPU heavy or
>>storage heavy?" is a great beginning question), your original project
>>won't be successful.  Then there will be no follow-up users because you
>>failed.
>> 
>> So, you want to have a reasonably general-purpose cluster, but make
>>sure it matches well with the type of jobs.  As an example, we had one
>>group who required an estimated CPU-millennia per byte of data… they
>>needed a "general purpose cluster" for a certain value of "general
>>purpose".
>> 
>> Brian
>> 
>> On Dec 14, 2011, at 7:29 AM, Michael Segel wrote:
>> 
>> > 
>> > Aw Tommy, 
>> > Actually no. You really don't want to do this.
>> > 
>> > If you actually ran a cluster and worked in the real world, you would
>>find that if you purposely build a c

RE: More cores Vs More Nodes ?

2011-12-14 Thread Tom Deutsch
Your eagerness to insult is throwing you off track here Michael. 

For example, the workload profile of a cluster doing heavy NLP is very 
different from one serving as a destination for large scale 
application/web logs. Ditto for P&C risk modeling vs smart meter use 
cases, etc etc...Those are not general purpose clusters. You may - and 
should I'd say - have the NLP use cases in a common analytics environment 
(internal cloud model) for sharing of methods/skills, but putting 
orthogonal use cases on that cluster is not inherently a best practice.

How those clusters should be built does vary, and no it is not uncommon to 
have focused use cases like that. If you know it is going to be a general 
purpose cluster then do build it in a balanced spec. 




Tom Deutsch
Program Director
Information Management
Big Data Technologies
IBM
3565 Harbor Blvd
Costa Mesa, CA 92626-1420
tdeut...@us.ibm.com

Re: More cores Vs More Nodes ?

2011-12-14 Thread Russell Jurney
You're using OS virtualization in your test.  Are you using it in production?

Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com

On Dec 13, 2011, at 5:16 PM, Brad Sarsfield  wrote:

> The experiment was done in a cloud hosted environment running set of VMs.
>
> ~Brad
>
> -Original Message-
> From: Prashant Kommireddi [mailto:prash1...@gmail.com]
> Sent: Tuesday, December 13, 2011 9:46 AM
> To: common-user@hadoop.apache.org
> Subject: Re: More cores Vs More Nodes ?
>
> Hi Brad, how many taskstrackers did you have on each node in both cases?
>
> Thanks,
> Prashant
>
> Sent from my iPhone
>
> On Dec 13, 2011, at 9:42 AM, Brad Sarsfield  wrote:
>
>> Praveenesh,
>>
>> Your question is not naïve; in fact, optimal hardware design can ultimately 
>> be a very difficult question to answer on what would be "better". If you 
>> made me pick one without much information I'd go for more machines.  But...
>>
>> It all depends; and there is no right answer :)
>>
>> More machines
>>  +May run your workload faster
>>  +Will give you a higher degree of reliability protection from node / 
>> hardware / hard drive failure.
>>  +More aggregate IO capabilities
>>  - capex / opex may be higher than allocating more cores
>> More cores
>>  +May run your workload faster
>>  +More cores may allow for more tasks to run on the same machine
>>  +More cores/tasks may reduce network contention and increase 
>> task-to-task data flow performance.
>>
>> Notice "May run your workload faster" is in both; as it can be very workload 
>> dependent.
>>
>> My Experience:
>> I did a recent experiment and found that given the same number of cores (64) 
>> with the exact same network / machine configuration;
>>  A: I had 8 machines with 8 cores
>>  B: I had 28 machines with 2 cores (and 1x8 core head node)
>>
>> B was able to outperform A by 2x using teragen and terasort. These machines 
>> were running in a virtualized environment; where some of the IO capabilities 
>> behind the scenes were being regulated to 400Mbps per node when running in 
>> the 2 core configuration vs 1Gbps on the 8 core.  So I would expect the 
>> non-throttled scenario to work even better.
>>
>> ~Brad
>>
>>
>> -Original Message-
>> From: praveenesh kumar [mailto:praveen...@gmail.com]
>> Sent: Monday, December 12, 2011 8:51 PM
>> To: common-user@hadoop.apache.org
>> Subject: More cores Vs More Nodes ?
>>
>> Hey Guys,
>>
>> So I have a very naive question in my mind regarding Hadoop cluster nodes ?
>>
>> more cores or more nodes - Shall I spend money on going from 2-4 core 
>> machines, or spend money on buying more nodes less core eg. say 2 machines 
>> of 2 cores for example?
>>
>> Thanks,
>> Praveenesh
>>
>


RE: More cores Vs More Nodes ?

2011-12-14 Thread Brad Sarsfield
Hi Russell,

We will be in production soon with both OS virtualized Hadoop deployments along 
with existing bare metal deployments.

We are finding tradeoffs on both sides. On the virtualization side, cluster 
elasticity is better and deployment times are shorter. Node recovery can be 
faster with a VM image restore. VM migration from one server to another makes 
planned hardware upgrades/repairs easier. But there's always the virtualization 
overhead/tax to pay, along with what can be a set of multi-VM or multi-tenancy 
overhead.

I have been thinking about experimenting with a topology/rack level awareness 
scheme where one would map physical VM hosts to the VM's Hadoop instance rack 
affinity nesting level.
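Brad's idea could be prototyped as a Hadoop topology script that reports each VM's physical host as its "rack", so HDFS avoids placing all replicas of a block on VMs sharing one hypervisor. This is a hedged sketch: the inventory mapping is hypothetical, and a real deployment would wire the script up via `topology.script.file.name` in core-site.xml.

```python
#!/usr/bin/env python
import sys

# Hypothetical inventory: VM address -> physical host it runs on
VM_TO_HOST = {
    "10.0.0.11": "host01",
    "10.0.0.12": "host01",   # same hypervisor as .11
    "10.0.0.21": "host02",
}

def rack_of(addr):
    # Nest the physical host under a datacenter level, matching the
    # "rack affinity nesting level" idea above
    return "/dc1/" + VM_TO_HOST.get(addr, "default-host")

if __name__ == "__main__":
    # Hadoop invokes the script with one or more addresses and expects
    # one rack path per argument, space-separated, on stdout
    print(" ".join(rack_of(a) for a in sys.argv[1:]))
```

With this mapping, the default block placement policy ("one replica off-rack") translates into "one replica on a different physical host", which is the failure domain that actually matters under virtualization.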

~Brad

-Original Message-
From: Russell Jurney [mailto:russell.jur...@gmail.com] 
Sent: Wednesday, December 14, 2011 1:27 PM
To: common-user@hadoop.apache.org
Subject: Re: More cores Vs More Nodes ?

You're using OS virtualization in your test.  Are you using it in production?

Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com

On Dec 13, 2011, at 5:16 PM, Brad Sarsfield  wrote:

> The experiment was done in a cloud hosted environment running set of VMs.
>
> ~Brad
>
> -Original Message-
> From: Prashant Kommireddi [mailto:prash1...@gmail.com]
> Sent: Tuesday, December 13, 2011 9:46 AM
> To: common-user@hadoop.apache.org
> Subject: Re: More cores Vs More Nodes ?
>
> Hi Brad, how many taskstrackers did you have on each node in both cases?
>
> Thanks,
> Prashant
>
> Sent from my iPhone
>
> On Dec 13, 2011, at 9:42 AM, Brad Sarsfield  wrote:
>
>> Praveenesh,
>>
>> Your question is not naïve; in fact, optimal hardware design can ultimately 
>> be a very difficult question to answer on what would be "better". If you 
>> made me pick one without much information I'd go for more machines.  But...
>>
>> It all depends; and there is no right answer :)
>>
>> More machines
>>  +May run your workload faster
>>  +Will give you a higher degree of reliability protection from node / 
>> hardware / hard drive failure.
>>  +More aggregate IO capabilities
>>  - capex / opex may be higher than allocating more cores
>> More cores
>>  +May run your workload faster
>>  +More cores may allow for more tasks to run on the same machine
>>  +More cores/tasks may reduce network contention and increase 
>> task-to-task data flow performance.
>>
>> Notice "May run your workload faster" is in both; as it can be very workload 
>> dependent.
>>
>> My Experience:
>> I did a recent experiment and found that given the same number of 
>> cores (64) with the exact same network / machine configuration;
>>  A: I had 8 machines with 8 cores
>>  B: I had 28 machines with 2 cores (and 1x8 core head node)
>>
>> B was able to outperform A by 2x using teragen and terasort. These machines 
>> were running in a virtualized environment; where some of the IO capabilities 
>> behind the scenes were being regulated to 400Mbps per node when running in 
>> the 2 core configuration vs 1Gbps on the 8 core.  So I would expect the 
>> non-throttled scenario to work even better.
>>
>> ~Brad
>>
>>
>> -Original Message-
>> From: praveenesh kumar [mailto:praveen...@gmail.com]
>> Sent: Monday, December 12, 2011 8:51 PM
>> To: common-user@hadoop.apache.org
>> Subject: More cores Vs More Nodes ?
>>
>> Hey Guys,
>>
>> So I have a very naive question in my mind regarding Hadoop cluster nodes ?
>>
>> more cores or more nodes - Shall I spend money on going from 2-4 core 
>> machines, or spend money on buying more nodes less core eg. say 2 machines 
>> of 2 cores for example?
>>
>> Thanks,
>> Praveenesh
>>
>



RE: More cores Vs More Nodes ?

2011-12-15 Thread Michael Segel


Tom,

Look, 
I've said this before and I'm going to say it again.

Your knowledge of Hadoop is purely academic. It may be ok to talk to C level 
execs who visit the San Jose IM Lab or in Markham, but when you give answers on 
issues where you don't have first-hand practical experience, you end up doing more 
harm than good.

The problem is that too many people blindly accept what they see on the web as 
fact when it's not always accurate and may not suit their needs.
I've lost count of the number of hours I've spent in meetings trying to undo 
the damage caused by someone saying "... but FB does it this way... therefore 
that's how we should do it."

Now Michael St.Ack is a pretty smart guy. He knows his shit. He's extremely 
credible. However when he says that FB does something a specific way, that is 
because FB has certain requirements and the solution works for them. It doesn't 
mean that it will be the best solution for your customer/client.

And Tom, if we pull out your business card, you have a nice fancy title with 
IBM. So you instantly have some credibility. Unfortunately, you're no St.Ack.  
(I'd put a smiley face but I'm actually trying to be serious.)

Even in this post, you continue to go down the wrong path. 
Unfortunately I don't have time to lecture you on why what you said is wrong 
and that your thoughts on cluster design are way off base. 
Oh and I tease you because frankly, you deserve it. 

I have to apologize to everyone on the list, but in the past, you failed to 
actually stop and take the hint that maybe you need to rethink your views on 
Hadoop.  Had you had practical experience setting up actual clusters (not 
EC2 clusters), you would have the necessary understanding of what can go wrong 
and how to fix it. 

If I get time, I'll have to find my copy of "Up Front" by Bill Mauldin. There's 
a cartoon that really fits you.

Later


> To: common-user@hadoop.apache.org
> Subject: RE: More cores Vs More Nodes ?
> From: tdeut...@us.ibm.com
> Date: Wed, 14 Dec 2011 11:40:51 -0800
> 
> Your eagerness to insult is throwing you off track here Michael. 
> 
> For example, the workload profile of a cluster doing heavy NLP is very 
> different from one serving as a destination for large scale 
> application/web logs. Ditto for P&C risk modeling vs smart meter use 
> cases, etc etc...Those are not general purpose clusters. You may - and 
> should I'd say - have the NLP use cases in a common analytics environment 
> (internal cloud model) for sharing of methods/skills, but putting 
> orthogonal use cases on that cluster is not inherently a best practice.
> 
> How those clusters should be built does vary, and no it is not uncommon to 
> have focused use cases like that. If you know it is going to be a general 
> purpose cluster then do build it in a balanced spec. 
> 
> 
> 
> 
> Tom Deutsch
> Program Director
> Information Management
> Big Data Technologies
> IBM
> 3565 Harbor Blvd
> Costa Mesa, CA 92626-1420
> tdeut...@us.ibm.com
  

Re: More cores Vs More Nodes ?

2011-12-17 Thread Michel Segel
Brad, you said 64-core allocations.
So how many cores are lost due to the overhead of virtualization?
Isn't it 1 core per VM?
So you end up losing 8 cores when you create 8 VMs... Right?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 13, 2011, at 7:15 PM, Brad Sarsfield  wrote:

> Hi Prashant,
> 
> In each case I had a single tasktracker per node. I oversubscribed the total 
> tasks per tasktracker/node by 1.5 x # of cores.
> 
> So for the 64 core allocation comparison.
>In A: 8 cores; Each machine had a single tasktracker with 8 maps / 4 
> reduce slots for 12 task slots total per machine x 8 machines (including head 
> node)
>In B: 2 cores; Each machine had a single tasktracker with 2 maps / 1 
> reduce slots for 3 slots total per machine x 29 machines (including the head 
> node which was running 8 cores)
> 
> The experiment was done in a cloud hosted environment running set of VMs.
> 
> ~Brad
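Brad's slot arithmetic above can be sketched as a small rule. The 2/3 map share is inferred from his numbers (8 map / 4 reduce, 2 map / 1 reduce), not stated as a general rule in his post:

```python
# Oversubscription rule: total slots = 1.5 x cores, split roughly
# 2:1 between map and reduce slots (inferred split, see lead-in).
def slots(cores, oversubscribe=1.5, map_share=2/3):
    total = round(cores * oversubscribe)
    maps = round(total * map_share)
    return maps, total - maps  # (map slots, reduce slots)

assert slots(8) == (8, 4)   # A: 12 slots per 8-core node
assert slots(2) == (2, 1)   # B: 3 slots per 2-core node
```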
> 
> -Original Message-
> From: Prashant Kommireddi [mailto:prash1...@gmail.com] 
> Sent: Tuesday, December 13, 2011 9:46 AM
> To: common-user@hadoop.apache.org
> Subject: Re: More cores Vs More Nodes ?
> 
> Hi Brad, how many taskstrackers did you have on each node in both cases?
> 
> Thanks,
> Prashant
> 
> Sent from my iPhone
> 
> On Dec 13, 2011, at 9:42 AM, Brad Sarsfield  wrote:
> 
>> Praveenesh,
>> 
>> Your question is not naïve; in fact, optimal hardware design can ultimately 
>> be a very difficult question to answer on what would be "better". If you 
>> made me pick one without much information I'd go for more machines.  But...
>> 
>> It all depends; and there is no right answer :)
>> 
>> More machines
>>   +May run your workload faster
>>   +Will give you a higher degree of reliability protection from node / 
>> hardware / hard drive failure.
>>   +More aggregate IO capabilities
>>   - capex / opex may be higher than allocating more cores
>> More cores
>>   +May run your workload faster
>>   +More cores may allow for more tasks to run on the same machine
>>   +More cores/tasks may reduce network contention and increase 
>> task-to-task data flow performance.
>> 
>> Notice "May run your workload faster" is in both; as it can be very workload 
>> dependent.
>> 
>> My Experience:
>> I did a recent experiment and found that given the same number of cores (64) 
>> with the exact same network / machine configuration;
>>   A: I had 8 machines with 8 cores
>>   B: I had 28 machines with 2 cores (and 1x8 core head node)
>> 
>> B was able to outperform A by 2x using teragen and terasort. These machines 
>> were running in a virtualized environment; where some of the IO capabilities 
>> behind the scenes were being regulated to 400Mbps per node when running in 
>> the 2 core configuration vs 1Gbps on the 8 core.  So I would expect the 
>> non-throttled scenario to work even better.
>> 
>> ~Brad
>> 
>> 
>> -Original Message-
>> From: praveenesh kumar [mailto:praveen...@gmail.com]
>> Sent: Monday, December 12, 2011 8:51 PM
>> To: common-user@hadoop.apache.org
>> Subject: More cores Vs More Nodes ?
>> 
>> Hey Guys,
>> 
>> So I have a very naive question in my mind regarding Hadoop cluster nodes ?
>> 
>> more cores or more nodes - Shall I spend money on going from 2-4 core 
>> machines, or spend money on buying more nodes less core eg. say 2 machines 
>> of 2 cores for example?
>> 
>> Thanks,
>> Praveenesh
>> 
> 
>