There could be many different things causing this. For example, if you only
have a single partition of data, increasing the number of tasks will only
increase execution time due to higher scheduling overhead. Additionally, how
large is a single partition in your application relative to the amount of
memory on the machine? If you are running on a machine with a small amount of
memory, increasing the number of executors per machine may increase GC/memory
pressure. On a single node, your executors share the same memory and I/O
system, so you could end up thrashing everything.
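The single-partition effect can be sketched with a toy model (my own simplified assumption for illustration, not Spark's actual scheduler): if useful parallelism is capped by the number of partitions, every extra task only adds scheduling overhead.

```python
# Toy model with assumed numbers, not measurements of Spark: fixed work W,
# a constant per-task scheduling overhead, and parallelism capped at the
# number of data partitions.
def runtime(work, tasks, partitions, overhead_per_task=0.05):
    # Only as many tasks as there are partitions can do useful work at once.
    effective_parallelism = min(tasks, partitions)
    return work / effective_parallelism + overhead_per_task * tasks

# With a single partition, adding tasks strictly increases total time:
print(runtime(work=10.0, tasks=1, partitions=1))  # 10.05
print(runtime(work=10.0, tasks=8, partitions=1))  # 10.4
```

In this sketch the data-parallel work never speeds up past the partition count, so the overhead term dominates, which matches the behavior described above.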

In any case, you can’t normally generalize between increased parallelism on a 
single node and increased parallelism across a cluster. If you are purely 
limited by CPU, then yes, you can normally make that generalization. However, 
when you increase the number of workers in a cluster, you are providing your 
app with more resources (memory capacity and bandwidth, and disk bandwidth). 
When you increase the number of tasks executing on a single node, you do not 
increase the pool of available resources.
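One way to see the difference, again as a toy calculation with assumed numbers rather than a benchmark: total I/O bandwidth grows with the number of nodes, but on a single node it is a fixed pool shared by every task.

```python
# Toy calculation with an assumed figure (100 MB/s per node, purely
# illustrative): each node contributes a fixed amount of I/O bandwidth,
# shared by the tasks running on it.
def bandwidth_per_task(tasks, nodes, per_node_mb_s=100.0):
    return nodes * per_node_mb_s / tasks

# Scaling out: 8 tasks on 8 nodes keep the same per-task bandwidth as 1 on 1.
print(bandwidth_per_task(tasks=8, nodes=8))  # 100.0
# Scaling up on one node: 8 tasks now share a single node's bandwidth.
print(bandwidth_per_task(tasks=8, nodes=1))  # 12.5
```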

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Feb 21, 2015, at 4:11 PM, Deep Pradhan <pradhandeep1...@gmail.com> wrote:

> No, I just have a single node standalone cluster.
> 
> I am not tweaking the code to increase parallelism. I am just
> running the SparkKMeans example that ships with Spark 1.0.0.
> I just wanted to know if this behavior is natural, and if so, what causes
> it?
> 
> Thank you
> 
> On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen <so...@cloudera.com> wrote:
> What's your storage like? Are you adding worker machines that are
> remote from where the data lives? I wonder if it just means you are
> spending more and more time sending the data over the network as you
> try to ship more of it to more remote workers.
> 
> To answer your question: no, in general more workers means more
> parallelism and therefore faster execution. But that depends on a lot
> of things. For example, if your process isn't parallelized to use all
> available execution slots, adding more slots doesn't do anything.
> 
> On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan <pradhandeep1...@gmail.com> 
> wrote:
> > Yes, I am talking about standalone single node cluster.
> >
> > No, I am not increasing parallelism. I just wanted to know if it is natural.
> > Does message passing across the workers account for this behavior?
> >
> > I am running SparkKMeans, just to validate one prediction model. I am
> > using several data sets. I am using standalone mode, varying the workers
> > from 1 to 16.
> >
> > On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> I can imagine a few reasons. Adding workers might cause fewer tasks to
> >> execute locally (?) So you may be executing more remotely.
> >>
> >> Are you increasing parallelism? For trivial jobs, chopping them up
> >> further may cause you to pay more overhead of managing so many small
> >> tasks, for no speedup in execution time.
> >>
> >> Can you provide any more specifics though? You haven't said what
> >> you're running, what mode, how many workers, how long it takes, etc.
> >>
> >> On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan <pradhandeep1...@gmail.com>
> >> wrote:
> >> > Hi,
> >> > I have been running some jobs in my local single-node standalone
> >> > cluster. I am varying the number of worker instances for the same
> >> > job, and the time taken for the job to complete increases as the
> >> > number of workers increases. I repeated the experiments varying the
> >> > number of nodes in a cluster too, and the same behavior is seen.
> >> > Can the idea of worker instances be extrapolated to the nodes in a
> >> > cluster?
> >> >
> >> > Thank You
> >
> >
> 
