There could be many different things causing this. For example, if you only have a single partition of data, asking for more parallelism only increases execution time through extra scheduling overhead, because Spark runs one task per partition and a single partition can only ever be processed by one task. Additionally, how large is a single partition in your application relative to the amount of memory on the machine? If you are running on a machine with a small amount of memory, increasing the number of executors per machine may increase GC/memory pressure. On a single node, since your executors share one memory and I/O system, you can end up thrashing both.
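A quick way to sanity-check the partition side of this is just to print the partition count against the parallelism you're asking for. Rough sketch only; the input path, app name, and the repartition call are placeholders, not anything from your actual job:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionCheck"))

    // placeholder input; point this at whatever file SparkKMeans is reading
    val data = sc.textFile("kmeans_data.txt")
    println("input partitions    = " + data.partitions.size)
    println("default parallelism = " + sc.defaultParallelism)

    // With only one or two partitions, extra workers just sit idle; repartitioning
    // (at the cost of a shuffle) is the usual way to actually use them.
    val spread = data.repartition(sc.defaultParallelism)
    println("after repartition   = " + spread.partitions.size)

    sc.stop()
  }
}

If the first number comes out as 1 or 2, adding workers can't help until the data is split further.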
In any case, you can't normally generalize between increased parallelism on a single node and increased parallelism across a cluster. If you are purely limited by CPU, then yes, you can normally make that generalization. However, when you increase the number of workers in a cluster, you are providing your app with more resources (memory capacity and bandwidth, and disk bandwidth). When you increase the number of tasks executing on a single node, you do not increase the pool of available resources. (There's a small sketch at the bottom of this mail, below the quoted thread, that makes that concrete.)

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Feb 21, 2015, at 4:11 PM, Deep Pradhan <pradhandeep1...@gmail.com> wrote:

> No, I just have a single node standalone cluster.
>
> I am not tweaking around with the code to increase parallelism. I am just
> running the SparkKMeans example that ships with Spark 1.0.0.
> I just wanted to know if this behavior is natural, and if so, what causes it.
>
> Thank you
>
> On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen <so...@cloudera.com> wrote:
> What's your storage like? Are you adding worker machines that are
> remote from where the data lives? I wonder if it just means you are
> spending more and more time sending the data over the network as you
> try to ship more of it to more remote workers.
>
> To answer your question: no, in general more workers means more
> parallelism and therefore faster execution. But that depends on a lot
> of things. For example, if your process isn't parallelized to use all
> available execution slots, adding more slots doesn't do anything.
>
> On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan <pradhandeep1...@gmail.com>
> wrote:
> > Yes, I am talking about a standalone single node cluster.
> >
> > No, I am not increasing parallelism. I just wanted to know if this is natural.
> > Does message passing across the workers account for what is happening?
> >
> > I am running SparkKMeans just to validate one prediction model. I am using
> > several data sets. I am in standalone mode, and I am varying the workers
> > from 1 to 16.
> >
> > On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> I can imagine a few reasons. Adding workers might cause fewer tasks to
> >> execute locally (?), so you may execute more remotely.
> >>
> >> Are you increasing parallelism? For trivial jobs, chopping them up
> >> further may cause you to pay more overhead of managing so many small
> >> tasks, for no speedup in execution time.
> >>
> >> Can you provide any more specifics, though? You haven't said what
> >> you're running, what mode, how many workers, how long it takes, etc.
> >>
> >> On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan <pradhandeep1...@gmail.com>
> >> wrote:
> >> > Hi,
> >> > I have been running some jobs on my local single-node standalone cluster.
> >> > I am varying the worker instances for the same job, and the time taken
> >> > for the job to complete increases with the number of workers. I repeated
> >> > some experiments varying the number of nodes in a cluster too, and the
> >> > same behavior is seen.
> >> > Can the idea of worker instances be extrapolated to the nodes in a
> >> > cluster?
> >> >
> >> > Thank You
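P.S. Here is the sketch I mentioned above, to illustrate the fixed resource pool on one node. The master URL and the memory/core numbers are made-up examples, not a recommendation:

import org.apache.spark.{SparkConf, SparkContext}

object FixedBudget {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("FixedBudget")
      .setMaster("spark://localhost:7077")  // assumed standalone master on the same box
      .set("spark.executor.memory", "2g")   // four 2g executors already fill an 8g machine
      .set("spark.cores.max", "4")          // caps the total cores this app can take

    val sc = new SparkContext(conf)
    // run the job here; adding more executors or worker instances on this one
    // node only re-slices the same memory and cores, it does not add capacity
    sc.stop()
  }
}

Whatever the exact numbers, once the executors on that node add up to the machine's memory and cores, a fifth worker instance can only take resources away from the others.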