Also, if I take SparkPageRank as an example (org.apache.spark.examples), there are various RDDs that are created and transformed in the code. If I want to test which number of partitions gives me the best performance, I have to change the number of partitions in each run, right? Now, there are various RDDs there, so which RDD do I partition? In other words, if I partition the first RDD that is created from the data in HDFS, am I assured that the other RDDs transformed from this RDD will also be partitioned in the same way?
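For the question above, the usual answer is that you only need to partition the first pair-RDD explicitly: key-preserving transformations such as `mapValues` and joins against that RDD inherit its partitioner, while a plain `map` discards it (since it may change the keys). A minimal sketch, loosely following the shape of SparkPageRank — the input path and partition count here are illustrative, not taken from the example itself:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitioningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitioningSketch"))

    val numPartitions = 8  // the knob you would vary per run

    // Partition the first pair-RDD explicitly, then cache it.
    val links = sc.textFile("hdfs:///data/links.txt")
      .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
      .groupByKey(new HashPartitioner(numPartitions))
      .cache()

    // mapValues keeps the keys, so it keeps the partitioner too.
    val ranks = links.mapValues(_ => 1.0)

    // Some(...) -- the HashPartitioner carried over from `links`, so a
    // join of `ranks` against `links` needs no re-shuffle of either side.
    println(ranks.partitioner)

    sc.stop()
  }
}
```

So the partition count only needs to be set once, on the RDD at the head of the chain, as long as the downstream transformations are key-preserving.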
Thank You

On Sun, Feb 22, 2015 at 10:02 AM, Deep Pradhan <pradhandeep1...@gmail.com> wrote:

> >> So increasing Executors without increasing physical resources
> If I have a 16 GB RAM system and I allocate 1 GB for each executor,
> and give the number of executors as 8, then I am increasing the
> resources, right? In this case, how do you explain it?
>
> Thank You
>
> On Sun, Feb 22, 2015 at 6:12 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> Note that the parallelism (i.e., number of partitions) is just an upper
>> bound on how much of the work can be done in parallel. If you have 200
>> partitions, then you can divide the work among between 1 and 200 cores and
>> all resources will remain utilized. If you have more than 200 cores,
>> though, then some will not be used, so you would want to increase
>> parallelism further. (There are other rules of thumb -- for instance, it's
>> generally good to have at least 2x more partitions than cores for straggler
>> mitigation, but these are essentially just optimizations.)
>>
>> Further note that when you increase the number of Executors for the same
>> set of resources (i.e., starting 10 Executors on a single machine instead
>> of 1), you make Spark's job harder. Spark has to communicate in an
>> all-to-all manner across Executors for shuffle operations, and it uses TCP
>> sockets to do so whether or not the Executors happen to be on the same
>> machine. So increasing Executors without increasing physical resources
>> means Spark has to do more communication to do the same work.
>>
>> We expect that increasing the number of Executors by a factor of 10,
>> given an increase in the number of physical resources by the same factor,
>> would also improve performance by 10x. This is not always the case for the
>> precise reason above (increased communication overhead), but typically we
>> can get close.
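The executor sizing and parallelism discussed above can be driven entirely through configuration rather than job code. A sketch in Scala, using the standard property names (the values mirror the 16 GB / 8-executor scenario above and are illustrative; note that `spark.executor.instances` applies on YARN — on a standalone cluster the per-node executor count is governed by the worker configuration instead):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// 8 executors x 1 GB on a 16 GB machine still leaves the total core count
// unchanged, which is why this alone does not add compute capacity.
val conf = new SparkConf()
  .setAppName("TuningSketch")
  .set("spark.executor.memory", "1g")
  .set("spark.executor.instances", "8")      // YARN-mode setting
  // Rule of thumb from above: at least 2x more partitions than total cores,
  // e.g. 8 cores -> 16 default partitions for shuffles.
  .set("spark.default.parallelism", "16")

val sc = new SparkContext(conf)
```

Because these are plain configuration properties, they can equally be passed with `--conf` at submit time, so sweeping them across runs does not require recompiling the job.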
>> The actual observed improvement is very algorithm-dependent, though;
>> for instance, some ML algorithms become hard to scale out past a
>> certain point because the increase in communication overhead outweighs
>> the increase in parallelism.
>>
>> On Sat, Feb 21, 2015 at 8:19 AM, Deep Pradhan <pradhandeep1...@gmail.com> wrote:
>>
>>> So, if I keep the number of instances constant and increase the degree
>>> of parallelism in steps, can I expect the performance to increase?
>>>
>>> Thank You
>>>
>>> On Sat, Feb 21, 2015 at 9:07 PM, Deep Pradhan <pradhandeep1...@gmail.com> wrote:
>>>
>>>> So, with the increase in the number of worker instances, if I also
>>>> increase the degree of parallelism, will it make any difference?
>>>> I can use this model the other way round too, right? I can always
>>>> predict the deterioration in an app's performance as the number of
>>>> worker instances increases, right?
>>>>
>>>> Thank You
>>>>
>>>> On Sat, Feb 21, 2015 at 8:52 PM, Deep Pradhan <pradhandeep1...@gmail.com> wrote:
>>>>
>>>>> Yes, I have decreased the executor memory.
>>>>> But, if I have to do this, then I have to tweak the code for each
>>>>> configuration, right?
>>>>>
>>>>> On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>
>>>>>> "Workers" has a specific meaning in Spark. You are running many on one
>>>>>> machine? That's possible but not usual.
>>>>>>
>>>>>> Each worker's executors then have access to only a fraction of your
>>>>>> machine's resources. If you're not increasing parallelism, maybe you're
>>>>>> not actually using the additional workers, so you are using fewer
>>>>>> resources for your problem.
>>>>>>
>>>>>> Or, because the resulting executors are smaller, maybe you're hitting
>>>>>> GC thrashing in these executors with smaller heaps.
>>>>>>
>>>>>> Or, if you're not actually configuring the executors to use less
>>>>>> memory, maybe you're over-committing your RAM and swapping?
>>>>>>
>>>>>> Bottom line: you wouldn't use multiple workers on one small standalone
>>>>>> node. This isn't a good way to estimate performance on a distributed
>>>>>> cluster either.
>>>>>>
>>>>>> On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan <pradhandeep1...@gmail.com> wrote:
>>>>>> > No, I just have a single-node standalone cluster.
>>>>>> >
>>>>>> > I am not tweaking the code to increase parallelism. I am just
>>>>>> > running the SparkKMeans example that ships with Spark 1.0.0.
>>>>>> > I just wanted to know whether this behavior is natural, and if so,
>>>>>> > what causes it.
>>>>>> >
>>>>>> > Thank you
>>>>>> >
>>>>>> > On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>> >>
>>>>>> >> What's your storage like? Are you adding worker machines that are
>>>>>> >> remote from where the data lives? I wonder if it just means you are
>>>>>> >> spending more and more time sending the data over the network as you
>>>>>> >> try to ship more of it to more remote workers.
>>>>>> >>
>>>>>> >> To answer your question: no, in general more workers means more
>>>>>> >> parallelism and therefore faster execution. But that depends on a lot
>>>>>> >> of things. For example, if your process isn't parallelized to use all
>>>>>> >> available execution slots, adding more slots doesn't do anything.
>>>>>> >>
>>>>>> >> On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan <pradhandeep1...@gmail.com> wrote:
>>>>>> >> > Yes, I am talking about a standalone single-node cluster.
>>>>>> >> >
>>>>>> >> > No, I am not increasing parallelism. I just wanted to know if it
>>>>>> >> > is natural.
>>>>>> >> > Does message passing across the workers account for what is
>>>>>> >> > happening?
>>>>>> >> >
>>>>>> >> > I am running SparkKMeans, just to validate one prediction model.
>>>>>> >> > I am using several data sets. I have a standalone mode.
>>>>>> >> > I am varying the workers from 1 to 16.
>>>>>> >> >
>>>>>> >> > On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>> >> >>
>>>>>> >> >> I can imagine a few reasons. Adding workers might cause fewer
>>>>>> >> >> tasks to execute locally (?), so you may be executing more
>>>>>> >> >> remotely.
>>>>>> >> >>
>>>>>> >> >> Are you increasing parallelism? For trivial jobs, chopping them
>>>>>> >> >> up further may cause you to pay more overhead of managing so many
>>>>>> >> >> small tasks, for no speed-up in execution time.
>>>>>> >> >>
>>>>>> >> >> Can you provide any more specifics, though? You haven't said what
>>>>>> >> >> you're running, what mode, how many workers, how long it takes, etc.
>>>>>> >> >>
>>>>>> >> >> On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan <pradhandeep1...@gmail.com> wrote:
>>>>>> >> >> > Hi,
>>>>>> >> >> > I have been running some jobs on my local single-node standalone
>>>>>> >> >> > cluster. I am varying the worker instances for the same job, and
>>>>>> >> >> > the time taken for the job to complete increases with the number
>>>>>> >> >> > of workers. I repeated some experiments varying the number of
>>>>>> >> >> > nodes in a cluster too, and the same behavior is seen.
>>>>>> >> >> > Can the idea of worker instances be extrapolated to the nodes in
>>>>>> >> >> > a cluster?
>>>>>> >> >> >
>>>>>> >> >> > Thank You
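For reference, the experiment described above — varying worker instances from 1 to 16 on one standalone node — is driven by `conf/spark-env.sh` rather than by the job code. A sketch of the relevant settings (values illustrative; this is standalone-mode configuration only):

```
# conf/spark-env.sh -- standalone mode, single machine (illustrative values).

# Number of worker JVMs started on this node; the experiment varies this 1..16.
export SPARK_WORKER_INSTANCES=4

# Cap each worker so the total stays within the machine's physical resources;
# otherwise adding workers over-commits RAM/cores, as discussed above.
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=2g
```

Restarting the standalone cluster between runs with different `SPARK_WORKER_INSTANCES` values lets each configuration be timed without recompiling the application.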