Also, if I take SparkPageRank as an example (org.apache.spark.examples),
various RDDs are created and transformed in that code. If I want to vary
the number of partitions to find the optimum number that gives the best
performance, I have to change the partition count on each run, right?
But there are several RDDs involved, so which RDD do I partition? In
other words, if I partition the first RDD, the one created from the data
in HDFS, am I assured that the RDDs transformed from it will be
partitioned in the same way?
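
For instance, here is a minimal sketch of how partition counts propagate
through transformations (the HDFS path and the counts are placeholders,
not taken from the actual example):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair RDD ops on older Spark

    val sc = new SparkContext(new SparkConf().setAppName("PartitionCheck"))

    // Read from HDFS with an explicit minimum partition count.
    val lines = sc.textFile("hdfs:///path/to/input", minPartitions = 64)

    // Narrow transformations (map, filter, flatMap) keep the parent
    // RDD's partition count.
    val pairs = lines.map(line => (line.split("\\s+")(0), 1))
    println(lines.partitions.length == pairs.partitions.length) // true

    // Shuffle transformations take their own numPartitions argument;
    // if omitted, it falls back to the parent's partition count (or to
    // spark.default.parallelism, when that is set).
    val counts = pairs.reduceByKey(_ + _, 128)
    println(counts.partitions.length) // 128

    sc.stop()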

Thank You

On Sun, Feb 22, 2015 at 10:02 AM, Deep Pradhan <pradhandeep1...@gmail.com>
wrote:

> >> So increasing Executors without increasing physical resources
> If I have a 16 GB RAM system and I allocate 1 GB to each executor,
> setting the number of executors to 8, then I am increasing the
> resources, right? How do you explain this case?
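>
> For concreteness, a sketch of that configuration (numbers as above;
> everything else is illustrative):
>
>     import org.apache.spark.SparkConf
>
>     val conf = new SparkConf()
>       .setAppName("ExecutorSizing")
>       .set("spark.executor.memory", "1g") // heap per executor
>
>     // On a standalone node, multiple executors per machine come from
>     // running multiple workers (SPARK_WORKER_INSTANCES=8 in
>     // spark-env.sh). 8 x 1 GB commits more of the 16 GB of RAM, but
>     // the CPUs and disks underneath stay the same.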
>
> Thank You
>
> On Sun, Feb 22, 2015 at 6:12 AM, Aaron Davidson <ilike...@gmail.com>
> wrote:
>
>> Note that the parallelism (i.e., number of partitions) is just an upper
>> bound on how much of the work can be done in parallel. If you have 200
>> partitions, then you can divide the work among anywhere from 1 to 200
>> cores and all resources will remain utilized. If you have more than 200
>> cores, though, then some will go unused, so you would want to increase
>> parallelism further. (There are other rules-of-thumb -- for instance, it's
>> generally good to have at least 2x more partitions than cores for straggler
>> mitigation, but these are essentially just optimizations.)
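>>
>> A rough sketch of that rule of thumb (the core count is an assumption,
>> not known from this thread):
>>
>>     import org.apache.spark.{SparkConf, SparkContext}
>>
>>     val sc = new SparkContext(new SparkConf().setAppName("Sizing"))
>>     val totalCores = 8 // cores available to the job (assumed)
>>     // ~2x partitions per core for straggler mitigation:
>>     val data = sc.textFile("hdfs:///path/to/input")
>>       .repartition(2 * totalCores)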
>>
>> Further note that when you increase the number of Executors for the same
>> set of resources (i.e., starting 10 Executors on a single machine instead
>> of 1), you make Spark's job harder. Spark has to communicate in an
>> all-to-all manner across Executors for shuffle operations, and it uses TCP
>> sockets to do so whether or not the Executors happen to be on the same
>> machine. So increasing Executors without increasing physical resources
>> means Spark has to do more communication to do the same work.
>>
>> We expect that increasing the number of Executors by a factor of 10,
>> given an increase in the number of physical resources by the same factor,
>> would also improve performance by 10x. This is not always the case for the
>> precise reason above (increased communication overhead), but typically we
>> can get close. The actual observed improvement is very algorithm-dependent,
>> though; for instance, some ML algorithms become hard to scale out past a
>> certain point because the increase in communication overhead outweighs the
>> increase in parallelism.
>>
>> On Sat, Feb 21, 2015 at 8:19 AM, Deep Pradhan <pradhandeep1...@gmail.com>
>> wrote:
>>
>>> So, if I keep the number of instances constant and increase the degree
>>> of parallelism in steps, can I expect the performance to increase?
>>>
>>> Thank You
>>>
>>> On Sat, Feb 21, 2015 at 9:07 PM, Deep Pradhan <pradhandeep1...@gmail.com
>>> > wrote:
>>>
>>>> So, with the increase in the number of worker instances, if I also
>>>> increase the degree of parallelism, will it make any difference?
>>>> I can also use this model the other way round, right? I can always
>>>> predict how an app's performance will deteriorate as the number of
>>>> worker instances increases, right?
>>>>
>>>> Thank You
>>>>
>>>> On Sat, Feb 21, 2015 at 8:52 PM, Deep Pradhan <
>>>> pradhandeep1...@gmail.com> wrote:
>>>>
>>>>> Yes, I have decreased the executor memory.
>>>>> But if I have to do this, then I have to tweak the code for each
>>>>> configuration, right?
>>>>>
>>>>> On Sat, Feb 21, 2015 at 8:47 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>
>>>>>> "Workers" has a specific meaning in Spark. You are running many on one
>>>>>> machine? that's possible but not usual.
>>>>>>
>>>>>> Each worker's executors then have access to only a fraction of your
>>>>>> machine's resources. If you're not increasing parallelism, maybe
>>>>>> you're not actually using the additional workers, and so are using
>>>>>> fewer resources for your problem.
>>>>>>
>>>>>> Or because the resulting executors are smaller, maybe you're hitting
>>>>>> GC thrashing in these executors with smaller heaps.
>>>>>>
>>>>>> Or if you're not actually configuring the executors to use less
>>>>>> memory, maybe you're over-committing your RAM and swapping?
>>>>>>
>>>>>> Bottom line, you wouldn't use multiple workers on one small standalone
>>>>>> node. This isn't a good way to estimate performance on a distributed
>>>>>> cluster either.
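>>>>>>
>>>>>> A sketch of the sizing arithmetic (all numbers illustrative): the
>>>>>> executor heaps on a node must fit in physical RAM with headroom for
>>>>>> the OS, or the machine swaps.
>>>>>>
>>>>>>     val workerInstances = 4 // SPARK_WORKER_INSTANCES on this node
>>>>>>     val ramGb = 16          // physical RAM on the machine
>>>>>>     val headroomGb = 2      // left for the OS and daemons
>>>>>>     val perExecutorGb = (ramGb - headroomGb) / workerInstances // 3
>>>>>>     // then set spark.executor.memory to s"${perExecutorGb}g"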
>>>>>>
>>>>>> On Sat, Feb 21, 2015 at 3:11 PM, Deep Pradhan <
>>>>>> pradhandeep1...@gmail.com> wrote:
>>>>>> > No, I just have a single-node standalone cluster.
>>>>>> >
>>>>>> > I am not tweaking the code to increase parallelism. I am just
>>>>>> > running the SparkKMeans example that ships with Spark 1.0.0.
>>>>>> > I just wanted to know if this behavior is natural, and if so, what
>>>>>> > causes it?
>>>>>> >
>>>>>> > Thank you
>>>>>> >
>>>>>> > On Sat, Feb 21, 2015 at 8:32 PM, Sean Owen <so...@cloudera.com>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> What's your storage like? Are you adding worker machines that are
>>>>>> >> remote from where the data lives? I wonder if it just means you
>>>>>> >> are spending more and more time sending the data over the network
>>>>>> >> as you try to ship more of it to more remote workers.
>>>>>> >>
>>>>>> >> To answer your question: no, in general more workers mean more
>>>>>> >> parallelism and therefore faster execution. But that depends on a
>>>>>> >> lot of things. For example, if your process isn't parallelized to
>>>>>> >> use all available execution slots, adding more slots doesn't do
>>>>>> >> anything.
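>>>>>> >>
>>>>>> >> For example (a sketch, assuming the spark-shell's built-in sc):
>>>>>> >> an RDD with only 2 partitions can occupy at most 2 execution
>>>>>> >> slots, however many workers you add:
>>>>>> >>
>>>>>> >>     val tiny = sc.parallelize(1 to 1000000, 2) // 2 partitions
>>>>>> >>     println(tiny.partitions.length) // 2; extra slots stay idle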
>>>>>> >>
>>>>>> >> On Sat, Feb 21, 2015 at 2:51 PM, Deep Pradhan <
>>>>>> pradhandeep1...@gmail.com>
>>>>>> >> wrote:
>>>>>> >> > Yes, I am talking about a standalone single-node cluster.
>>>>>> >> >
>>>>>> >> > No, I am not increasing parallelism. I just wanted to know if
>>>>>> >> > this is natural. Does message passing across the workers account
>>>>>> >> > for what is happening?
>>>>>> >> >
>>>>>> >> > I am running SparkKMeans, just to validate a prediction model. I
>>>>>> >> > am using several data sets, in standalone mode, varying the
>>>>>> >> > workers from 1 to 16.
>>>>>> >> >
>>>>>> >> > On Sat, Feb 21, 2015 at 8:14 PM, Sean Owen <so...@cloudera.com>
>>>>>> wrote:
>>>>>> >> >>
>>>>>> >> >> I can imagine a few reasons. Adding workers might cause fewer
>>>>>> >> >> tasks to execute locally (?), so you may be executing more of
>>>>>> >> >> them remotely.
>>>>>> >> >>
>>>>>> >> >> Are you increasing parallelism? For trivial jobs, chopping them
>>>>>> >> >> up further may cost you more overhead in managing many small
>>>>>> >> >> tasks, for no speed-up in execution time.
>>>>>> >> >>
>>>>>> >> >> Can you provide any more specifics, though? You haven't said
>>>>>> >> >> what you're running, in what mode, with how many workers, how
>>>>>> >> >> long it takes, etc.
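>>>>>> >> >>
>>>>>> >> >> A tiny illustration of that overhead (a sketch, assuming the
>>>>>> >> >> spark-shell's built-in sc): 10,000 tasks for 100 elements means
>>>>>> >> >> scheduling cost dominates the actual work.
>>>>>> >> >>
>>>>>> >> >>     val overSplit = sc.parallelize(1 to 100, 10000)
>>>>>> >> >>     overSplit.count() // launches 10,000 mostly-empty tasks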
>>>>>> >> >>
>>>>>> >> >> On Sat, Feb 21, 2015 at 2:37 PM, Deep Pradhan
>>>>>> >> >> <pradhandeep1...@gmail.com>
>>>>>> >> >> wrote:
>>>>>> >> >> > Hi,
>>>>>> >> >> > I have been running some jobs on my local single-node
>>>>>> >> >> > standalone cluster. I am varying the worker instances for the
>>>>>> >> >> > same job, and the time taken for the job to complete
>>>>>> >> >> > increases as the number of workers increases. I repeated some
>>>>>> >> >> > experiments varying the number of nodes in a cluster too, and
>>>>>> >> >> > the same behavior is seen.
>>>>>> >> >> > Can the idea of worker instances be extrapolated to the nodes
>>>>>> >> >> > in a cluster?
>>>>>> >> >> >
>>>>>> >> >> > Thank You
>>>>>> >> >
>>>>>> >> >
>>>>>> >
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
