Your solution is an easy, pragmatic one; however, there are many factors 
involved, and it's not guaranteed to work for other data sets.

It depends on:

  *   The distribution of data across the keys. Random session ids will naturally 
distribute better than “customer names” - you can mitigate this with a custom 
partitioner if the downstream algorithm does not depend on how the keys are 
grouped (see the first sketch after this list)
     *   E.g. see the optimization used in PageRank implementations that groups 
pages from the same domain together, on the premise that most links point to 
pages within the same domain
     *   This is ultimately a hard problem to solve generically …
  *   The number of partitions – the rule of thumb here is to use 2-3 times the 
number of cores, on the following premises:
     *   More granular tasks will use the cores more efficiently from a 
scheduling perspective (fewer idle spots)
     *   Also, each task will work with less data, avoiding OOM problems when 
the data for a single task is larger than the executor's working memory
     *   Through experimentation you will find the sweet spot; once you get into 
hundreds of partitions you may start to incur scheduling overhead… (see the 
configuration sketch after this list)
  *   Data locality and shuffling + the spark.locality.wait settings
     *   The scheduler might decide to take advantage of free cores elsewhere in 
the cluster and schedule a task off-node; you can control how long it waits for 
a local slot through the spark.locality.wait settings
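
To illustrate the key-distribution point, here is a minimal sketch of a custom 
partitioner in the spirit of the PageRank domain-grouping trick (the class name, 
the URL-shaped keys and the domainOf helper are illustrative assumptions, not 
code from any particular implementation):

import org.apache.spark.Partitioner

// Illustrative partitioner: co-locate records whose keys share a domain,
// assuming the keys are URL strings.
class DomainPartitioner(override val numPartitions: Int) extends Partitioner {

  // Crude domain extraction, just for the sake of the example.
  private def domainOf(url: String): String =
    url.stripPrefix("https://").stripPrefix("http://").takeWhile(_ != '/')

  override def getPartition(key: Any): Int = key match {
    case url: String =>
      val h = domainOf(url).hashCode % numPartitions
      if (h < 0) h + numPartitions else h   // hashCode can be negative
    case _ => 0
  }
}

// Usage sketch: co-partition a (url, outLinks) pair RDD before iterative joins,
// so that most link traffic stays inside a partition.
// val links = rawLinks.partitionBy(new DomainPartitioner(48)).cache()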
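
For the partition-count and locality points, a configuration sketch under the 
assumptions of this thread (16 workers x 4 cores = 64 cores; the values are 
illustrative starting points to experiment with, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// Rule of thumb: with 64 cores, aim for roughly 2-3x that many partitions.
val conf = new SparkConf()
  .setAppName("graphx-job")                              // illustrative name
  .set("spark.default.parallelism", (64 * 3).toString)
  // How long the scheduler waits for a data-local slot before falling back to
  // a less local one; raise it to favour locality over free off-node cores.
  .set("spark.locality.wait", "3s")

val sc = new SparkContext(conf)

// The partition count can also be set explicitly when loading or reshuffling:
// val edges = sc.textFile("hdfs://.../edges.txt", minPartitions = 192)
// val balanced = someRdd.repartition(192)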

Hope this helps,
-adrian

From: Khaled Ammar
Date: Wednesday, November 4, 2015 at 4:03 PM
To: Adrian Tanase
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: Why some executors are lazy?

Thank you Adrian,

The dataset is indeed skewed. My concern was that some executors do not 
participate in the computation at all. I understand that executors finish tasks 
sequentially; therefore, using more executors allows for better parallelism.

I managed to force all executors to participate by increasing the number of 
partitions. My guess is that the scheduler preferred to reduce the number of 
machines participating in the computation to decrease network overhead.

Do you think my analysis is correct? How should one decide on the number of 
partitions? Does it depend on the workload, the dataset, or both?

Thanks,
-Khaled


On Wed, Nov 4, 2015 at 7:21 AM, Adrian Tanase 
<atan...@adobe.com> wrote:
If some of the required operations involve shuffling and partitioning, it might 
mean that the data set is skewed towards specific partitions, which will create 
hot spots on certain executors.
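
A quick, hedged way to confirm that kind of skew is to count the records per 
partition; a minimal sketch, where input stands in for whatever RDD the job 
actually builds:

// Count the records held by each partition and print the largest ones.
val partitionSizes = input
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()

partitionSizes.sortBy(p => -p._2).take(10).foreach {
  case (idx, n) => println(s"partition $idx holds $n records")
}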

-adrian

From: Khaled Ammar
Date: Tuesday, November 3, 2015 at 11:43 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Why some executors are lazy?

Hi,

I'm using the most recent Spark version on a standalone setup of 16+1 machines.

While running GraphX workloads, I found that some executors are lazy: they 
*rarely* participate in the computation, so other executors end up doing their 
work. This behavior is consistent across all iterations and even in the data 
loading step. Only two specific executors do not participate in most 
computations.


Does anyone know how to fix that?


More details:
Each machine has 4 cores. I set the number of partitions to 3*16. Each executor 
was supposed to run 3 tasks, but a few of them end up working on 4 tasks 
instead, which delays the computation.



--
Thanks,
-Khaled



