My take on the 2-3 tasks per CPU core is that you want to ensure you are
utilizing the cores to the max, which means it will help you with scaling
and performance. The question would be why not 1 task per core? The reason
is that you can probably get a good handle on the average execution time
per task but the execution time p90 + can be spiky. In which case you don't
want the long poll task (s) to slow down your entire batch (which is in
general what you would tune your application for). So by having 2-3 tasks
per CPU core, you can further break down the work to smaller chunks hence
completing tasks quicker and let the spark scheduler (which is low cost and
efficient based on my observation, it is never the bottleneck) do the work
of distributing the work among the tasks.
I have experimented with 1 task per core, 2-3 tasks per core and all the
way up to 20+ tasks per core. The performance difference was similar
between 3 tasks per core and 20+ tasks per core. But it does make a
difference in performance when you compare 1  task per core v/s 2-3 tasks
per core.

Hope this explanation makes sense.
Best,
Bharath


On Thu, Feb 9, 2017 at 2:11 PM, Ji Yan <ji...@drive.ai> wrote:

> Dear spark users,
>
> From this site https://spark.apache.org/docs/latest/tuning.html where it
> offers recommendation on setting the level of parallelism
>
> Clusters will not be fully utilized unless you set the level of
>> parallelism for each operation high enough. Spark automatically sets the
>> number of “map” tasks to run on each file according to its size (though you
>> can control it through optional parameters to SparkContext.textFile,
>> etc), and for distributed “reduce” operations, such as groupByKey and
>> reduceByKey, it uses the largest parent RDD’s number of partitions. You
>> can pass the level of parallelism as a second argument (see the
>> spark.PairRDDFunctions
>> <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions>
>>  documentation), or set the config property spark.default.parallelism to
>> change the default. *In general, we recommend 2-3 tasks per CPU core in
>> your cluster*.
>
>
> Do people have a general theory/intuition about why it is a good idea to
> have 2-3 tasks running per CPU core?
>
> Thanks
> Ji
>
> The information in this email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful.
>

Reply via email to