Dear Spark users,

From this site, which offers recommendations on setting the level of
parallelism:

> Clusters will not be fully utilized unless you set the level of parallelism
> for each operation high enough. Spark automatically sets the number of
> “map” tasks to run on each file according to its size (though you can
> control it through optional parameters to SparkContext.textFile, etc),
> and for distributed “reduce” operations, such as groupByKey and
> reduceByKey, it uses the largest parent RDD’s number of partitions. You
> can pass the level of parallelism as a second argument (see the
> spark.PairRDDFunctions documentation),
> or set the config property spark.default.parallelism to change the
> default. *In general, we recommend 2-3 tasks per CPU core in your cluster.*

Do people have a general theory/intuition about why it is a good idea to
have 2-3 tasks running per CPU core?
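
To make the question concrete, here is a minimal sketch of how I read the recommendation. The helper name suggested_partitions and the cluster size (4 workers with 8 cores each) are made up for illustration, and the Spark calls appear only in comments because I have not run them against a real cluster.

```python
def suggested_partitions(total_cores, tasks_per_core=2):
    """Partition count implied by the 'N tasks per CPU core' rule of thumb.

    Hypothetical helper, not part of Spark.
    """
    if total_cores < 1 or tasks_per_core < 1:
        raise ValueError("total_cores and tasks_per_core must be >= 1")
    return total_cores * tasks_per_core


# Assumed cluster size, for illustration only: 4 workers x 8 cores.
total_cores = 4 * 8

low = suggested_partitions(total_cores, 2)   # 2 tasks per core
high = suggested_partitions(total_cores, 3)  # 3 tasks per core
print(f"{total_cores} cores -> between {low} and {high} partitions")

# With PySpark, my understanding is that the value would be passed either
# per operation, as the second argument mentioned in the quoted text:
#     counts = pairs.reduceByKey(lambda a, b: a + b, numPartitions=low)
# or as the default, via the config property named in the quoted text:
#     conf = SparkConf().set("spark.default.parallelism", str(low))
```

So for this made-up cluster the guideline would suggest somewhere between 64 and 96 partitions. What I am missing is the reasoning behind choosing 2-3 rather than exactly 1 task per core.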



