Dear spark users,

From this site https://spark.apache.org/docs/latest/tuning.html, which
offers a recommendation on setting the level of parallelism:

> Clusters will not be fully utilized unless you set the level of parallelism
> for each operation high enough. Spark automatically sets the number of
> “map” tasks to run on each file according to its size (though you can
> control it through optional parameters to SparkContext.textFile, etc),
> and for distributed “reduce” operations, such as groupByKey and
> reduceByKey, it uses the largest parent RDD’s number of partitions. You
> can pass the level of parallelism as a second argument (see the
> spark.PairRDDFunctions
> <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions>
> documentation), or set the config property spark.default.parallelism to
> change the default. *In general, we recommend 2-3 tasks per CPU core in
> your cluster.*
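For concreteness, here is a minimal sketch (in Scala) of the two knobs the
docs mention, assuming a hypothetical cluster with 16 cores total (so
roughly 32-48 partitions at 2-3 tasks per core) and a placeholder HDFS
path:

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelismSketch {
      def main(args: Array[String]): Unit = {
        // Cluster-wide default used by shuffle operations when no
        // explicit partition count is given.
        val conf = new SparkConf()
          .setAppName("parallelism-sketch")
          .set("spark.default.parallelism", "48")

        val sc = new SparkContext(conf)

        // Placeholder input path; textFile also accepts an optional
        // minPartitions argument for the "map" side.
        val words = sc.textFile("hdfs:///tmp/words.txt")
          .flatMap(_.split("\\s+"))
          .map(w => (w, 1))

        // Per-operation override: pass the number of partitions as the
        // second argument to reduceByKey (see PairRDDFunctions).
        val counts = words.reduceByKey(_ + _, 48)

        counts.take(10).foreach(println)
        sc.stop()
      }
    }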


Does anyone have a general theory or intuition about why it is a good idea
to have 2-3 tasks running per CPU core?

Thanks
Ji

