Dear Spark users,

From the tuning guide at https://spark.apache.org/docs/latest/tuning.html, which offers a recommendation on setting the level of parallelism:
> Clusters will not be fully utilized unless you set the level of
> parallelism for each operation high enough. Spark automatically sets the
> number of “map” tasks to run on each file according to its size (though
> you can control it through optional parameters to SparkContext.textFile,
> etc.), and for distributed “reduce” operations, such as groupByKey and
> reduceByKey, it uses the largest parent RDD’s number of partitions. You
> can pass the level of parallelism as a second argument (see the
> spark.PairRDDFunctions documentation
> <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions>),
> or set the config property spark.default.parallelism to change the
> default. *In general, we recommend 2-3 tasks per CPU core in your
> cluster.*

Do people have a general theory or intuition about why it is a good idea to have 2-3 tasks running per CPU core?

Thanks,
Ji
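For what it's worth, one common (though not authoritative) intuition is load balancing: if tasks have uneven durations and there is exactly one task per core, the whole stage waits on the slowest task, whereas several smaller tasks per core let fast cores pick up extra work. The toy simulation below (plain Python, with a hypothetical skew model and made-up numbers, not Spark's actual scheduler) sketches this: the same total work is split into 1x vs. 3x tasks per core, with one oversized "straggler" task, and a greedy scheduler assigns each task to the earliest-free core.

```python
import heapq

def makespan(task_durations, num_cores):
    """Greedy scheduler: whenever a core frees up, it takes the next
    task. Returns the finish time of the slowest core (the makespan)."""
    cores = [0.0] * num_cores          # each entry: time the core frees up
    heapq.heapify(cores)
    for d in task_durations:
        t = heapq.heappop(cores)       # earliest-free core takes the task
        heapq.heappush(cores, t + d)
    return max(cores)

def split_work(total_work, num_tasks, skew=3.0):
    """Split a fixed amount of work into num_tasks tasks, with one task
    `skew` times larger than the rest (a toy model of data skew)."""
    even = total_work / (num_tasks - 1 + skew)
    return [even * skew] + [even] * (num_tasks - 1)

cores = 8
work = 96.0
coarse = makespan(split_work(work, cores), cores)       # 1 task per core
fine = makespan(split_work(work, 3 * cores), cores)     # 3 tasks per core
print(f"1 task/core:  makespan = {coarse:.1f}")
print(f"3 tasks/core: makespan = {fine:.1f}")
```

With finer partitioning the straggler task itself shrinks and the remaining work spreads evenly over the cores, so the makespan drops well below the single-task-per-core case. The trade-off is per-task scheduling overhead, which is presumably why the guide caps the recommendation at 2-3 tasks per core rather than "as many as possible."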