IMHO, you need to take the size of your data into account too. If your cluster is relatively small, you may cause memory pressure on your executors by repartitioning down to some number of partitions tied to the core count.
Better to take the max of the initial number of partitions (assuming your data is on HDFS with a 64 MB block size) and the number you get from your formula.

On 29 July 2015 at 12:31, ponkin <alexey.pon...@ya.ru> wrote:
> Hi Rahul,
>
> Where did you see such a recommendation?
> I personally define partitions with the following formula:
>
>   partitions = nextPrimeNumberAbove( K * (--num-executors * --executor-cores) )
>
> where
>   nextPrimeNumberAbove(x) - the prime number that is greater than x
>   K - a multiplier; start with 1 and increase it until join
>   performance starts to degrade
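
Putting both suggestions together, here is a minimal Scala sketch. The
helper names (nextPrimeNumberAbove, choosePartitions, formulaPartitions)
are mine for illustration, not Spark API; numExecutors and executorCores
are the values you pass as --num-executors and --executor-cores.

  object PartitionSizing {

    // Smallest prime strictly greater than x.
    def nextPrimeNumberAbove(x: Int): Int = {
      def isPrime(n: Int): Boolean =
        n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)
      Iterator.from(x + 1).find(isPrime).get
    }

    // ponkin's formula: nextPrimeNumberAbove(K * executors * cores).
    // Start with k = 1 and increase it until join performance degrades.
    def formulaPartitions(numExecutors: Int, executorCores: Int, k: Int): Int =
      nextPrimeNumberAbove(k * numExecutors * executorCores)

    // The caveat above: never drop below the initial partition count
    // (for HDFS input that is roughly dataSizeBytes / blockSize, with
    // the classic 64 MB block size), so partitions stay small enough
    // to avoid memory pressure on the executors.
    def choosePartitions(initialPartitions: Int,
                         numExecutors: Int,
                         executorCores: Int,
                         k: Int = 1): Int =
      math.max(initialPartitions,
               formulaPartitions(numExecutors, executorCores, k))
  }

For example, with 10 executors x 4 cores and K = 1 the formula gives
nextPrimeNumberAbove(40) = 41, but 20 GB of input on 64 MB blocks starts
out as roughly 320 partitions, so choosePartitions would keep 320.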