What is the best heuristic for setting the number of partitions/tasks for an RDD based on the size of the RDD in memory?
The Spark docs say that the number of partitions/tasks should be 2-3x the number of CPU cores, but that guidance does not make sense for all data sizes. For small datasets this number is often far too high and slows the job down because of per-task overhead.
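For context, here is a minimal sketch of the two approaches I am weighing: the cores-based rule of thumb from the docs versus a size-based heuristic targeting a fixed number of bytes per partition. The input path, the 2 GB estimated size, and the 128 MB-per-partition target are made-up values for illustration, not a recommendation:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionHeuristic {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-heuristic").setMaster("local[*]"))

    // Docs' rule of thumb: 2-3x the cores available to the job.
    val partitionsByCores = sc.defaultParallelism * 3

    // Size-based alternative (assumed numbers): aim for ~128 MB per partition
    // given an estimated in-memory size of ~2 GB.
    val estimatedSizeBytes   = 2L * 1024 * 1024 * 1024
    val targetPartitionBytes = 128L * 1024 * 1024
    val partitionsBySize =
      math.max(1, (estimatedSizeBytes / targetPartitionBytes).toInt)

    // Hypothetical input path, just to show where the partition count is applied.
    val rdd = sc.textFile("hdfs:///data/input", minPartitions = partitionsBySize)

    println(s"cores-based: $partitionsByCores, " +
            s"size-based: $partitionsBySize, actual: ${rdd.getNumPartitions}")

    sc.stop()
  }
}
```

The question is essentially: is there a principled way to pick `targetPartitionBytes` (or otherwise derive the partition count from the RDD's in-memory size), rather than always multiplying the core count?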