What is the best heuristic for setting the number of partitions/tasks for an
RDD based on its size in memory?

The Spark docs say that the number of partitions/tasks should be 2-3x the
number of CPU cores, but this does not make sense for all data sizes.
Sometimes that number is far too high and slows the job down because of
per-task scheduling overhead.
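
For context, here is a minimal sketch of the kind of size-based heuristic I
have in mind, compared against the 2-3x-cores rule. The 128 MB target per
partition, the helper name, and the example sizes/paths are my own
illustrative assumptions, not something taken from the Spark docs:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Rough partition count from estimated in-memory size, assuming a target
// of ~128 MB per partition (an assumed value, tune to your workload).
def sizeBasedPartitions(estimatedSizeBytes: Long,
                        targetPartitionBytes: Long = 128L * 1024 * 1024): Int =
  math.max(1, math.ceil(estimatedSizeBytes.toDouble / targetPartitionBytes).toInt)

// Hypothetical usage: compare the size-based count with the 2-3x-cores rule
// and never drop below the cluster's default parallelism.
// val rdd: RDD[String] = sc.textFile("hdfs:///path/to/input")  // hypothetical path
// val fromCores = 3 * sc.defaultParallelism
// val fromSize  = sizeBasedPartitions(2L * 1024 * 1024 * 1024) // assume ~2 GB in memory
// val repartitioned = rdd.repartition(math.max(fromSize, sc.defaultParallelism))

Is something along these lines reasonable, or is there a better rule of thumb?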
