I am trying to understand the Spark architecture for my upcoming certification, but there seems to be conflicting information available.
https://stackoverflow.com/questions/47782099/what-is-the-relationship-between-tasks-and-partitions

Does Spark assign a Spark task to only a single corresponding Spark partition? In other words, is the number of Spark tasks for a job equal to the number of Spark partitions (provided, of course, there are no shuffles)?

If so, two follow-up questions:

1) Is this the reason we can get OOMs in Spark: because a partition (for example, one produced by an intermediate step like a groupBy) may not fit into RAM?

2) What is the purpose of spark.task.cpus? It does not make sense for more than one thread (or more than one CPU) to work on a single partition of data, so shouldn't this value always be 1?

Need some help. Thanks.

--
Regards,
Sreyan Chakravarty
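
P.S. For context, this is the minimal sketch I have been using to reason about question 1 (the local master, the partition count of 8, and the app name are just assumptions on my part, not anything from the docs):

import org.apache.spark.sql.SparkSession

object PartitionTaskSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-task-sketch")
      .master("local[4]")               // assumption: 4 local cores
      .config("spark.task.cpus", "1")   // the documented default; a larger value reserves that many cores per task
      .getOrCreate()

    // An RDD with an explicit number of partitions (8 is arbitrary).
    val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)
    println(s"partitions = ${rdd.getNumPartitions}")

    // If the mapping really is one task per partition, this single-stage
    // action should launch 8 tasks (checkable in the Spark UI).
    println(s"sum = ${rdd.map(_.toLong).reduce(_ + _)}")

    spark.stop()
  }
}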