I am trying to understand the Spark architecture for my upcoming
certification; however, there seems to be conflicting information available.

https://stackoverflow.com/questions/47782099/what-is-the-relationship-between-tasks-and-partitions

Does Spark assign a Spark task to only a single corresponding Spark
partition?

In other words, is the number of Spark tasks for a job equal to the number
of Spark partitions? (Provided, of course, there are no shuffles.)
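
For example, with something like the sketch below (a minimal example I put
together myself; the partition count of 8 and the local[4] master are
arbitrary choices of mine), I would expect the single stage to run exactly
8 tasks, one per partition:

import org.apache.spark.sql.SparkSession

object TasksPerPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tasks-per-partition")
      .master("local[4]")
      .getOrCreate()

    // Create an RDD with an explicit number of partitions.
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)
    println(s"Partitions: ${rdd.getNumPartitions}")   // 8

    // A narrow transformation followed by an action: no shuffle, so the
    // single stage should run 8 tasks, one per partition.
    val doubled = rdd.map(_ * 2)
    println(s"Count: ${doubled.count()}")

    spark.stop()
  }
}

Is that the right mental model?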

If so, a couple of follow-up questions:

1) Is this the reason we can get OOMs in Spark? Because a partition may not
fit into RAM (particularly when it's coming from an intermediate step like a
groupBy)? (See the first sketch after these questions for the kind of
scenario I have in mind.)

2) What is the purpose of spark.task.cpus? It does not seem to make sense
for more than one thread (or more than one CPU) to be working on a single
partition of data. So this number should always be 1, right? (The second
sketch after these questions is the only use I can imagine for it.)
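
For question 1, this is the kind of scenario I have in mind (a contrived
sketch of my own; the hot key, record count, and partition counts are all
numbers I made up). My understanding is that after the shuffle every value
for the hot key lands in one post-shuffle partition, and groupByKey has to
materialize all of them at once:

import org.apache.spark.sql.SparkSession

object SkewedGroupBy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skewed-groupby")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    // 10 million (key, value) records, almost all of which share one key.
    val skewed = sc.parallelize(1 to 10000000, numSlices = 100).map { i =>
      val key = if (i % 100 == 0) s"rare-${(i / 100) % 10}" else "hot-key"
      (key, i)
    }

    // After the shuffle, every value for "hot-key" is pulled into the same
    // post-shuffle partition; groupByKey materializes all of them together,
    // which is where a single oversized partition could run out of memory.
    val grouped = skewed.groupByKey(numPartitions = 10)
    grouped.mapValues(_.size).collect().foreach(println)

    spark.stop()
  }
}

Is that an accurate picture of where the OOM comes from?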
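
For question 2, the only use I can imagine is a task that itself spawns
extra threads, something like the sketch below (again my own made-up
example; the value 2, the thread pool, and the mapPartitions body are all
assumptions on my part):

import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.spark.sql.SparkSession

object TaskCpusSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("task-cpus-sketch")
      .master("local[4]")
      // Tell the scheduler that every task will occupy 2 cores.
      .config("spark.task.cpus", "2")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 4)

    // Each task still owns exactly one partition, but processes it with its
    // own 2-thread pool, so reserving 2 cores per task matches reality.
    val sums = rdd.mapPartitions { iter =>
      val pool = Executors.newFixedThreadPool(2)
      implicit val ec: ExecutionContext =
        ExecutionContext.fromExecutorService(pool)
      val data = iter.toVector
      val (left, right) = data.splitAt(data.length / 2)
      val partials = Future.sequence(
        Seq(left, right).map(chunk => Future(chunk.map(_.toLong).sum)))
      val total = Await.result(partials, Duration.Inf).sum
      pool.shutdown()
      Iterator(total)
    }

    println(sums.collect().sum)   // 500500

    spark.stop()
  }
}

Is that the intended use of spark.task.cpus, or should this value always
stay at 1?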

Need some help. Thanks.

-- 
Regards,
Sreyan Chakravarty
