Hello Rajat,

Look up Spark's *pipelining* concept: any sequence of operations that feed data directly into each other without needing a shuffle is packed into a single stage, e.g. select -> filter -> select (Spark SQL) or map -> filter -> map (RDD). Any operation that requires a shuffle (sort, group, reduce) causes a new stage to be created after the shuffle. Before the next stage runs, the shuffle files are persisted to the executors' local disks (referred to as *shuffle persistence*) and are then read by the group/reduce tasks. This gives fault tolerance in the sense that a failed group/reduce task can be relaunched and re-read those files (for example when there aren't enough executors) without re-running the earlier stage.
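A rough sketch of the idea (Scala; the DataFrame, column names and data are made up purely for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("stages-demo").getOrCreate()
    import spark.implicits._

    val events = Seq(("IN", 10), ("US", 20), ("IN", 5)).toDF("country", "amount")

    // select -> filter -> select: all narrow, no shuffle, so Spark pipelines
    // the three operations into a single stage.
    val narrowOnly = events
      .select($"country", $"amount")
      .filter($"amount" > 5)
      .select(upper($"country").as("country"))

    // groupBy triggers a shuffle: map-side tasks write shuffle files to local
    // disk, and the reduce-side tasks of the *next* stage read them back.
    val grouped = events.groupBy($"country").agg(sum($"amount").as("total"))
    grouped.explain()  // the physical plan shows an Exchange (shuffle) boundary

    // Same idea with the RDD API: map -> filter -> map stays in one stage,
    // reduceByKey introduces a shuffle and hence a new stage.
    val rdd = spark.sparkContext.parallelize(1 to 100)
    val pipelined = rdd.map(_ * 2).filter(_ % 3 == 0).map(_ + 1)
    val reduced = pipelined.map(x => (x % 10, x)).reduceByKey(_ + _)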
Regarding the number of partitions: there is a *spark.sql.shuffle.partitions* parameter that sets the default number of partitions created by a shuffle operation; the default is 200 (see the short snippet at the end of this message).

Reference: Spark: The Definitive Guide; Bill Chambers & Matei Zaharia

On Mon, Aug 29, 2022 at 3:36 PM rajat kumar <kumar.rajat20...@gmail.com> wrote:
> Hello Members,
>
> I have a query for spark stages:-
>
> why every stage has a different number of tasks/partitions in spark. Or
> how is it determined?
>
> Moreover, where can i see the improvements done in spark3+
>
>
> Thanks in advance
> Rajat
>
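As mentioned above, here's a small illustrative snippet of the *spark.sql.shuffle.partitions* setting (Scala; the value 50 and the sample data are arbitrary). Note that with adaptive query execution enabled (on by default since Spark 3.2), the post-shuffle partition count may be coalesced below the configured value:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shuffle-partitions-demo")
      .config("spark.sql.shuffle.partitions", "50")  // default is 200
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    // The aggregation shuffles, so the post-shuffle stage is planned with 50
    // tasks/partitions (AQE may coalesce small partitions at runtime).
    val agg = df.groupBy("key").count()
    println(agg.rdd.getNumPartitions)

    // The setting can also be changed at runtime:
    spark.conf.set("spark.sql.shuffle.partitions", "100")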