Hello Rajat,

Look up the Spark *Pipelining* concept: any sequence of operations that
feed data directly into each other without the need for shuffling will be
packed into a single stage, e.g. select -> filter -> select (Spark SQL) or
map -> filter -> map (RDD). For any operation that requires a shuffle
(sort, group, reduce), a new stage is created after the shuffle. Before the
new stage starts, the shuffle files are persisted to the local disk of the
executors (referred to as *Shuffle Persistence*) and are then read by the
group / reduce tasks. This improves fault tolerance in the sense that a
group task can be relaunched upon failure, e.g. when there aren't enough
executors available, and simply re-read the persisted shuffle files.
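To make this concrete, here is a minimal sketch in Scala (the SparkSession
setup, table size and column names are my own illustration, not from the
book): the narrow select/filter chain is pipelined into one stage, and the
groupBy introduces a shuffle boundary and therefore a new stage.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  val spark = SparkSession.builder()
    .appName("pipelining-demo")
    .master("local[*]")
    .getOrCreate()

  val df = spark.range(1000000).withColumn("key", col("id") % 100)

  // select -> filter -> select: all narrow transformations,
  // pipelined together into a single stage.
  val narrow = df.select("id", "key")
    .filter(col("id") > 10)
    .select(col("key"), (col("id") * 2).as("doubled"))

  // groupBy forces a shuffle, so a new stage is created after the exchange;
  // the shuffle files written by the first stage feed the aggregation tasks.
  val grouped = narrow.groupBy("key").agg(sum("doubled").as("total"))

  grouped.explain()  // the plan shows an Exchange (shuffle) between stages
  grouped.show()     // run it, then check the Spark UI for the stage split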

Regarding the number of partitions, look up the
*spark.sql.shuffle.partitions* parameter. It sets the default number of
partitions output/created by a shuffle operation; the default is 200.
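A quick illustration (same assumed local SparkSession as above):

  // Override the default of 200 shuffle output partitions.
  spark.conf.set("spark.sql.shuffle.partitions", "64")

  val df2 = spark.range(1000000).withColumn("key", col("id") % 100)
  val agg = df2.groupBy("key").count()

  // The stage after the shuffle now runs with 64 tasks/partitions
  // (with AQE enabled in Spark 3+, the post-shuffle partitions may be
  // coalesced further, so the observed number can be smaller).
  println(agg.rdd.getNumPartitions)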


Reference:
Spark: The Definitive Guide, Bill Chambers & Matei Zaharia




On Mon, Aug 29, 2022 at 3:36 PM rajat kumar <kumar.rajat20...@gmail.com>
wrote:

> Hello Members,
>
> I have a query for spark stages:-
>
> why every stage has a different number of tasks/partitions in spark. Or
> how is it determined?
>
> Moreover, where can i see the improvements done in spark3+
>
>
> Thanks in advance
> Rajat
>
