Understanding about joins in spark

Sid Mon, 27 Jun 2022 01:20:59 -0700

Hi Team,

As per my understanding, assume it to be a large dataset. When we apply
joins, data from different executors are shuffled in such a way that the
same "keys" are landed in one partition.


So, this is done for both the dataframes, right?  For eg: Key A for df1
will be sorted and kept in one partition and Key A for df2 will be sorted
and kept in another partition and then it will be compared and merged?

I know that for shuffle hash join keys for both data frames are merged
under a single partition since the smaller data is copied on each and every
executor.

Also, where would be the join operation performed? on another worker node
or it is performed on the driver side?

Somebody, please help me to understand this by correcting me w.r.t my
points or just adding an explanation to it.

TIA,
Sid

Understanding about joins in spark

Reply via email to