Hi Team, As per my understanding, assume it to be a large dataset. When we apply joins, data from different executors are shuffled in such a way that the same "keys" are landed in one partition.
So, this is done for both the dataframes, right? For eg: Key A for df1 will be sorted and kept in one partition and Key A for df2 will be sorted and kept in another partition and then it will be compared and merged? I know that for shuffle hash join keys for both data frames are merged under a single partition since the smaller data is copied on each and every executor. Also, where would be the join operation performed? on another worker node or it is performed on the driver side? Somebody, please help me to understand this by correcting me w.r.t my points or just adding an explanation to it. TIA, Sid