Hi All

We are joining large tables using Spark SQL and running into shuffle
issues. We have explored several options - using coalesce to reduce the
number of partitions, tuning parameters such as the shuffle disk buffer,
processing the data in smaller chunks, etc. - all of which help to some
extent. What I would like to know is whether using a pair RDD instead of a
regular RDD is one of the solutions. Would it make the join more efficient,
since Spark knows the key and can shuffle the data more intelligently?
Logically it seems like it should help, but I haven't found any evidence
for this online, including in the Spark SQL documentation.
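Roughly, the kind of thing I am imagining is below (just a sketch with made-up
names, following the pattern of pre-partitioning both pair RDDs with the same
partitioner so the join itself does not have to re-shuffle both sides):

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object PairRddJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pair-rdd-join-sketch"))

    // Stand-in data: (customerId, name) and (customerId, orderAmount)
    val customers = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val orders    = sc.parallelize(Seq((1, 100.0), (1, 25.0), (2, 40.0)))

    val partitioner = new HashPartitioner(200)

    // Partition both sides identically and cache them, so the shuffle
    // caused by partitionBy happens only once per RDD.
    val left  = customers.partitionBy(partitioner).persist()
    val right = orders.partitionBy(partitioner).persist()

    // Because both RDDs share the same partitioner, matching keys are
    // already co-located, so this join should not shuffle either input again.
    val joined = left.join(right)   // RDD[(Int, (String, Double))]
    joined.take(5).foreach(println)

    sc.stop()
  }
}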

Trying this approach and measuring the performance would be a lot of effort
on our side, since we need to register the output as tables in order to keep
working with it. Hence we would appreciate input from the community before
proceeding.
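For context, the follow-on step after the join would look roughly like this
(continuing the sketch above; this assumes the Spark 1.x SQLContext API, and
the table and column names are made up - newer versions would use SparkSession
and createOrReplaceTempView instead):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Flatten the joined pair RDD into a DataFrame and register it as a table
// so the rest of our Spark SQL pipeline can keep querying it.
val joinedDf = joined
  .map { case (id, (name, amount)) => (id, name, amount) }
  .toDF("customer_id", "name", "amount")

joinedDf.registerTempTable("customer_orders")

sqlContext.sql("SELECT name, SUM(amount) AS total FROM customer_orders GROUP BY name").show()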


Regards
Sunita Koppar
