Hi All,

We are joining large tables using Spark SQL and running into shuffle issues. We have explored multiple options: using coalesce to reduce the number of partitions, tuning various parameters such as the shuffle disk buffer, processing the data in smaller chunks, etc., all of which help to some extent. What I would like to know is: is using a pair RDD instead of a regular RDD one of the solutions? Will it make the join more efficient, since Spark knows the key and can shuffle more intelligently (or avoid re-shuffling altogether)? Logically I think it should help, but I haven't found any evidence for this anywhere, including the Spark SQL documentation.
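For concreteness, below is a rough sketch of what we have in mind, in Scala. The file paths, key column, output column names, and the 200-partition HashPartitioner are all placeholders, not our actual setup. The idea is that partitioning both pair RDDs with the same partitioner shuffles each side once, after which the join sees co-partitioned inputs and does not shuffle again:

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}
import org.apache.spark.sql.SQLContext

object PairRddJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pair-rdd-join-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Key both sides by the join column to get pair RDDs.
    // Placeholder assumption: CSV input, join key in the first field.
    val orders = sc.textFile("hdfs:///data/orders.csv")
      .map(_.split(","))
      .map(f => (f(0), f))                      // (custId, full row)
    val customers = sc.textFile("hdfs:///data/customers.csv")
      .map(_.split(","))
      .map(f => (f(0), f))

    // Partition both sides with the same partitioner and cache them.
    // Each side is shuffled once here; the join below then runs on
    // co-partitioned inputs without another shuffle.
    val partitioner = new HashPartitioner(200)  // partition count is a placeholder
    val ordersP    = orders.partitionBy(partitioner).persist()
    val customersP = customers.partitionBy(partitioner).persist()

    val joined = ordersP.join(customersP)       // co-partitioned join

    // Flatten the (key, (left, right)) pairs and register the result
    // as a table so downstream Spark SQL queries can keep using it.
    val joinedDF = joined
      .map { case (custId, (o, c)) => (custId, o(1), c(1)) }
      .toDF("custId", "orderCol", "customerCol")
    joinedDF.registerTempTable("joined_result")

    sqlContext.sql("SELECT COUNT(*) FROM joined_result").show()
    sc.stop()
  }
}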
It is a lot of effort for us to try this approach and weigh the performance, as we need to register the output as tables to keep using it downstream. Hence I would appreciate input from the community before proceeding.

Regards,
Sunita Koppar