Hi guys, I'm trying to do a cross join (Cartesian product) of three tables stored as Parquet. Each table has one column, a long key.
Table A has 60,000 keys (1,000 partitions).
Table B has 1,000 keys (1 partition).
Table C has 4 keys (1 partition).

The output should be 240 million row combinations of the keys. How should I partition and order the joins, and more generally, how does a Cartesian product work in Spark? dfA.join(dfB).join(dfC)? This seems very slow and sometimes the executors crash (I assume from running out of memory). I'm running with low parallelism: few cores (5) and a big heap (30 GB).

Is there another way to do this join? Should I somehow repartition the records?

Cheers,
~N

Plan from the above operation:

== Physical Plan ==
CartesianProduct
 CartesianProduct
  Repartition 800, true
   Project [Rand -2281254270918005092 AS _c0#159,KEYA#162L]
    PhysicalRDD [KEYA#162L], MapPartitionsRDD[148] at
  InMemoryColumnarTableScan [KEYB#117], (InMemoryRelation [KEYB#117], true, 2000, StorageLevel(true, true, false, true, 1), (PhysicalRDD [KEYB#55], MapPartitionsRDD[26] at), None)
 InMemoryColumnarTableScan [KEYC#120], (InMemoryRelation [KEYC#120], true, 2000, StorageLevel(true, true, false, true, 1), (Repartition 1, true), None)
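For context, a sketch of the reordering in question (assuming the Spark 1.x Scala DataFrame API; the paths and function name below are illustrative, not from the post, and the broadcast() hint requires Spark 1.5+). Because the joins evaluate left to right, dfA.join(dfB) materializes a 60,000 x 1,000 = 60-million-row intermediate before the final join with C; joining the two small tables first keeps the intermediate at only 4,000 rows:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.broadcast

// Sketch only: paths and names are illustrative, not from the post.
def crossJoinSmallFirst(sqlContext: SQLContext): Unit = {
  val dfA = sqlContext.read.parquet("/data/A") // 60,000 keys, 1,000 partitions
  val dfB = sqlContext.read.parquet("/data/B") // 1,000 keys
  val dfC = sqlContext.read.parquet("/data/C") // 4 keys

  // Join the two tiny tables first: the intermediate result is only
  // 1,000 x 4 = 4,000 rows, instead of the 60 million rows that
  // dfA.join(dfB) would materialize.
  val smallPair = dfB.join(dfC)

  // broadcast() (Spark 1.5+) hints that the 4,000-row side should be
  // shipped whole to every executor, which may let the planner avoid
  // shuffling A at all.
  val result = dfA.join(broadcast(smallPair))

  // 60,000 x 4,000 = 240 million output rows, produced
  // partition-by-partition from A's 1,000 partitions.
  result.write.parquet("/data/out")
}
```

The plan dump above shows the opposite shape: the inner CartesianProduct pairs A (after a Repartition to 800) with B, so the large intermediate is built before C is brought in.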