Hi guys,

I'm trying to do a cross join (cartesian product) with 3 tables stored as
parquet. Each table has 1 column, a long key.

Table A has 60,000 keys with 1000 partitions
Table B has 1000 keys with 1 partition
Table C has 4 keys with 1 partition

The output should be 240 million row combinations of the keys.

How should I partition and order the joins, and more generally, how does a
cartesian product work in Spark?

dfA.join(dfB).join(dfC)?
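In full, it's roughly the following (the paths and SQLContext setup are placeholders for my actual job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("cross-join"))
val sqlContext = new SQLContext(sc)

// Each table is a single column of long keys stored as parquet
val dfA = sqlContext.read.parquet("/path/to/tableA") // 60,000 keys, 1000 partitions
val dfB = sqlContext.read.parquet("/path/to/tableB") // 1,000 keys, 1 partition
val dfC = sqlContext.read.parquet("/path/to/tableC") // 4 keys, 1 partition

// With no join condition, each join gets planned as a CartesianProduct
val product = dfA.join(dfB).join(dfC)
product.count() // expect 60,000 * 1,000 * 4 = 240,000,000 rows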

This seems very slow and sometimes the executors crash (I'm assuming from
running out of memory). I'm running with low parallelism: a few cores (5)
and a big heap (30 GB) per executor.

Is there another way to do this join? Should I somehow repartition the
records?
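For instance, would hinting that the two small tables can be broadcast be the right approach here? Something like this (broadcast is the hint from org.apache.spark.sql.functions; I haven't verified that it actually avoids the CartesianProduct):

import org.apache.spark.sql.functions.broadcast

// Mark the tiny tables as small enough to ship to every executor,
// hoping the planner picks a broadcast join over a shuffling
// cartesian product of the big table
val product = dfA.join(broadcast(dfB)).join(broadcast(dfC))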

Cheers,
~N

Plan from the above operation:

== Physical Plan ==
CartesianProduct
 CartesianProduct
  Repartition 800, true
   Project [Rand -2281254270918005092 AS _c0#159,KEYA#162L]
    PhysicalRDD [KEYA#162L], MapPartitionsRDD[148] at
  InMemoryColumnarTableScan [KEYB#117], (InMemoryRelation [KEYB#117], true, 2000, StorageLevel(true, true, false, true, 1), (PhysicalRDD [KEYB#55], MapPartitionsRDD[26] at), None)
 InMemoryColumnarTableScan [KEYC#120], (InMemoryRelation [KEYC#120], true, 2000, StorageLevel(true, true, false, true, 1), (Repartition 1, true), None)
