I'm trying to do a brute-force fuzzy join in which I compare N records
against N other records, for N^2 comparisons in total.

The table is medium size and fits in memory, so I collect it and put it
into a broadcast variable.

The other copy of the table is in an RDD. I call the RDD's map operation,
and for each record the map function takes the broadcast table and filters
it. There appears to be heavy GC activity, so I suspect that repeatedly
creating and discarding filtered copies of the broadcast table is the cause.
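
To make the pattern concrete, here is a minimal sketch in plain Python. It is
not the actual job code: the record shape, the is_match predicate, and both
function names are hypothetical, and the Spark wiring is shown only in
comments, assuming an existing SparkContext `sc` and RDD `rdd`. The second
function is one possible variant that streams matching pairs for a whole
partition; whether it reduces GC pressure in a real job is an assumption I
have not verified.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[int, str]  # hypothetical record shape: (id, name)


def is_match(left: Record, right: Record) -> bool:
    """Hypothetical fuzzy predicate: names share a case-insensitive 3-char prefix."""
    return left[1][:3].lower() == right[1][:3].lower()


def match_by_filtering(left: Record, table: List[Record]) -> List[Record]:
    """The pattern described above: each record filters the whole table,
    allocating a fresh result list per record."""
    return [right for right in table if is_match(left, right)]


def match_partition(
    records: Iterable[Record], table: List[Record]
) -> Iterator[Tuple[Record, Record]]:
    """Possible variant: yield matching pairs for a whole partition
    without building an intermediate list per record."""
    for left in records:
        for right in table:
            if is_match(left, right):
                yield (left, right)


# Spark wiring (assumes an existing SparkContext `sc` and RDD `rdd`):
#   table_bc = sc.broadcast(rdd.collect())
#   per_record = rdd.map(lambda r: (r, match_by_filtering(r, table_bc.value)))
#   pairs = rdd.mapPartitions(lambda part: match_partition(part, table_bc.value))
```

Both versions still do N^2 comparisons; the only difference is whether a
filtered list is materialized for every record.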

Is there a way to fix this pattern?

Thanks,
Arun
