I'm trying to do a brute-force fuzzy join in which I compare N records against N other records, for N^2 comparisons in total.
The table is medium-sized and fits in memory, so I collect it and put it into a broadcast variable. The other copy of the table is in an RDD. I call the RDD's map operation, and for each record in the RDD the function takes the broadcast table and filters it. I'm seeing a lot of GC activity, and I suspect it is caused by the filtered copies of the broadcast table that are repeatedly created and then discarded. Is there a way to fix this pattern? Thanks, Arun
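
For reference, the pattern described above looks roughly like the minimal sketch below. It is plain Python standing in for the Spark job, not the actual code, and the job itself may well be written in another language. The record shape, the `is_match` predicate, and the function names `filter_table` and `fuzzy_join` are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical record shape: (id, name). The real schema is not given in the post.
Record = Tuple[int, str]


def is_match(a: Record, b: Record) -> bool:
    """Hypothetical fuzzy predicate: names are equal ignoring case and outer whitespace."""
    return a[1].strip().lower() == b[1].strip().lower()


def filter_table(
    record: Record,
    table: Sequence[Record],
    predicate: Callable[[Record, Record], bool] = is_match,
) -> List[Record]:
    """Per-record step: scan the whole table and build a new list of matching rows."""
    return [row for row in table if predicate(record, row)]


def fuzzy_join(
    records: Sequence[Record], table: Sequence[Record]
) -> List[Tuple[Record, List[Record]]]:
    """Stand-in for the RDD map: every record filters the full (broadcast) table.

    In PySpark the equivalent would be something like
    rdd.map(lambda r: (r, filter_table(r, bc.value))) with bc = sc.broadcast(table).
    """
    return [(record, filter_table(record, table)) for record in records]


if __name__ == "__main__":
    table = [(1, "Alice"), (2, "bob "), (3, "Carol")]
    records = [(10, "alice"), (11, "BOB"), (12, "dave")]
    for record, matches in fuzzy_join(records, table):
        print(record, matches)
```

In this sketch every input record scans all rows of the table and allocates one new list of matches, so N records produce N short-lived lists on top of the N^2 predicate calls.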