Hi everyone, I had a quick question regarding our implementation of UnsafeHashedRelation and HashedRelation<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala>. It appears that in UnsafeHashedRelation#apply() we copy each row we've collected into memory as we insert it into the hash table. Why do we copy the rows every time? I can't see how these rows would be mutated at that point.
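For reference, here is a toy sketch (hypothetical names, not Spark's actual code) of the aliasing hazard that I assume copy() is meant to guard against: if the input iterator reuses one mutable row buffer, storing the reference directly would leave every hash-table entry pointing at the last row read.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a row whose backing buffer gets reused.
final class ReusableRow(var key: Int, var value: String) {
  def copy(): ReusableRow = new ReusableRow(key, value)
}

val buffer = new ReusableRow(0, "")              // single reused buffer
val input  = Seq((1, "a"), (2, "b"), (3, "c"))

// Without copy(): every map entry aliases the same buffer,
// so all entries end up holding the last row's contents.
val aliased = mutable.Map.empty[Int, ReusableRow]
for ((k, v) <- input) { buffer.key = k; buffer.value = v; aliased(k) = buffer }

// With copy(): each entry owns its own data, at the cost of an
// extra allocation per row — which is the memory overhead in question.
val copied = mutable.Map.empty[Int, ReusableRow]
for ((k, v) <- input) { buffer.key = k; buffer.value = v; copied(k) = buffer.copy() }
```

So the copy is clearly necessary when the upstream iterator reuses buffers; my question is whether that holds for rows already collected to the driver.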
For context, I'm looking into a case where a small DataFrame should fit in the driver's memory, but my driver ran out of memory after I increased spark.sql.autoBroadcastJoinThreshold. YourKit indicates that this copying logic is consuming more memory than my driver can handle. Thanks, -Matt Cheah