UnsafeHashedRelation and HashedRelation can also be used on the
executors (for non-broadcast hash joins). There the UnsafeRow may come
from an UnsafeProjection, which reuses a single row buffer for every
output row, so we should copy the rows for safety.
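
To make the reuse concrete, here is a minimal sketch against the
catalyst internals (internal API, so signatures may vary across Spark
versions):

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
  import org.apache.spark.sql.types.{DataType, IntegerType}

  val proj = UnsafeProjection.create(Array[DataType](IntegerType))
  // proj(row) writes into one reused buffer and returns the same
  // UnsafeRow instance on every call:
  val r1 = proj(InternalRow(1))
  val r2 = proj(InternalRow(2))
  // r1 and r2 are the same object; both now read back 2. Inserting
  // r1.copy() instead keeps the first value intact.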

We could have a smarter copy() for UnsafeRow (skip the copy if the row
already owns its buffer), but I don't think the copy here adds much
memory pressure. The total memory is determined by how many rows are
stored in the hash tables.
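
Purely as a hypothetical sketch of that smarter copy() (UnsafeRow has
no such ownership flag today; SmartRow and owned are made up for
illustration):

  // Hypothetical: track whether this row already owns its bytes.
  class SmartRow(val bytes: Array[Byte], val owned: Boolean) {
    def copy(): SmartRow =
      if (owned) this                       // already safe to retain
      else new SmartRow(bytes.clone(), owned = true)
  }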

In general, if you do not have enough memory, just don't increase
autoBroadcastJoinThreshold, or performance could get worse because of
full GCs.
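
For reference, the threshold is in bytes, and -1 disables broadcast
joins entirely; e.g. with a SQLContext:

  // keep (or lower) the default 10MB threshold
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
    (10L * 1024 * 1024).toString)
  // or turn broadcast joins off completely
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")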

Sometimes a table looks small as a compressed file (for example, a
Parquet file), but once it's loaded into memory it can require much
more memory than the size of the file on disk.
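
A quick way to compare the two sizes (SizeEstimator is a developer
API, and the path below is just a placeholder):

  import org.apache.spark.util.SizeEstimator

  // "/path/to/table" stands in for your Parquet data
  val rows = sqlContext.read.parquet("/path/to/table").collect()
  // on-heap bytes once materialized, vs. the compressed size on disk
  println(SizeEstimator.estimate(rows))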


On Tue, Mar 1, 2016 at 5:17 PM, Matt Cheah <mch...@palantir.com> wrote:
> Hi everyone,
>
> I had a quick question regarding our implementation of UnsafeHashedRelation
> and HashedRelation. It appears that we copy the rows that we’ve collected
> into memory upon inserting them into the hash table in
> UnsafeHashedRelation#apply(). I was wondering why we are copying the rows
> every time? I can’t imagine these rows being mutable in this scenario.
>
> The context is that I’m looking into a case where a small data frame should
> fit in the driver’s memory, but my driver ran out of memory after I
> increased the autoBroadcastJoinThreshold. YourKit is indicating that this
> logic is consuming more memory than my driver can handle.
>
> Thanks,
>
> -Matt Cheah
