c21 edited a comment on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-670632581
@agrawaldevesh - thank you for the warm welcome; I'm excited to discuss and collaborate again here!

> I am curious if the approach of storing the 'matched rows' out of band was considered ? The join algorithm could be extended to keep say an auxiliary struct of matched keys instead of populating this on the build side ? Since the build side hash tables are open addressed arrays, this auxiliary struct might be a bitset that stores the matched indices.

Yes, I agree that would be a good optimization for space. TL;DR: given that this full outer shuffled hash join is a new feature, I think we could keep it simple to begin with and optimize further if needed; there is a detailed comment [here](https://github.com/apache/spark/pull/29342#discussion_r467173696).

> In addition how do you account for this extra memory usage on the driver ? Is it possible that planner thinks that the query will "fit" and runs the query but it later on OOMs because of this extra "column" ?

That's a good question. Currently the planner does not account for this extra boolean-value overhead per row. However, for BHJ and SHJ the [planner also does not account for the extra key overhead in the hash map; its estimate is based only on the size of the rows](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L354). So improving the planner on this front will need more thought in the future.
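To make the trade-off concrete, here is a minimal, language-agnostic Python sketch (not Spark's actual code; all names are hypothetical) of the out-of-band idea: the full outer hash join marks matched build rows in an auxiliary bitset keyed by the build row's index, rather than materializing a "matched" boolean inside every build-side row.

```python
# Hypothetical sketch, not Spark's implementation: full outer hash join
# that records matched build rows in an out-of-band bitset indexed by
# build-row position, instead of a per-row "matched" boolean column.

def full_outer_hash_join(build, stream):
    """build/stream: lists of (key, value) pairs.

    Returns (key, build_value_or_None, stream_value_or_None) tuples.
    """
    # Build side: key -> list of indices into `build` (stands in for the
    # open-addressed hash relation on the build side).
    table = {}
    for i, (k, _) in enumerate(build):
        table.setdefault(k, []).append(i)

    # Auxiliary struct: one bit per build row. A Python int doubles as a
    # growable bitset here; a real engine would use a fixed bit array.
    matched = 0

    out = []
    for k, sv in stream:
        idxs = table.get(k)
        if idxs is None:
            out.append((k, None, sv))          # stream-only row
        else:
            for i in idxs:
                matched |= 1 << i              # mark build row i matched
                out.append((k, build[i][1], sv))

    # Second pass: emit build rows whose bit was never set.
    for i, (k, bv) in enumerate(build):
        if not (matched >> i) & 1:
            out.append((k, bv, None))
    return out
```

The point of the sketch is the space accounting: the bitset costs one bit per build row, versus one boolean field (plus alignment) carried in every row of the build relation, which is exactly the overhead the planner discussion above is about.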