c21 edited a comment on pull request #29342: URL: https://github.com/apache/spark/pull/29342#issuecomment-670632581
@agrawaldevesh - thank you for the warm welcome; I'm excited to discuss and collaborate again here!

> I am curious if the approach of storing the 'matched rows' out of band was considered ? The join algorithm could be extended to keep say an auxiliary struct of matched keys instead of populating this on the build side ? Since the build side hash tables are open addressed arrays, this auxiliary struct might be a bitset that stores the matched indices.

Yes, I agree that would be a good optimization for space. TL;DR: given that this full outer shuffled hash join is a new feature, I think we could keep it simple to begin with and optimize further if needed; there is a detailed comment [here](https://github.com/apache/spark/pull/29342#discussion_r467173696).

> In addition how do you account for this extra memory usage on the driver ? Is it possible that planner thinks that the query will "fit" and runs the query but it later on OOMs because of this extra "column" ?

That's a good question. Currently the planner does not account for this extra boolean-value overhead per row. However, for BHJ and SHJ the [planner also does not account for the extra key overhead in the hash map; its estimate is based only on the size of the rows](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L354). So improving the planner on this front will need more thought in the future.
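To make the trade-off concrete, here is a minimal, language-agnostic Python sketch (not Spark's actual code; all names are hypothetical) of the out-of-band idea: the full outer hash join marks matched build rows in an auxiliary bitset keyed by the build row's index, rather than materializing a "matched" boolean inside every build-side row.

```python
# Hypothetical sketch, not Spark's implementation: full outer hash join
# that records matched build rows in an out-of-band bitset indexed by
# build-row position, instead of a per-row "matched" boolean column.

def full_outer_hash_join(build, stream):
    """build/stream: lists of (key, value) pairs.

    Returns (key, build_value_or_None, stream_value_or_None) tuples.
    """
    # Build side: key -> list of indices into `build` (stands in for the
    # open-addressed hash relation on the build side).
    table = {}
    for i, (k, _) in enumerate(build):
        table.setdefault(k, []).append(i)

    # Auxiliary struct: one bit per build row. A Python int doubles as a
    # growable bitset here; a real engine would use a fixed bit array.
    matched = 0

    out = []
    for k, sv in stream:
        idxs = table.get(k)
        if idxs is None:
            out.append((k, None, sv))          # stream-only row
        else:
            for i in idxs:
                matched |= 1 << i              # mark build row i matched
                out.append((k, build[i][1], sv))

    # Second pass: emit build rows whose bit was never set.
    for i, (k, bv) in enumerate(build):
        if not (matched >> i) & 1:
            out.append((k, bv, None))
    return out
```

The point of the sketch is the space accounting: the bitset costs one bit per build row, versus one boolean field (plus alignment) carried in every row of the build relation, which is exactly the overhead the planner discussion above is about.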