c21 edited a comment on pull request #29342:
URL: https://github.com/apache/spark/pull/29342#issuecomment-670632581


   @agrawaldevesh - thank you for the warm welcome; I'm excited to discuss and 
collaborate again here!
   
   > I am curious if the approach of storing the 'matched rows' out of band was 
considered ? The join algorithm could be extended to keep say an auxiliary 
struct of matched keys instead of populating this on the build side ? Since the 
build side hash tables are open addressed arrays, this auxiliary struct might 
be a bitset that stores the matched indices.
   
   Yes, I agree that would be a good optimization for space. TL;DR: since full 
outer shuffled hash join is a new feature, I think we could keep it simple to 
begin with and optimize further if needed. There is a more detailed comment 
[here](https://github.com/apache/spark/pull/29342#discussion_r467173696).
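
For reference, the out-of-band idea from the quoted question can be sketched roughly as below. This is an illustrative Python sketch, not Spark code: `full_outer_hash_join`, `key`, and the `bytearray` bitset are hypothetical names chosen for this example, the build side is modeled as a plain list plus a dict of indices rather than Spark's hashed relation, and SQL NULL-key semantics are ignored.

```python
from collections import defaultdict


def full_outer_hash_join(stream_rows, build_rows, key):
    """Full outer hash join that tracks matched build rows in a side bitset.

    Hypothetical sketch: build rows are left untouched, and one bit per
    build-row index records whether that row was ever matched.
    """
    # Build phase: map each join key to the indices of build rows with that key.
    index = defaultdict(list)
    for i, row in enumerate(build_rows):
        index[key(row)].append(i)

    # Out-of-band bitset: one bit per build row, initially all zero.
    matched = bytearray((len(build_rows) + 7) // 8)

    out = []
    # Probe phase: emit matches, or the stream row padded with None.
    for s in stream_rows:
        hits = index.get(key(s), [])
        if not hits:
            out.append((s, None))
        for i in hits:
            matched[i >> 3] |= 1 << (i & 7)  # mark build row i as matched
            out.append((s, build_rows[i]))

    # Final pass: emit build rows whose bit was never set.
    for i, b in enumerate(build_rows):
        if not matched[i >> 3] & (1 << (i & 7)):
            out.append((None, b))
    return out


left = [(1, "a"), (2, "b"), (2, "c")]
right = [(2, "x"), (3, "y")]
result = full_outer_hash_join(left, right, key=lambda r: r[0])
```

The trade-off is one bit per build row in a separate structure versus one extra boolean stored alongside each build-side row.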
   
   > In addition how do you account for this extra memory usage on the driver ? 
Is it possible that planner thinks that the query will "fit" and runs the query 
but it later on OOMs because of this extra "column" ?
   
   That's a good question. Currently the planner does not account for this extra 
per-row boolean overhead. However, for broadcast hash join (BHJ) and shuffled 
hash join (SHJ) today, [the planner also does not account for the extra key 
overhead in the hash map; its decision is based only on the size of the 
rows](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L354).
 So improving the planner side of the code needs more thought, and could be 
done in the future.
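
To illustrate the gap, here is a small arithmetic sketch in Python. It is not the planner's actual logic: the function names, the per-row overhead values, and the threshold are all hypothetical numbers picked only to show how a size-only estimate and an overhead-aware estimate can disagree.

```python
def estimated_build_bytes(num_rows, row_bytes, key_overhead_bytes=0, flag_bytes=0):
    """Estimated build-side size; overheads default to 0 (size-of-rows only)."""
    return num_rows * (row_bytes + key_overhead_bytes + flag_bytes)


def fits(estimated_bytes, threshold_bytes):
    """Hypothetical 'fits in memory' check against a size threshold."""
    return estimated_bytes < threshold_bytes


# All numbers below are made up for illustration.
num_rows = 1_000_000
row_bytes = 100
threshold = 104_000_000

rows_only = estimated_build_bytes(num_rows, row_bytes)
with_overhead = estimated_build_bytes(
    num_rows, row_bytes, key_overhead_bytes=8, flag_bytes=1
)

planner_view = fits(rows_only, threshold)       # size-of-rows estimate
runtime_view = fits(with_overhead, threshold)   # includes key and flag overhead
```

With these made-up numbers the rows-only estimate passes the check while the overhead-aware one does not, which is the kind of mismatch the question is asking about.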


