[ 
https://issues.apache.org/jira/browse/SPARK-32399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298414#comment-17298414
 ] 

Wensheng Wang commented on SPARK-32399:
---------------------------------------

[~chengsu] [~cloud_fan] 

It seems there is a potential bug in the current implementation for 
FullOuterJoin in ShuffledHashJoinExec:

The implementation of FullOuterJoin always assumes the left of "JoinedRow" is 
the left input of SHJ, vice versa. This is different from other Joins' 
implementation which assumes the left of "JoinedRow" is the streaming side and 
the right is the build side. However all these join implementations share the 
same "boundCondition“ which assumes the left of "JoinedRow" is the streaming 
side and the right is the build side. This could cause the FullOuterJoin 
evaluates non-equi join conditions in the wrong way. 

To verify this, simply remove the "if (joinType != FullOuter)" condition in the 
OuterJoinSuite which tests FullOuter join implementation for SHJ, it fails with 
the output. Please help to confirm. Thanks.

 

!Screen Shot 2021-03-09 at 3.06.30 PM.png!

> Support full outer join in shuffled hash join
> ---------------------------------------------
>
>                 Key: SPARK-32399
>                 URL: https://issues.apache.org/jira/browse/SPARK-32399
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Cheng Su
>            Assignee: Cheng Su
>            Priority: Minor
>             Fix For: 3.1.0
>
>         Attachments: Screen Shot 2020-10-14 at 11.08.37 PM.png, Screen Shot 
> 2020-10-14 at 12.30.07 PM.png, Screen Shot 2021-03-09 at 3.06.30 PM.png
>
>
> Currently for SQL full outer join, spark always does a sort merge join no 
> matter of how large the join children size are. Inspired by recent discussion 
> in [https://github.com/apache/spark/pull/29130#discussion_r456502678] and 
> [https://github.com/apache/spark/pull/29181], I think we can support full 
> outer join in shuffled hash join in a way that - when looking up stream side 
> keys from build side {{HashedRelation}}. Mark this info inside build side 
> {{HashedRelation}}, and after reading all rows from stream side, output all 
> non-matching rows from build side based on modified {{HashedRelation}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to