[ 
https://issues.apache.org/jira/browse/SPARK-32494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32494:
------------------------------------

    Assignee:     (was: Apache Spark)

> Null Aware Anti Join Optimize Support Multi-Column
> --------------------------------------------------
>
>                 Key: SPARK-32494
>                 URL: https://issues.apache.org/jira/browse/SPARK-32494
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into 
> BroadcastHashJoin within the Single-Column NAAJ scenario, by using hash 
> lookup instead of loop join. 
> It's simple to just fulfill a "NOT IN" logical when it's a single key, but 
> multi-column not in is much more complicated with all these null aware 
> comparison.
> FYI, code logical for single and multi column is defined at
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> Hence, proposed with a New type HashedRelation, NullAwareHashedRelation. 
> For NullAwareHashedRelation
>  # it will not skip anyNullColumn key like LongHashedRelation and 
> UnsafeHashedRelation do.
>  # while building NullAwareHashedRelation, will put extra keys into the 
> relation, just to make null aware columns comparison in hash lookup style.
> the duplication would be 2^numKeys - 1 times, for example, if we are to 
> support NAAJ with 3 column join key, the buildSide would be expanded into 
> (2^3 - 1) times, 7X.
> For example, if there is a UnsafeRow key (1,2,3)
> In NullAware Mode, it should be expanded into 7 keys with extra C(3,1), 
> C(3,2) combinations, within the combinations, we duplicated these record with 
> null padding as following.
> Original record
> (1,2,3)
> Extra record to be appended into NullAwareHashedRelation
> (null, 2, 3) (1, null, 3) (1, 2, null)
>  (null, null, 3) (null, 2, null) (1, null, null))
> with the expanded data we can extract a common pattern for both single and 
> multi column. allNull refer to a unsafeRow which has all null columns.
>  * buildSide is empty input => return all rows
>  * allNullColumnKey Exists In buildSide input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
> => drop the row
>  * if streamedSideRow.allNull is false & notFindMatch in 
> NullAwareHashedRelation => return the row
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to