[jira] [Commented] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

Runkang He (Jira) Tue, 10 Oct 2023 05:13:21 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-32494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773659#comment-17773659
 ]


Runkang He commented on SPARK-32494:
------------------------------------

Hi [~leanken], is this feature available in Spark's master branch? I don't find 
it in master now.

> Null Aware Anti Join Optimize Support Multi-Column
> --------------------------------------------------
>
>                 Key: SPARK-32494
>                 URL: https://issues.apache.org/jira/browse/SPARK-32494
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into 
> BroadcastHashJoin within the Single-Column NAAJ scenario, by using hash 
> lookup instead of loop join. 
> It's simple to just fulfill a "NOT IN" logical when it's a single key, but 
> multi-column not in is much more complicated with all these null aware 
> comparison.
> FYI, code logical for single and multi column is defined at
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> Hence, proposed with a New type HashedRelation, NullAwareHashedRelation. 
> For NullAwareHashedRelation
>  # it will not skip anyNullColumn key like LongHashedRelation and 
> UnsafeHashedRelation do.
>  # while building NullAwareHashedRelation, will put extra keys into the 
> relation, just to make null aware columns comparison in hash lookup style.
> the duplication would be 2^numKeys - 1 times, for example, if we are to 
> support NAAJ with 3 column join key, the buildSide would be expanded into 
> (2^3 - 1) times, 7X.
> For example, if there is a UnsafeRow key (1,2,3)
> In NullAware Mode, it should be expanded into 7 keys with extra C(3,1), 
> C(3,2) combinations, within the combinations, we duplicated these record with 
> null padding as following.
> Original record
> (1,2,3)
> Extra record to be appended into NullAwareHashedRelation
> (null, 2, 3) (1, null, 3) (1, 2, null)
>  (null, null, 3) (null, 2, null) (1, null, null))
> with the expanded data we can extract a common pattern for both single and 
> multi column. allNull refer to a unsafeRow which has all null columns.
>  * buildSide is empty input => return all rows
>  * allNullColumnKey Exists In buildSide input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
> => drop the row
>  * if streamedSideRow.allNull is false & notFindMatch in 
> NullAwareHashedRelation => return the row
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

Reply via email to