[jira] [Updated] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

Leanken.Lin (Jira) Thu, 30 Jul 2020 03:20:24 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-32494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Leanken.Lin updated SPARK-32494:
--------------------------------
    Description: 
In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into 
BroadcastHashJoin within the Single-Column NAAJ scenario, by using hash lookup 
instead of loop join. 

It's simple to just fulfill a "NOT IN" logical when it's a single key, but 
multi-column not in is much more complicated with all these null aware 
comparison.

FYI, code logical for single and multi column is defined at

~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql

~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql

 

Hence, proposed with a New type HashedRelation, NullAwareHashedRelation. 

For NullAwareHashedRelation
 # it will not skip anyNullColumn key like LongHashedRelation and 
UnsafeHashedRelation do.
 # while building NullAwareHashedRelation, will put extra keys into the 
relation, just to make null aware columns comparison in hash lookup style.

the duplication would be 2^numKeys - 1 times, for example, if we are to support 
NAAJ with 3 column join key, the buildSide would be expanded into (2^3 - 1) 
times, 7X.

For example, if there is a UnsafeRow key (1,2,3)

In NullAware Mode, it should be expanded into 7 keys with extra C(3,1), C(3,2) 
combinations, within the combinations, we duplicated these record with null 
padding as following.

Original record

(1,2,3)

Extra record to be appended into NullAwareHashedRelation

(null, 2, 3) (1, null, 3) (1, 2, null)
 (null, null, 3) (null, 2, null) (1, null, null))

with the expanded data we can extract a common pattern for both single and 
multi column. allNull refer to a unsafeRow which has all null columns.
 * buildSide is empty input => return all rows
 * allNullColumnKey Exists In buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row

 

 

  was:
In Issue [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we 
managed to optimize BroadcastNestedLoopJoin into BroadcastHashJoin within the 
Single-Column NAAJ scenario, by using hash lookup instead of loop join. 

It's simple to just fulfill a "NOT IN" logical when it's a single key, but 
multi-column not in is much more complicated with all these null aware compare.

Hence, proposed with a New type HashedRelation, NullAwareHashedRelation. 

For NullAwareHashedRelation
 # it will not skip anyNullColumn key like LongHashedRelation and 
UnsafeHashedRelation
 # while building NullAwareHashedRelation, will put extra keys into the 
relation, just to make null aware columns comparison in hash lookup style.

the duplication would be 2^numKeys - 1, for example, if we are to support NAAJ 
with 3 column join key, the buildSide would be expanded into (2^3 - 1) times, 
7X.

For example, if there is a UnsafeRow key (1,2,3)


In NullAware Mode, it should be expanded into 7 keys with extra C(3,1), C(3,2) 
combinations, within the combinations, we duplicated these record with null 
padding as following.

Original record

(1,2,3)

Extra record to be appended into HashedRelation

(null, 2, 3) (1, null, 3) (1, 2, null)
(null, null, 3) (null, 2, null) (1, null, null))

with the expanded data we can extract a common pattern for both single and 
multi column. allNull refer to a unsafeRow which has all null columns.
 * buildSide is empty input => return all rows
 * allNullColumnKey Exists In buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row

 

 


> Null Aware Anti Join Optimize Support Multi-Column
> --------------------------------------------------
>
>                 Key: SPARK-32494
>                 URL: https://issues.apache.org/jira/browse/SPARK-32494
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into 
> BroadcastHashJoin within the Single-Column NAAJ scenario, by using hash 
> lookup instead of loop join. 
> It's simple to just fulfill a "NOT IN" logical when it's a single key, but 
> multi-column not in is much more complicated with all these null aware 
> comparison.
> FYI, code logical for single and multi column is defined at
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> Hence, proposed with a New type HashedRelation, NullAwareHashedRelation. 
> For NullAwareHashedRelation
>  # it will not skip anyNullColumn key like LongHashedRelation and 
> UnsafeHashedRelation do.
>  # while building NullAwareHashedRelation, will put extra keys into the 
> relation, just to make null aware columns comparison in hash lookup style.
> the duplication would be 2^numKeys - 1 times, for example, if we are to 
> support NAAJ with 3 column join key, the buildSide would be expanded into 
> (2^3 - 1) times, 7X.
> For example, if there is a UnsafeRow key (1,2,3)
> In NullAware Mode, it should be expanded into 7 keys with extra C(3,1), 
> C(3,2) combinations, within the combinations, we duplicated these record with 
> null padding as following.
> Original record
> (1,2,3)
> Extra record to be appended into NullAwareHashedRelation
> (null, 2, 3) (1, null, 3) (1, 2, null)
>  (null, null, 3) (null, 2, null) (1, null, null))
> with the expanded data we can extract a common pattern for both single and 
> multi column. allNull refer to a unsafeRow which has all null columns.
>  * buildSide is empty input => return all rows
>  * allNullColumnKey Exists In buildSide input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
> => drop the row
>  * if streamedSideRow.allNull is false & notFindMatch in 
> NullAwareHashedRelation => return the row
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

Reply via email to