[ 
https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32474:
------------------------------------

    Assignee: Apache Spark

> NullAwareAntiJoin multi-column support
> --------------------------------------
>
>                 Key: SPARK-32474
>                 URL: https://issues.apache.org/jira/browse/SPARK-32474
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Assignee: Apache Spark
>            Priority: Minor
>
> This is a follow-up improvement to SPARK-32290.
> In SPARK-32290, we optimized NAAJ from BroadcastNestedLoopJoin to 
> BroadcastHashJoin, which improves the total computation from O(M*N) to O(M), 
> but only for the single-column case, because multi-column support is much 
> more complicated.
> See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6.
>  
> FYI, the test logic for the single- and multi-column cases is defined in:
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> To support multiple columns, I propose the following idea to see whether 
> multi-column support is worth the trade-offs. It requires some data expansion 
> in HashedRelation, and I would call this new type of HashedRelation a 
> NullAwareHashedRelation.
>  
> In a NullAwareHashedRelation, keys with null columns are allowed, unlike in 
> LongHashedRelation and UnsafeHashedRelation, and a single key may be expanded 
> into 2^N - 1 records (N being the number of key columns). For example, when 
> the record (1, 2, 3) is inserted into a NullAwareHashedRelation, we take 
> every C(3,1) and C(3,2) combination of column positions, copy the original 
> key row, set the selected positions to null, and insert the result. Including 
> the original key row, 7 key rows are inserted, as follows:
> (null, 2, 3)
> (1, null, 3)
> (1, 2, null)
> (null, null, 3)
> (null, 2, null)
> (1, null, null)
> (1, 2, 3)
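The expansion step above can be sketched as follows. This is an illustrative, language-agnostic sketch in Python (the actual Spark implementation would live in Scala's HashedRelation code); the function name `expand_key` is hypothetical.

```python
from itertools import combinations

def expand_key(key):
    """Expand a build-side key into its null-padded variants.

    Every subset of column positions except the full set is replaced
    by None, yielding 2^N - 1 rows per key. The all-null row is
    deliberately excluded: an all-null key is handled by the separate
    allNullColumnKey check described below.
    """
    n = len(key)
    rows = []
    for k in range(n):  # k = number of columns to null out (0 .. n-1)
        for positions in combinations(range(n), k):
            row = tuple(None if i in positions else v
                        for i, v in enumerate(key))
            rows.append(row)
    return rows

# expand_key((1, 2, 3)) yields the 7 rows listed above,
# including the original (1, 2, 3) and excluding (None, None, None).
```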
>  
> With the expanded data we can apply a common matching pattern to both the 
> single- and multi-column cases. Here allNull refers to an UnsafeRow whose 
> columns are all null:
>  * buildSide input is empty => return all rows
>  * an allNullColumnKey exists in the buildSide input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false and a match is found in the 
> NullAwareHashedRelation => drop the row
>  * if streamedSideRow.allNull is false and no match is found in the 
> NullAwareHashedRelation => return the row
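Putting the rules above together, here is a minimal, self-contained Python sketch of the whole matching pattern (the real implementation would be Scala inside the BroadcastHashJoin operator; `null_aware_anti_join` and `expand` are hypothetical names for illustration only):

```python
from itertools import combinations

def expand(key):
    # All null-substitution patterns except all-null: 2^N - 1 rows per key,
    # as described in the expansion example above.
    n = len(key)
    return {
        tuple(None if i in pos else v for i, v in enumerate(key))
        for k in range(n)
        for pos in combinations(range(n), k)
    }

def null_aware_anti_join(streamed, build_keys):
    """Filter streamed rows by the NAAJ rules (illustrative sketch only)."""
    build_keys = list(build_keys)
    if not build_keys:                          # empty build side: keep all rows
        return list(streamed)
    if any(all(v is None for v in k) for k in build_keys):
        return []                               # allNullColumnKey: reject all rows
    relation = set().union(*(expand(k) for k in build_keys))
    result = []
    for row in streamed:
        if all(v is None for v in row):         # streamed row is allNull: drop
            continue
        if tuple(row) in relation:              # match found: drop
            continue
        result.append(row)                      # no match: return the row
    return result
```

Note how a partially-null streamed row such as (1, NULL, 3) hits the expanded key (1, None, 3) derived from the build key (1, 2, 3), which is exactly the null-aware semantics NOT IN requires.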
>  
> This solution does expand the buildSide data by up to 2^N - 1 times, but 
> since NAAJ in typical production queries involves only 2-3 columns, I suppose 
> it is acceptable to expand the buildSide data to around 7X. I would also cap 
> the number of key columns supported for NAAJ at no more than 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
