[ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Leanken.Lin resolved SPARK-32474.
---------------------------------
    Resolution: Duplicate

> NullAwareAntiJoin multi-column support
> --------------------------------------
>
>                 Key: SPARK-32474
>                 URL: https://issues.apache.org/jira/browse/SPARK-32474
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Priority: Minor
>
> This is a follow-up improvement to SPARK-32290.
> In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to
> BroadcastHashJoin, which improved the total computation from O(M*N) to O(M),
> but it only targets the single-column case, because multi-column support is
> much more complicated.
> See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6.
>
> FYI, the test logic for the single- and multi-column cases is defined at
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>
> To support multiple columns, I propose the following idea; let's see whether
> multi-column support is worth doing given the trade-offs. I would need to do
> some data expansion in HashedRelation, and I would call this new type of
> HashedRelation a NullAwareHashedRelation.
>
> In a NullAwareHashedRelation, keys with null columns are allowed, which is
> the opposite of LongHashedRelation and UnsafeHashedRelation; and a single
> key may be expanded into 2^N - 1 records (N refers to the number of key
> columns). For example, if the record (1, 2, 3) is about to be inserted into
> a NullAwareHashedRelation, we take the C(1,3) and C(2,3) position
> combinations, copy the original key row, set null at the target positions,
> and insert the copies into the NullAwareHashedRelation. Including the
> original key row, 7 key rows will be inserted, as follows:
> (null, 2, 3)
> (1, null, 3)
> (1, 2, null)
> (null, null, 3)
> (null, 2, null)
> (1, null, null)
> (1, 2, 3)
>
> With the expanded data we can extract a common pattern for both the
> single- and multi-column cases. "allNull" refers to an UnsafeRow whose
> columns are all null.
> * buildSide input is empty => return all rows
> * an all-null-column key exists in the buildSide input => reject all rows
> * if streamedSideRow.allNull is true => drop the row
> * if streamedSideRow.allNull is false and a match is found in the
> NullAwareHashedRelation => drop the row
> * if streamedSideRow.allNull is false and no match is found in the
> NullAwareHashedRelation => return the row
>
> This solution will certainly expand the buildSide data by up to 2^N - 1
> times, but since a NAAJ in a typical production query usually involves only
> 2~3 columns, I suppose it is acceptable to expand the buildSide data to
> around 7X. I would also impose a limit on the maximum number of columns
> supported for NAAJ, basically no more than 3.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
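The key expansion and the decision rules described in the issue can be sketched in plain Python (Spark's actual implementation would be in Scala over UnsafeRow). This is a minimal illustrative sketch, not Spark code: tuples stand in for key rows, None stands in for a null column, and the names `expand_key`, `build_relation`, and `null_aware_anti_join` are hypothetical. For simplicity it assumes build-side keys are either fully non-null or fully null; partially-null build-side keys are part of what makes the multi-column case complicated.

```python
# Hypothetical sketch of the proposed NullAwareHashedRelation idea.
# Tuples model key rows; None models a null column.
from itertools import combinations


def expand_key(key):
    """Expand one key row into its 2^N - 1 variants: for every proper
    subset of column positions, copy the row and set those positions to
    null. The all-null variant is excluded; the original row is included."""
    n = len(key)
    variants = []
    for k in range(n):  # k = number of columns set to null, 0 .. n-1
        for positions in combinations(range(n), k):
            variants.append(tuple(
                None if i in positions else v for i, v in enumerate(key)))
    return variants


def build_relation(build_rows):
    """Build the null-aware relation: a set of expanded keys, plus a flag
    recording whether any build-side key was entirely null."""
    relation = set()
    has_all_null_key = False
    for row in build_rows:
        if all(v is None for v in row):
            has_all_null_key = True
        else:
            relation.update(expand_key(row))
    return relation, has_all_null_key


def null_aware_anti_join(stream_rows, build_rows):
    """Apply the decision rules from the issue description."""
    build_rows = list(build_rows)
    if not build_rows:                      # empty build side => return all rows
        return list(stream_rows)
    relation, has_all_null_key = build_relation(build_rows)
    if has_all_null_key:                    # all-null key on build side => reject all
        return []
    kept = []
    for row in stream_rows:
        if all(v is None for v in row):     # streamed row all null => drop
            continue
        if row in relation:                 # match found => drop
            continue
        kept.append(row)                    # no match found => return the row
    return kept
```

For the example key (1, 2, 3), `expand_key` produces exactly the 7 rows listed above, and a streamed row such as (1, None, 3) is dropped because its exact form already exists in the expanded build side, mirroring SQL's "unknown" result for NOT IN comparisons involving nulls.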