Apache Spark reassigned SPARK-32494:
------------------------------------

    Assignee: Apache Spark

Null Aware Anti Join Optimize Support Multi-Column
--------------------------------------------------

                Key: SPARK-32494
                URL: https://issues.apache.org/jira/browse/SPARK-32494
            Project: Spark
         Issue Type: New Feature
         Components: SQL
   Affects Versions: 3.0.0
           Reporter: Leanken.Lin
           Assignee: Apache Spark
           Priority: Major

In SPARK-32290 we optimized BroadcastNestedLoopJoin into BroadcastHashJoin for the single-column NAAJ scenario by using a hash lookup instead of a loop join.

Implementing "NOT IN" semantics is simple for a single key, but multi-column NOT IN is much more complicated because of the null-aware comparisons.

FYI, the expected semantics for the single- and multi-column cases are defined in
~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql

Hence, we propose a new type of HashedRelation: NullAwareHashedRelation.

For NullAwareHashedRelation:
# It does not skip keys that contain null columns, as LongHashedRelation and UnsafeHashedRelation do.
# While building NullAwareHashedRelation, extra keys are put into the relation so that the null-aware column comparison can be done as a hash lookup. The expansion factor is 2^numKeys - 1; for example, to support NAAJ with a 3-column join key, the build side is expanded (2^3 - 1) = 7 times.

For example, an UnsafeRow key (1, 2, 3) in null-aware mode is expanded into 7 keys by adding the C(3,1) and C(3,2) combinations, where each extra record is padded with nulls as follows.

Original record
(1, 2, 3)

Extra records to be appended into NullAwareHashedRelation
(null, 2, 3) (1, null, 3) (1, 2, null)
(null, null, 3) (null, 2, null) (1, null, null)

With the expanded data we can apply a common pattern to both the single- and multi-column cases; allNull refers to an UnsafeRow whose key columns are all null. A sketch of the expansion and these decision rules is given after this list.
* buildSide is empty input => return all rows
* allNullColumnKey exists in buildSide input => reject all rows
* if streamedSideRow.allNull is true => drop the row
* if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation => drop the row
* if streamedSideRow.allNull is false & notFindMatch in NullAwareHashedRelation => return the row
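Below is a minimal standalone sketch of the key expansion and the streamed-side decision rules described above, assuming a simplified key representation. Keys are modeled as Seq[Option[Int]] instead of Spark's UnsafeRow, and the names NaajSketch, expandNullPadded and keepStreamedRow are illustrative only, not part of the Spark codebase.

{code:scala}
object NaajSketch {
  // Simplified key: one Option per join column, None standing in for null.
  type Key = Seq[Option[Int]]

  // Emit the original key plus every variant with a proper, non-empty subset
  // of columns replaced by null: 2^n - 1 keys in total for an n-column key.
  def expandNullPadded(key: Key): Seq[Key] = {
    val n = key.length
    key.indices.toSet
      .subsets()                 // all subsets of column positions to null out
      .filter(_.size < n)        // drop the all-null combination
      .map { nulled =>
        key.zipWithIndex.map { case (v, i) => if (nulled.contains(i)) None else v }
      }
      .toSeq
  }

  // Streamed-side decision rules, assuming the build side was expanded with
  // expandNullPadded and is probed through a hash lookup (findMatch).
  def keepStreamedRow(
      row: Key,
      buildSideEmpty: Boolean,
      buildHasAllNullKey: Boolean,
      findMatch: Key => Boolean): Boolean = {
    if (buildSideEmpty) true             // empty build side => return all rows
    else if (buildHasAllNullKey) false   // all-null key on build side => reject all rows
    else if (row.forall(_.isEmpty)) false // streamed row is all null => drop the row
    else !findMatch(row)                 // keep only rows with no null-aware match
  }
}

object NaajSketchDemo extends App {
  // The key (1, 2, 3) expands into 7 keys: itself plus the 6 null-padded variants.
  NaajSketch.expandNullPadded(Seq(Some(1), Some(2), Some(3))).foreach(println)
}
{code}

The expansion mirrors the (1, 2, 3) example above: the empty subset keeps the original record, while the one- and two-column subsets produce the C(3,1) and C(3,2) null-padded duplicates, so a single hash lookup on the streamed-side key covers every null-aware comparison.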