[ https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117771#comment-17117771 ]
Yuming Wang commented on SPARK-31809: ------------------------------------- {noformat} hive> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1; OK STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: $hdt$_0:t1 Fetch Operator limit: -1 Alias -> Map Local Operator Tree: $hdt$_0:t1 TableScan alias: t1 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: COALESCE(c1,c2) is not null (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Select Operator expressions: c1 (type: string), c2 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE HashTable Sink Operator keys: 0 COALESCE(_col0,_col1) (type: string) 1 _col0 (type: string) Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: t2 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: c1 is not null (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Select Operator expressions: c1 (type: string) outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 COALESCE(_col0,_col1) (type: string) 1 _col0 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Execution mode: vectorized Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Time taken: 1.281 seconds, Fetched: 67 row(s) {noformat} > Infer IsNotNull for all children of NullIntolerant expressions > -------------------------------------------------------------- > > Key: SPARK-31809 > URL: https://issues.apache.org/jira/browse/SPARK-31809 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: Yuming Wang > Assignee: Yuming Wang > Priority: Major > Attachments: default.png, infer.png > > > We should infer {{IsNotNull}} for all children of {{NullIntolerant}} > expressions. For example: > {code:sql} > CREATE TABLE t1(c1 string, c2 string); > CREATE TABLE t2(c1 string, c2 string); > EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1; > {code} > {noformat} > == Physical Plan == > *(4) Project [c1#5, c2#6] > +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner > :- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33] > : +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, > c2#6], Statistics(sizeInBytes=8.0 EiB) > +- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(c1#7, 200), true, [id=#46] > +- *(2) Filter isnotnull(c1#7) > +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], > Statistics(sizeInBytes=8.0 EiB) > {noformat} > We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query > performance: > {noformat} > == Physical Plan == > *(5) Project [c1#23, c2#24] > +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner > :- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 200), true, > [id=#95] > : +- *(1) Filter isnotnull(coalesce(c1#23, c2#24)) > : +- Scan hive default.t1 [c1#23, c2#24], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, > c2#24], Statistics(sizeInBytes=8.0 EiB) > +- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(c1#25, 200), true, [id=#103] > +- *(3) Filter isnotnull(c1#25) > +- Scan hive default.t2 [c1#25], HiveTableRelation > `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, > c2#26], Statistics(sizeInBytes=8.0 EiB) > {noformat} > Real performance test case: > !default.png! !infer.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org