[ 
https://issues.apache.org/jira/browse/SPARK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117771#comment-17117771
 ] 

Yuming Wang commented on SPARK-31809:
-------------------------------------


{noformat}
hive> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1;
OK
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_0:t1
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_0:t1
          TableScan
            alias: t1
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
            Filter Operator
              predicate: COALESCE(c1,c2) is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
              Select Operator
                expressions: c1 (type: string), c2 (type: string)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 COALESCE(_col0,_col1) (type: string)
                    1 _col0 (type: string)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t2
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
            Filter Operator
              predicate: c1 is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
              Select Operator
                expressions: c1 (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 COALESCE(_col0,_col1) (type: string)
                    1 _col0 (type: string)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
                    table:
                        input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 1.281 seconds, Fetched: 67 row(s)
{noformat}


> Infer IsNotNull for all children of NullIntolerant expressions
> --------------------------------------------------------------
>
>                 Key: SPARK-31809
>                 URL: https://issues.apache.org/jira/browse/SPARK-31809
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>         Attachments: default.png, infer.png
>
>
> We should infer {{IsNotNull}} for all children of {{NullIntolerant}} 
> expressions. For example:
> {code:sql}
> CREATE TABLE t1(c1 string, c2 string);
> CREATE TABLE t2(c1 string, c2 string);
> EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON coalesce(t1.c1, t1.c2)=t2.c1;
> {code}
> {noformat}
> == Physical Plan ==
> *(4) Project [c1#5, c2#6]
> +- *(4) SortMergeJoin [coalesce(c1#5, c2#6)], [c1#7], Inner
>    :- *(1) Sort [coalesce(c1#5, c2#6) ASC NULLS FIRST], false, 0
>    :  +- Exchange hashpartitioning(coalesce(c1#5, c2#6), 200), true, [id=#33]
>    :     +- Scan hive default.t1 [c1#5, c2#6], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#5, 
> c2#6], Statistics(sizeInBytes=8.0 EiB)
>    +- *(3) Sort [c1#7 ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(c1#7, 200), true, [id=#46]
>          +- *(2) Filter isnotnull(c1#7)
>             +- Scan hive default.t2 [c1#7], HiveTableRelation `default`.`t2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8], 
> Statistics(sizeInBytes=8.0 EiB)
> {noformat}
> We should infer {{coalesce(t1.c1, t1.c2) IS NOT NULL}} to improve query 
> performance:
> {noformat}
> == Physical Plan ==
> *(5) Project [c1#23, c2#24]
> +- *(5) SortMergeJoin [coalesce(c1#23, c2#24)], [c1#25], Inner
>    :- *(2) Sort [coalesce(c1#23, c2#24) ASC NULLS FIRST], false, 0
>    :  +- Exchange hashpartitioning(coalesce(c1#23, c2#24), 200), true, 
> [id=#95]
>    :     +- *(1) Filter isnotnull(coalesce(c1#23, c2#24))
>    :        +- Scan hive default.t1 [c1#23, c2#24], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#23, 
> c2#24], Statistics(sizeInBytes=8.0 EiB)
>    +- *(4) Sort [c1#25 ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(c1#25, 200), true, [id=#103]
>          +- *(3) Filter isnotnull(c1#25)
>             +- Scan hive default.t2 [c1#25], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#25, 
> c2#26], Statistics(sizeInBytes=8.0 EiB)
> {noformat}
> Real performance test case:
>  !default.png!  !infer.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to