[ 
https://issues.apache.org/jira/browse/SPARK-15453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-15453:
--------------------------------
    Description: 
Datasource allows creation of bucketed and sorted tables but performing joins 
on such tables still does not utilize this metadata to produce optimal query 
plan.

As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
be hashed + sorted on relevant columns.

{quote}
== Physical Plan ==
WholeStageCodegen
:  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
:     :- INPUT
:     +- INPUT
:- WholeStageCodegen
:  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
:  :     +- INPUT
:  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
:     +- WholeStageCodegen
:        :  +- Project [j#20,k#21,i#22]
:        :     +- Filter (isnotnull(k#21) && isnotnull(j#20))
:        :        +- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
InputPaths: file:/XXXXXXX/table7, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct<j:int,k:string>
+- WholeStageCodegen
   :  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
   :     +- INPUT
   +- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
      +- WholeStageCodegen
         :  +- Project [j#23,k#24,i#25]
         :     +- Filter (isnotnull(k#24) && isnotnull(j#23))
         :        +- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
InputPaths: file:/XXXXXXX/table8, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct<j:int,k:string>
{quote}

  was:
Datasource allows creation of bucketed and sorted tables but performing joins 
on such tables still does not utilize this metadata to produce optimal query 
plan.

As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
be hashed + sorted on relevant columns.

```
== Physical Plan ==
WholeStageCodegen
:  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
:     :- INPUT
:     +- INPUT
:- WholeStageCodegen
:  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
:  :     +- INPUT
:  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
:     +- WholeStageCodegen
:        :  +- Project [j#20,k#21,i#22]
:        :     +- Filter (isnotnull(k#21) && isnotnull(j#20))
:        :        +- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
InputPaths: file:/XXXXXXX/table7, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct<j:int,k:string>
+- WholeStageCodegen
   :  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
   :     +- INPUT
   +- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
      +- WholeStageCodegen
         :  +- Project [j#23,k#24,i#25]
         :     +- Filter (isnotnull(k#24) && isnotnull(j#23))
         :        +- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
InputPaths: file:/XXXXXXX/table8, PushedFilters: [IsNotNull(k), IsNotNull(j)], 
ReadSchema: struct<j:int,k:string>
```


> Support for SMB Join
> --------------------
>
>                 Key: SPARK-15453
>                 URL: https://issues.apache.org/jira/browse/SPARK-15453
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tejas Patil
>            Priority: Minor
>
> Datasource allows creation of bucketed and sorted tables but performing joins 
> on such tables still does not utilize this metadata to produce optimal query 
> plan.
> As below, the `Exchange` and `Sort` can be avoided if the tables are known to 
> be hashed + sorted on relevant columns.
> {quote}
> == Physical Plan ==
> WholeStageCodegen
> :  +- SortMergeJoin [j#20,k#21,i#22], [j#23,k#24,i#25], Inner, None
> :     :- INPUT
> :     +- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [j#20 ASC,k#21 ASC,i#22 ASC], false, 0
> :  :     +- INPUT
> :  +- Exchange hashpartitioning(j#20, k#21, i#22, 200), None
> :     +- WholeStageCodegen
> :        :  +- Project [j#20,k#21,i#22]
> :        :     +- Filter (isnotnull(k#21) && isnotnull(j#20))
> :        :        +- Scan orc default.table7[j#20,k#21,i#22] Format: ORC, 
> InputPaths: file:/XXXXXXX/table7, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct<j:int,k:string>
> +- WholeStageCodegen
>    :  +- Sort [j#23 ASC,k#24 ASC,i#25 ASC], false, 0
>    :     +- INPUT
>    +- Exchange hashpartitioning(j#23, k#24, i#25, 200), None
>       +- WholeStageCodegen
>          :  +- Project [j#23,k#24,i#25]
>          :     +- Filter (isnotnull(k#24) && isnotnull(j#23))
>          :        +- Scan orc default.table8[j#23,k#24,i#25] Format: ORC, 
> InputPaths: file:/XXXXXXX/table8, PushedFilters: [IsNotNull(k), 
> IsNotNull(j)], ReadSchema: struct<j:int,k:string>
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to