Mostafa Mokhtar created HIVE-8769:
-------------------------------------

             Summary: Physical optimizer : Incorrect CE results in a shuffle 
join instead of a Map join (PK/FK pattern not detected)
                 Key: HIVE-8769
                 URL: https://issues.apache.org/jira/browse/HIVE-8769
             Project: Hive
          Issue Type: Bug
          Components: Physical Optimizer
    Affects Versions: 0.14.0
            Reporter: Mostafa Mokhtar
            Assignee: Prasanth J
             Fix For: 0.15.0


TPC-DS Q82 is running slower than hive 13 because the join type is not correct.

The estimate for item x inventory x date_dim is 227 Million rows while the 
actual is  3K rows.

Hive 13 finishes in  753  seconds.
Hive 14 finishes in  1,267  seconds.
Hive 14 + force map join finished in 431 seconds.

Query
{code}
select  i_item_id
       ,i_item_desc
       ,i_current_price
 from item, inventory, date_dim, store_sales
 where i_current_price between 30 and 30+30
 and inv_item_sk = i_item_sk
 and d_date_sk=inv_date_sk
 and d_date between '2002-05-30' and '2002-07-30'
 and i_manufact_id in (437,129,727,663)
 and inv_quantity_on_hand between 100 and 500
 and ss_item_sk = i_item_sk
 group by i_item_id,i_item_desc,i_current_price
 order by i_item_id
 limit 100
{code}

Plan 
{code}
STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Map 7 <- Map 1 (BROADCAST_EDGE), Map 2 (BROADCAST_EDGE)
        Reducer 4 <- Map 3 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE)
        Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
        Reducer 6 <- Reducer 5 (SIMPLE_EDGE)
      DagName: mmokhtar_20141106005353_7a2eb8df-12ff-4fe9-89b4-30f1e4e3fb90:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: item
                  filterExpr: ((i_current_price BETWEEN 30 AND 60 and 
(i_manufact_id) IN (437, 129, 727, 663)) and i_item_sk is not null) (type: 
boolean)
                  Statistics: Num rows: 462000 Data size: 663862160 Basic 
stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ((i_current_price BETWEEN 30 AND 60 and 
(i_manufact_id) IN (437, 129, 727, 663)) and i_item_sk is not null) (type: 
boolean)
                    Statistics: Num rows: 115500 Data size: 34185680 Basic 
stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: i_item_sk (type: int), i_item_id (type: 
string), i_item_desc (type: string), i_current_price (type: float)
                      outputColumnNames: _col0, _col1, _col2, _col3
                      Statistics: Num rows: 115500 Data size: 33724832 Basic 
stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 115500 Data size: 33724832 Basic 
stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col1 (type: string), _col2 (type: 
string), _col3 (type: float)
            Execution mode: vectorized
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: date_dim
                  filterExpr: (d_date BETWEEN '2002-05-30' AND '2002-07-30' and 
d_date_sk is not null) (type: boolean)
                  Statistics: Num rows: 73049 Data size: 81741831 Basic stats: 
COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (d_date BETWEEN '2002-05-30' AND '2002-07-30' 
and d_date_sk is not null) (type: boolean)
                    Statistics: Num rows: 36524 Data size: 3579352 Basic stats: 
COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: d_date_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 36524 Data size: 146096 Basic 
stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 36524 Data size: 146096 Basic 
stats: COMPLETE Column stats: COMPLETE
                      Select Operator
                        expressions: _col0 (type: int)
                        outputColumnNames: _col0
                        Statistics: Num rows: 36524 Data size: 146096 Basic 
stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 18262 Data size: 73048 Basic 
stats: COMPLETE Column stats: COMPLETE
                          Dynamic Partitioning Event Operator
                            Target Input: inventory
                            Partition key expr: inv_date_sk
                            Statistics: Num rows: 18262 Data size: 73048 Basic 
stats: COMPLETE Column stats: COMPLETE
                            Target column: inv_date_sk
                            Target Vertex: Map 7
            Execution mode: vectorized
        Map 3 
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: ss_item_sk is not null (type: boolean)
                  Statistics: Num rows: 82510879939 Data size: 6873789738208 
Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ss_item_sk is not null (type: boolean)
                    Statistics: Num rows: 82510879939 Data size: 330043519756 
Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ss_item_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 82510879939 Data size: 330043519756 
Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 82510879939 Data size: 
330043519756 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized
        Map 7 
            Map Operator Tree:
                TableScan
                  alias: inventory
                  filterExpr: (inv_quantity_on_hand BETWEEN 100 AND 500 and 
inv_item_sk is not null) (type: boolean)
                  Statistics: Num rows: 1,627,857,000 Data size: 19208695084 
Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (inv_quantity_on_hand BETWEEN 100 AND 500 and 
inv_item_sk is not null) (type: boolean)
                    Statistics: Num rows: 813,928,500 Data size: 9604347540 
Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: inv_item_sk (type: int), inv_date_sk (type: 
int)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 813,928,500 Data size: 6511428000 
Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        condition expressions:
                          0 {_col2}
                          1 {_col0} {_col1} {_col2} {_col3}
                        keys:
                          0 _col0 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col2, _col3, _col4, _col5, _col6
                        input vertices:
                          1 Map 1
                        Statistics: Num rows: 203,482,128 Data size: 
59416781376 Basic stats: COMPLETE Column stats: COMPLETE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          condition expressions:
                            0 {_col3} {_col4} {_col5} {_col6}
                            1 
                          keys:
                            0 _col2 (type: int)
                            1 _col0 (type: int)
                          outputColumnNames: _col3, _col4, _col5, _col6
                          input vertices:
                            1 Map 2
                          Statistics: Num rows: 227,514,273 Data size: 
66434167716 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: _col3 (type: int), _col4 (type: 
string), _col5 (type: string), _col6 (type: float)
                            outputColumnNames: _col3, _col4, _col5, _col6
                            Statistics: Num rows: 227514273 Data size: 
66434167716 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col3 (type: int)
                              sort order: +
                              Map-reduce partition columns: _col3 (type: int)
                              Statistics: Num rows: 227514273 Data size: 
66434167716 Basic stats: COMPLETE Column stats: COMPLETE
                              value expressions: _col4 (type: string), _col5 
(type: string), _col6 (type: float)
            Execution mode: vectorized
        Reducer 4 
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Inner Join 0 to 1
                condition expressions:
                  0 
                  1 {VALUE._col3} {VALUE._col4} {VALUE._col5}
                outputColumnNames: _col5, _col6, _col7
                Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic 
stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: _col5 (type: string), _col6 (type: string), 
_col7 (type: float)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 5156859392 Data size: 1485175504896 
Basic stats: COMPLETE Column stats: COMPLETE
                  Group By Operator
                    keys: _col0 (type: string), _col1 (type: string), _col2 
(type: float)
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 5156859392 Data size: 1485175504896 
Basic stats: COMPLETE Column stats: COMPLETE
                    Reduce Output Operator
                      key expressions: _col0 (type: string), _col1 (type: 
string), _col2 (type: float)
                      sort order: +++
                      Map-reduce partition columns: _col0 (type: string), _col1 
(type: string), _col2 (type: float)
                      Statistics: Num rows: 5156859392 Data size: 1485175504896 
Basic stats: COMPLETE Column stats: COMPLETE
        Reducer 5 
            Reduce Operator Tree:
              Group By Operator
                keys: KEY._col0 (type: string), KEY._col1 (type: string), 
KEY._col2 (type: float)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic 
stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: _col0 (type: string), _col1 (type: string), 
_col2 (type: float)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 5156859392 Data size: 1485175504896 
Basic stats: COMPLETE Column stats: COMPLETE
                  Reduce Output Operator
                    key expressions: _col0 (type: string)
                    sort order: +
                    Statistics: Num rows: 5156859392 Data size: 1485175504896 
Basic stats: COMPLETE Column stats: COMPLETE
                    TopN Hash Memory Usage: 0.04
                    value expressions: _col1 (type: string), _col2 (type: float)
            Execution mode: vectorized
        Reducer 6 
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: string), VALUE._col0 
(type: string), VALUE._col1 (type: float)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic 
stats: COMPLETE Column stats: COMPLETE
                Limit
                  Number of rows: 100
                  Statistics: Num rows: 100 Data size: 28800 Basic stats: 
COMPLETE Column stats: COMPLETE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 100 Data size: 28800 Basic stats: 
COMPLETE Column stats: COMPLETE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Execution mode: vectorized

  Stage: Stage-0
    Fetch Operator
      limit: 100
      Processor Tree:
        ListSink
{code}

Actual rows counts 
{code}
VERTICES         TOTAL_TASKS  FAILED_ATTEMPTS KILLED_TASKS DURATION_SECONDS    
CPU_TIME_MILLIS     GC_TIME_MILLIS  INPUT_RECORDS   OUTPUT_RECORDS 
Map 1                      1                0            0             1.01     
         1,280                  0        462,000               65
Map 2                      1                0            0             0.41     
           400                 23         10,000               62
Map 3                   2574                0            0           947.79     
   442,220,640          1,887,714 82,510,879,939   82,510,879,939
Map 7                      9                0            0           869.22     
        42,490              1,215     56,133,127            3,081
Reducer 4               1009                0            0           389.75     
    69,471,510          1,149,529 82,510,883,020               33
Reducer 5                253                0            0            93.73     
       938,930             26,150             33               33
Reducer 6                  1                0            0             2.08     
           730                 10             33               33
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to