[
https://issues.apache.org/jira/browse/HIVE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mostafa Mokhtar updated HIVE-8196:
----------------------------------
Description:
To make the best out of dynamic partition pruning joins should be on the
partitioning columns which results in dynamically pruning the partitions from
the fact table based on the qualifying column keys from the dimension table,
this type of joins negatively effects on cardinality estimates with fetch
column stats enabled.
Currently we don't have statistics for partition columns and as a result NDV is
set to row count, doing that negatively affects the estimated join selectivity
from the join.
Workaround is to capture statistics for partition columns or use number of
partitions incase dynamic partitioning is used.
In StatsUtils.getColStatisticsFromExpression is where count distincts gets set
to row count
{code}
if (encd.getIsPartitionColOrVirtualCol()) {
// vitual columns
colType = encd.getTypeInfo().getTypeName();
countDistincts = numRows;
oi = encd.getWritableObjectInspector();
{code}
Query used to repro the issue :
{code}
set hive.stats.fetch.column.stats=ture;
set hive.tez.dynamic.partition.pruning=true;
explain select d_date
from store_sales, date_dim
where
store_sales.ss_sold_date_sk = date_dim.d_date_sk and
date_dim.d_year = 1998;
{code}
Plan
{code}
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Map 1 <- Map 2 (BROADCAST_EDGE)
DagName: mmokhtar_20140919180404_945d29f5-d041-4420-9666-1c5d64fa6540:8
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: store_sales
filterExpr: ss_sold_date_sk is not null (type: boolean)
Statistics: Num rows: 550076554 Data size: 47370018816 Basic
stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {ss_sold_date_sk}
1 {d_date_sk} {d_date}
keys:
0 ss_sold_date_sk (type: int)
1 d_date_sk (type: int)
outputColumnNames: _col22, _col26, _col28
input vertices:
1 Map 2
Statistics: Num rows: 652 Data size: 66504 Basic stats:
COMPLETE Column stats: COMPLETE
Filter Operator
predicate: (_col22 = _col26) (type: boolean)
Statistics: Num rows: 326 Data size: 33252 Basic stats:
COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col28 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 326 Data size: 30644 Basic stats:
COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 326 Data size: 30644 Basic
stats: COMPLETE Column stats: COMPLETE
table:
input format:
org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Map 2
Map Operator Tree:
TableScan
alias: date_dim
filterExpr: (d_date_sk is not null and (d_year = 1998))
(type: boolean)
Statistics: Num rows: 73049 Data size: 81741831 Basic stats:
COMPLETE Column stats: COMPLETE
Filter Operator
predicate: (d_date_sk is not null and (d_year = 1998))
(type: boolean)
Statistics: Num rows: 652 Data size: 66504 Basic stats:
COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: d_date_sk (type: int)
sort order: +
Map-reduce partition columns: d_date_sk (type: int)
Statistics: Num rows: 652 Data size: 66504 Basic stats:
COMPLETE Column stats: COMPLETE
value expressions: d_date (type: string)
Select Operator
expressions: d_date_sk (type: int)
outputColumnNames: _col0
Statistics: Num rows: 652 Data size: 0 Basic stats:
PARTIAL Column stats: COMPLETE
Group By Operator
keys: _col0 (type: int)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 652 Data size: 0 Basic stats:
PARTIAL Column stats: COMPLETE
Dynamic Partitioning Event Operator
Target Input: store_sales
Partition key expr: ss_sold_date_sk
Statistics: Num rows: 652 Data size: 0 Basic stats:
PARTIAL Column stats: COMPLETE
Target column: ss_sold_date_sk
Target Vertex: Map 1
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
was:
To make the best out of dynamic partition pruning joins should be on the
partitioning columns which results in dynamically pruning the partitions from
the fact table based on the qualifying column keys from the dimension table,
this type of joins negatively effects on cardinality estimates with fetch
column stats enabled.
Currently we don't have statistics for partition columns and as a result NDV is
set to row count, doing that negatively affects the estimated join selectivity
from the join.
Workaround is to capture statistics for partition columns or use number of
partitions incase dynamic partitioning is used.
In StatsUtils.getColStatisticsFromExpression is where count distincts gets set
to row count
{code}
if (encd.getIsPartitionColOrVirtualCol()) {
// vitual columns
colType = encd.getTypeInfo().getTypeName();
countDistincts = numRows;
oi = encd.getWritableObjectInspector();
{code}
Query used to repro the issue :
{code}
set hive.stats.fetch.column.stats=ture;
set hive.tez.dynamic.partition.pruning=true;
explain select d_date
from store_sales, date_dim
where
store_sales.ss_sold_date_sk = date_dim.d_date_sk and
date_dim.d_year = 1998;
{code}
Plan
> Joining on partition columns with fetch column stats enabled results it very
> small CE which negatively affects query performance
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-8196
> URL: https://issues.apache.org/jira/browse/HIVE-8196
> Project: Hive
> Issue Type: Bug
> Components: Physical Optimizer
> Affects Versions: 0.14.0
> Reporter: Mostafa Mokhtar
> Assignee: Prasanth J
> Priority: Critical
> Labels: performance
> Fix For: 0.14.0
>
>
> To make the best out of dynamic partition pruning joins should be on the
> partitioning columns which results in dynamically pruning the partitions from
> the fact table based on the qualifying column keys from the dimension table,
> this type of joins negatively effects on cardinality estimates with fetch
> column stats enabled.
> Currently we don't have statistics for partition columns and as a result NDV
> is set to row count, doing that negatively affects the estimated join
> selectivity from the join.
> Workaround is to capture statistics for partition columns or use number of
> partitions incase dynamic partitioning is used.
> In StatsUtils.getColStatisticsFromExpression is where count distincts gets
> set to row count
> {code}
> if (encd.getIsPartitionColOrVirtualCol()) {
> // vitual columns
> colType = encd.getTypeInfo().getTypeName();
> countDistincts = numRows;
> oi = encd.getWritableObjectInspector();
> {code}
> Query used to repro the issue :
> {code}
> set hive.stats.fetch.column.stats=ture;
> set hive.tez.dynamic.partition.pruning=true;
> explain select d_date
> from store_sales, date_dim
> where
> store_sales.ss_sold_date_sk = date_dim.d_date_sk and
> date_dim.d_year = 1998;
> {code}
> Plan
> {code}
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> Edges:
> Map 1 <- Map 2 (BROADCAST_EDGE)
> DagName: mmokhtar_20140919180404_945d29f5-d041-4420-9666-1c5d64fa6540:8
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: store_sales
> filterExpr: ss_sold_date_sk is not null (type: boolean)
> Statistics: Num rows: 550076554 Data size: 47370018816
> Basic stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {ss_sold_date_sk}
> 1 {d_date_sk} {d_date}
> keys:
> 0 ss_sold_date_sk (type: int)
> 1 d_date_sk (type: int)
> outputColumnNames: _col22, _col26, _col28
> input vertices:
> 1 Map 2
> Statistics: Num rows: 652 Data size: 66504 Basic stats:
> COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: (_col22 = _col26) (type: boolean)
> Statistics: Num rows: 326 Data size: 33252 Basic stats:
> COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: _col28 (type: string)
> outputColumnNames: _col0
> Statistics: Num rows: 326 Data size: 30644 Basic
> stats: COMPLETE Column stats: COMPLETE
> File Output Operator
> compressed: false
> Statistics: Num rows: 326 Data size: 30644 Basic
> stats: COMPLETE Column stats: COMPLETE
> table:
> input format:
> org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Execution mode: vectorized
> Map 2
> Map Operator Tree:
> TableScan
> alias: date_dim
> filterExpr: (d_date_sk is not null and (d_year = 1998))
> (type: boolean)
> Statistics: Num rows: 73049 Data size: 81741831 Basic
> stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: (d_date_sk is not null and (d_year = 1998))
> (type: boolean)
> Statistics: Num rows: 652 Data size: 66504 Basic stats:
> COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: d_date_sk (type: int)
> sort order: +
> Map-reduce partition columns: d_date_sk (type: int)
> Statistics: Num rows: 652 Data size: 66504 Basic stats:
> COMPLETE Column stats: COMPLETE
> value expressions: d_date (type: string)
> Select Operator
> expressions: d_date_sk (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 652 Data size: 0 Basic stats:
> PARTIAL Column stats: COMPLETE
> Group By Operator
> keys: _col0 (type: int)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 652 Data size: 0 Basic stats:
> PARTIAL Column stats: COMPLETE
> Dynamic Partitioning Event Operator
> Target Input: store_sales
> Partition key expr: ss_sold_date_sk
> Statistics: Num rows: 652 Data size: 0 Basic stats:
> PARTIAL Column stats: COMPLETE
> Target column: ss_sold_date_sk
> Target Vertex: Map 1
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)