Chao Sun created HIVE-15796:
-------------------------------
Summary: HoS: poor reducer parallelism when operator stats are not
accurate
Key: HIVE-15796
URL: https://issues.apache.org/jira/browse/HIVE-15796
Project: Hive
Issue Type: Improvement
Components: Statistics
Affects Versions: 2.2.0
Reporter: Chao Sun
Assignee: Chao Sun
In HoS we use currently use operator stats to determine reducer parallelism.
However, it is often the case that operator stats are not accurate, especially
if column stats are not available. This sometimes will generate extremely poor
reducer parallelism, and cause HoS query to run forever.
This JIRA tries to offer an alternative way to compute reducer parallelism,
similar to how MR does. Here's the approach we are suggesting:
1. when computing the parallelism for a MapWork, use stats associated with the
TableScan operator;
2. when computing the parallelism for a ReduceWork, use the *maximum*
parallelism from all its parents.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)