George Pachitariu created HIVE-20523:
----------------------------------------
Summary: Improve table statistics when the table contains arrays
Key: HIVE-20523
URL: https://issues.apache.org/jira/browse/HIVE-20523
Project: Hive
Issue Type: Improvement
Components: Physical Optimizer
Reporter: George Pachitariu
Assignee: George Pachitariu
By default, when a table has table statistics, the value of *rawDataSize* is
used to estimate the table's data size in the execution plan.
The problem is that *rawDataSize* does not include the data size of arrays, so
the table size is underestimated when arrays make up most of the table.
In those specific cases, the value of *totalSize* is much closer to the
truth.
In this task I propose to take the maximum of *rawDataSize* and
*totalSize* multiplied by deserializationFactor.
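The proposed rule can be sketched as follows. This is only an illustration under stated assumptions, not the actual Hive code: the class `TableSizeEstimate` and the method `estimateDataSize` are hypothetical names, the deserialization factor is passed in as a plain number, and the sample values in `main` are made up.

```java
// Hypothetical helper illustrating the proposal; not part of Hive's API.
final class TableSizeEstimate {
    private TableSizeEstimate() {}

    /**
     * Returns the larger of rawDataSize and totalSize scaled by the
     * deserialization factor, so that a rawDataSize that omits array
     * data does not drag the estimate down.
     */
    static long estimateDataSize(long rawDataSize, long totalSize,
                                 double deserializationFactor) {
        // Scale the on-disk size to approximate the in-memory size.
        long scaledTotalSize = (long) (totalSize * deserializationFactor);
        return Math.max(rawDataSize, scaledTotalSize);
    }

    public static void main(String[] args) {
        // Array-heavy table (made-up numbers): rawDataSize misses most of the data,
        // so the scaled totalSize is chosen.
        System.out.println(estimateDataSize(1_000L, 50_000L, 10.0));
        // Table where rawDataSize is already the larger value: it is kept as-is.
        System.out.println(estimateDataSize(900_000L, 50_000L, 10.0));
    }
}
```

Because the result is a maximum, the estimate can never be smaller than the current one based on *rawDataSize* alone. It can only grow, and it grows exactly when the scaled *totalSize* is larger.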
I don't know whether this proposal will backfire in some cases by
overestimating the size of tables.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)