[
https://issues.apache.org/jira/browse/HIVE-29534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stamatis Zampetakis updated HIVE-29534:
---------------------------------------
Description:
The StatsUtils#getColStatistics method does not fetch/update NDV and HLL stats
for DATE/TIMESTAMP column types from the metastore. As a result, the NDV/HLL
statistics for such columns are either zero or empty, leading to sub-optimal
query plans.
[https://github.com/apache/hive/blob/bbd83dff5bfc8b8ce018476391469da3331216dd/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L840]
[https://github.com/apache/hive/blob/bbd83dff5bfc8b8ce018476391469da3331216dd/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L870]
Adding this info seems to change the output of about 100 .out files.
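
For illustration only, the gap described above can be sketched with stand-in classes. Everything here (NdvSketch, MetastoreStats, ColStats, before, after) is hypothetical and is not Hive's actual code or API; the sketch only assumes that the DATE/TIMESTAMP branch skips copying the distinct-value count, as this issue reports.

```java
import java.util.Locale;

public class NdvSketch {
    // Hypothetical stand-in for per-column statistics stored in the metastore.
    public static final class MetastoreStats {
        final String type;
        final long numNulls;
        final long numDVs;

        public MetastoreStats(String type, long numNulls, long numDVs) {
            this.type = type;
            this.numNulls = numNulls;
            this.numDVs = numDVs;
        }
    }

    // Hypothetical stand-in for the optimizer-side column statistics.
    public static final class ColStats {
        public long numNulls;
        public long countDistinct; // remains 0 unless explicitly copied
    }

    // Mimics the reported behaviour: NDV is not copied for DATE/TIMESTAMP.
    public static ColStats before(MetastoreStats ms) {
        ColStats cs = new ColStats();
        cs.numNulls = ms.numNulls;
        switch (ms.type.toLowerCase(Locale.ROOT)) {
            case "int":
            case "bigint":
                cs.countDistinct = ms.numDVs;
                break;
            case "date":
            case "timestamp":
                // NDV is available in ms.numDVs but never copied.
                break;
            default:
                break;
        }
        return cs;
    }

    // Sketch of the proposed direction: also copy NDV for DATE/TIMESTAMP.
    public static ColStats after(MetastoreStats ms) {
        ColStats cs = new ColStats();
        cs.numNulls = ms.numNulls;
        switch (ms.type.toLowerCase(Locale.ROOT)) {
            case "int":
            case "bigint":
            case "date":
            case "timestamp":
                cs.countDistinct = ms.numDVs;
                break;
            default:
                break;
        }
        return cs;
    }

    public static void main(String[] args) {
        MetastoreStats dateCol = new MetastoreStats("date", 3L, 365L);
        System.out.println("before: NDV=" + before(dateCol).countDistinct);
        System.out.println("after:  NDV=" + after(dateCol).countDistinct);
    }
}
```

In this sketch, an NDV left at 0 is what an optimizer would then see for the column, which is the kind of input that can skew cardinality estimates.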
was:
Technically, the method is missing stats for multiple data types. The most
important ones seem to be setCountDistint() for DATE_TYPE_NAME and
TIMESTAMP_TYPE_NAME.
The TIMESTAMP datatype could also benefit from setBitVectors(), for which the
info also appears to be available.
As a result, the NDV of columns of these data types is assigned a value
of 0, which could negatively impact execution planning of some queries.
[https://github.com/apache/hive/blob/bbd83dff5bfc8b8ce018476391469da3331216dd/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L840]
[https://github.com/apache/hive/blob/bbd83dff5bfc8b8ce018476391469da3331216dd/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L870]
Adding this info seems to change the output of about 100 .out files.
> Missing NDV and HLL stats for DATE/TIMESTAMP columns during optimization
> ------------------------------------------------------------------------
>
> Key: HIVE-29534
> URL: https://issues.apache.org/jira/browse/HIVE-29534
> Project: Hive
> Issue Type: Bug
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Major
> Labels: pull-request-available
> Attachments: 6406.patch
>
>
> The StatsUtils#getColStatistics method does not fetch/update NDV and HLL
> stats for DATE/TIMESTAMP column types from the metastore. As a result, the
> NDV/HLL statistics for such columns are either zero or empty, leading to
> sub-optimal query plans.
> [https://github.com/apache/hive/blob/bbd83dff5bfc8b8ce018476391469da3331216dd/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L840]
> [https://github.com/apache/hive/blob/bbd83dff5bfc8b8ce018476391469da3331216dd/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L870]
> Adding this info seems to change the output of about 100 .out files.