[ 
https://issues.apache.org/jira/browse/HIVE-29534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis updated HIVE-29534:
---------------------------------------
    Description: 
The StatsUtils#getColStatistics method does not fetch/update NDV and HLL stats 
for DATE/TIMESTAMP column types from the metastore. As a result, the NDV/HLL 
statistics for such columns are either zero or empty, leading to sub-optimal 
query plans.

[https://github.com/apache/hive/blob/bbd83dff5bfc8b8ce018476391469da3331216dd/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L840]
[https://github.com/apache/hive/blob/bbd83dff5bfc8b8ce018476391469da3331216dd/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L870]

Adding this info seems to change the output of about 100 .out files
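
To make the gap concrete, here is a minimal, self-contained sketch of the 
behavior described above. It is not the Hive code: the class, method, and field 
names below are hypothetical stand-ins for the metastore's DATE stats object 
and the optimizer-side column statistics, and the real change lives at the 
StatsUtils lines linked above. The sketch only assumes that the metastore 
already holds an NDV and a serialized HLL bit vector for the column, as the 
description states.

```java
public class DateStatsSketch {

    /** Hypothetical stand-in for the metastore's per-column DATE statistics. */
    static final class MetastoreDateStats {
        final long numNulls;
        final long numDVs;        // number of distinct values (NDV)
        final byte[] bitVectors;  // serialized HLL sketch

        MetastoreDateStats(long numNulls, long numDVs, byte[] bitVectors) {
            this.numNulls = numNulls;
            this.numDVs = numDVs;
            this.bitVectors = bitVectors;
        }
    }

    /** Hypothetical stand-in for the optimizer-side column statistics. */
    static final class PlannerColStats {
        long numNulls;
        long countDistinct;  // stays 0 if never set
        byte[] bitVectors;   // stays null if never set
    }

    /** Mirrors the reported behavior: NDV and HLL data are never copied. */
    static PlannerColStats copyWithoutNdv(MetastoreDateStats src) {
        PlannerColStats cs = new PlannerColStats();
        cs.numNulls = src.numNulls;
        return cs;
    }

    /** Sketch of the intended behavior: NDV and HLL data are carried over. */
    static PlannerColStats copyWithNdv(MetastoreDateStats src) {
        PlannerColStats cs = new PlannerColStats();
        cs.numNulls = src.numNulls;
        cs.countDistinct = src.numDVs;
        cs.bitVectors = src.bitVectors == null ? null : src.bitVectors.clone();
        return cs;
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        MetastoreDateStats src =
            new MetastoreDateStats(3L, 365L, new byte[] {1, 2, 3});
        System.out.println("without fix: ndv=" + copyWithoutNdv(src).countDistinct);
        System.out.println("with fix:    ndv=" + copyWithNdv(src).countDistinct);
    }
}
```

With the first variant the optimizer sees an NDV of 0 and no bit vector for 
the column even though the metastore has both, which is the condition this 
issue reports.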

  was:
Technically, the method is missing stats for multiple data types. The most 
important ones seem to be setCountDistint() for DATE_TYPE_NAME and 
TIMESTAMP_TYPE_NAME.
The TIMESTAMP datatype could also benefit from setBitVectors(), for which the 
info also appears to be available.

As a result, the NDV of columns of these data types is assigned a value 
of 0, which could negatively impact execution planning of some queries.

[https://github.com/apache/hive/blob/bbd83dff5bfc8b8ce018476391469da3331216dd/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L840]

[https://github.com/apache/hive/blob/bbd83dff5bfc8b8ce018476391469da3331216dd/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L870]

Adding this info seems to change the output of about 100 .out files


> Missing NDV and HLL stats for DATE/TIMESTAMP columns during optimization
> ------------------------------------------------------------------------
>
>                 Key: HIVE-29534
>                 URL: https://issues.apache.org/jira/browse/HIVE-29534
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Konstantin Bereznyakov
>            Assignee: Konstantin Bereznyakov
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: 6406.patch
>
>
> The StatsUtils#getColStatistics method does not fetch/update NDV and HLL 
> stats for DATE/TIMESTAMP column types from the metastore. As a result, the 
> NDV/HLL statistics for such columns are either zero or empty, leading to 
> sub-optimal query plans.
> [https://github.com/apache/hive/blob/bbd83dff5bfc8b8ce018476391469da3331216dd/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L840]
> [https://github.com/apache/hive/blob/bbd83dff5bfc8b8ce018476391469da3331216dd/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L870]
> Adding this info seems to change the output of about 100 .out files



