David Mollitor created HIVE-23054:
-------------------------------------
Summary: Capture Total Byte Size in Column Statistics
Key: HIVE-23054
URL: https://issues.apache.org/jira/browse/HIVE-23054
Project: Hive
Issue Type: Improvement
Components: CBO, Statistics
Reporter: David Mollitor
Store a counter in the HMS column statistics for the total number of (raw)
bytes in each column.
Currently, there is no good way to merge the average column length when
performing an INSERT statement into a table. The code simply selects the
maximum of the two values. However, if a single record with a long length
(128 bytes) is inserted into a table that has millions of strings with an
average length of 4, the average length for the entire data set gets boosted
to 128.
{code:java}
aggregateData.setAvgColLen(Math.max(aggregateData.getAvgColLen(),
newData.getAvgColLen()));
{code}
https://github.com/apache/hive/blob/e182d9ce6c09136d13ee889ef069b202f60052ec/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/StringColumnStatsMerger.java#L34
Store the total raw size of all the data in each column. Given the total raw
size and the average length, one can compute the real average length when
merging the existing data and the newly inserted data.
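As a rough illustration of the proposal, the sketch below merges two sets of
string column statistics by summing their total byte sizes and value counts,
then derives the average from the sums. The class AvgColLenMerge, the holder
StringStats, and its fields are hypothetical names invented for this example
and are not part of the Hive metastore API. It also assumes a value count is
available for each side (it could equally be derived as total bytes divided by
the stored average length).

```java
public final class AvgColLenMerge {

    /** Hypothetical stats holder; not a Hive class. */
    public static final class StringStats {
        public final long numValues;   // assumed: number of values in the column
        public final long totalBytes;  // proposed counter: total raw bytes in the column

        public StringStats(long numValues, long totalBytes) {
            this.numValues = numValues;
            this.totalBytes = totalBytes;
        }

        /** Average length derived from the two counters; 0 for an empty column. */
        public double avgColLen() {
            return numValues == 0 ? 0.0 : (double) totalBytes / numValues;
        }
    }

    /** Merge by adding the counters, so the derived average is a true weighted average. */
    public static StringStats merge(StringStats existing, StringStats inserted) {
        return new StringStats(
            Math.addExact(existing.numValues, inserted.numValues),
            Math.addExact(existing.totalBytes, inserted.totalBytes));
    }

    /** The behavior described in this issue: keep the larger of the two averages. */
    public static double mergeByMax(double existingAvg, double insertedAvg) {
        return Math.max(existingAvg, insertedAvg);
    }

    public static void main(String[] args) {
        // One million strings averaging 4 bytes, plus a single 128-byte record.
        StringStats existing = new StringStats(1_000_000L, 4_000_000L);
        StringStats inserted = new StringStats(1L, 128L);

        StringStats merged = merge(existing, inserted);
        System.out.println("max-based average:   "
            + mergeByMax(existing.avgColLen(), inserted.avgColLen()));
        System.out.println("total-based average: " + merged.avgColLen());
    }
}
```

With the numbers from the description, the max-based merge reports 128, while
the total-based merge stays very close to 4 (4,000,128 bytes over 1,000,001
values).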
--
This message was sent by Atlassian Jira
(v8.3.4#803005)