[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773048#comment-16773048 ]
Antal Sinkovits edited comment on HIVE-20079 at 2/20/19 2:35 PM:
-----------------------------------------------------------------

[~stakiar] I'm afraid that's not an option, as it will cause a discrepancy between the calculations. Check my comment here: https://issues.apache.org/jira/browse/HIVE-20523

The Parquet docs say:
TotalByteSize: Total byte size of all the uncompressed column data in this row group

There might be some overhead when it's loaded into the hash table, but it's still a better estimate than the current one, which estimates 1 byte per column. More importantly, the estimation is consistent.

was (Author: asinkovits):
[~stakiar] I'm afraid that's not an option, as it will cause a discrepancy between the calculations. Check my comment here: https://issues.apache.org/jira/browse/HIVE-20523

> Populate more accurate rawDataSize for parquet format
> -----------------------------------------------------
>
>                 Key: HIVE-20079
>                 URL: https://issues.apache.org/jira/browse/HIVE-20079
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 2.0.0
>            Reporter: Aihua Xu
>            Assignee: Antal Sinkovits
>            Priority: Major
>         Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, HIVE-20079.3.patch
>
>
> Run the following queries and you will see that the raw data size for the table is incorrectly reported as 4 (that is, the number of fields). We need to populate the correct data size so the data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles                1
>   numRows                 2
>   rawDataSize             4
>   totalSize               373
>   transient_lastDdlTime   1530660523
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
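
To make the two estimates discussed in the comment concrete, here is a minimal sketch. It is not the code from the attached patches. RowGroupMeta, estimate_raw_data_size and legacy_raw_data_size are hypothetical names invented for illustration, and the byte sizes in the usage note are arbitrary. The legacy formula (one byte per column per row) is an assumption that happens to match the rawDataSize of 4 reported for 2 rows and 2 columns.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RowGroupMeta:
    """Hypothetical stand-in for one Parquet row group's footer metadata."""
    num_rows: int
    total_byte_size: int  # uncompressed size of all column data in the row group


def estimate_raw_data_size(row_groups: Iterable[RowGroupMeta]) -> int:
    """Proposed estimate: sum TotalByteSize over every row group in the file."""
    return sum(rg.total_byte_size for rg in row_groups)


def legacy_raw_data_size(num_rows: int, num_columns: int) -> int:
    """Assumed current behaviour: 1 byte per column for each row."""
    return num_rows * num_columns
```

With the table from the issue (2 rows, 2 columns), legacy_raw_data_size(2, 2) gives the reported value of 4. The proposed estimate depends only on the row group metadata, so any code path that reads the same footer should arrive at the same number, which is the consistency argument made in the comment.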