[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773098#comment-16773098
 ] 

Sahil Takiar commented on HIVE-20079:
-------------------------------------

How does ORC handle this? Is there a fundamental reason we can't mimic the same 
thing they are doing? Getting things to be consistent with how ORC handles this 
makes more sense to me than implementing two different approaches for ORC vs. 
Parquet and ending up with an inconsistent definition of {{rawDataSize}} 
depending on the file format. That said, this patch is probably a better estimate, 
so I see no reason not to proceed with it.

> Populate more accurate rawDataSize for parquet format
> -----------------------------------------------------
>
>                 Key: HIVE-20079
>                 URL: https://issues.apache.org/jira/browse/HIVE-20079
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 2.0.0
>            Reporter: Aihua Xu
>            Assignee: Antal Sinkovits
>            Priority: Major
>         Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch
>
>
> Run the following queries and you will see that rawDataSize for the table is 
> incorrectly reported as 4 (that is, the number of fields). We need to populate 
> the correct data size so that the data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>       COLUMN_STATS_ACCURATE   true
>       numFiles                1
>       numRows                 2
>       rawDataSize             4
>       totalSize               373
>       transient_lastDdlTime   1530660523
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
