[ 
https://issues.apache.org/jira/browse/HIVE-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094128#comment-16094128
 ] 

liyunzhang_intel commented on HIVE-17108:
-----------------------------------------

The detailed reason why Parquet files do not gather statistics such as "RAW DATA 
SIZE" automatically:
When executing "INSERT OVERWRITE TABLE xxx SELECT * xxx",
Hive with ORC updates statistics from the ORC footer in 
[FileSinkOperator#closeOp|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1060]
 while Hive with Parquet does not. 
OrcRecordWriter implements StatsProvidingRecordWriter.
ParquetRecordWriterWrapper does not implement StatsProvidingRecordWriter.

But I guess that even if ParquetRecordWriterWrapper implemented 
[StatsProvidingRecordWriter|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/StatsProvidingRecordWriter.java],
 statistics like "RAW DATA SIZE" could not be updated, because 
org.apache.parquet.hadoop.ParquetWriter does not provide methods like 
getRawDataSize() or getRawCount().
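
The difference described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not Hive's actual code: Stats, RecordWriter, OrcLikeWriter, ParquetLikeWriter and closeAndCollect are hypothetical stand-ins, and only the StatsProvidingRecordWriter name mirrors the real interface. It shows how a sink that checks for a stats-providing writer on close ends up with statistics for one writer and none for the other.

```java
import java.util.HashMap;
import java.util.Map;

public class StatsGatherSketch {
    // Hypothetical stand-in for the statistics object a writer could expose.
    static class Stats {
        long rawDataSize;
        long rowCount;
    }

    // Hypothetical stand-in for a plain record writer.
    interface RecordWriter {
        void write(String row);
        void close();
    }

    // Stand-in named after Hive's StatsProvidingRecordWriter (simplified).
    interface StatsProvidingRecordWriter extends RecordWriter {
        Stats getStats();
    }

    // Behaves like the ORC case: tracks statistics while writing.
    static class OrcLikeWriter implements StatsProvidingRecordWriter {
        private final Stats stats = new Stats();
        public void write(String row) {
            stats.rowCount++;
            stats.rawDataSize += row.length();
        }
        public void close() { }
        public Stats getStats() { return stats; }
    }

    // Behaves like the Parquet case: writes rows but exposes no statistics.
    static class ParquetLikeWriter implements RecordWriter {
        public void write(String row) { }
        public void close() { }
    }

    // Assumed shape of the check made when the sink closes: statistics are
    // collected only if the writer implements the stats-providing interface.
    static Map<String, Long> closeAndCollect(RecordWriter writer) {
        writer.close();
        Map<String, Long> collected = new HashMap<>();
        if (writer instanceof StatsProvidingRecordWriter) {
            Stats s = ((StatsProvidingRecordWriter) writer).getStats();
            collected.put("rawDataSize", s.rawDataSize);
            collected.put("numRows", s.rowCount);
        }
        return collected;
    }

    public static void main(String[] args) {
        RecordWriter orc = new OrcLikeWriter();
        RecordWriter parquet = new ParquetLikeWriter();
        for (String row : new String[] {"alice", "bob"}) {
            orc.write(row);
            parquet.write(row);
        }
        System.out.println("orc-like stats: " + closeAndCollect(orc));
        System.out.println("parquet-like stats: " + closeAndCollect(parquet));
    }
}
```

In this sketch the Parquet-like writer leaves the collected map empty, which corresponds to statistics not being updated unless they are computed separately.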

> Parquet file does not gather statistic such as "RAW DATA SIZE" automatically 
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-17108
>                 URL: https://issues.apache.org/jira/browse/HIVE-17108
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>
> In 
> [parquet_analyze.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/parquet_analyze.q#L27],
>  we need to run "ANALYZE TABLE parquet_create_people COMPUTE STATISTICS noscan" 
> to update the statistics. 
> In 
> [orc_analyze.q|https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/orc_analyze.q#L45],
>  we do not need to do that if we set hive.stats.autogather to true.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
