[ 
https://issues.apache.org/jira/browse/HIVE-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781519#comment-13781519
 ] 

Prasanth J commented on HIVE-5324:
----------------------------------

I do not understand your point about making this a property of the object. Can 
you please elaborate on that?
I don't know whether other formats (RCFile, Avro, etc.) gather file/column 
statistics. If a format does, the existing code base duplicates the effort: 
both the file format and FileSinkOperator gather statistics. The newly added 
interface tries to avoid that. If the file format already gathers the 
statistics, it is much easier to get them directly from the record writer than 
to compute them for every row in FSop::processOp(). The stats publishing part 
remains unaffected: publishing happens only after stats gathering/aggregation, 
so this interface has no impact on the stats that are published to the 
metastore. This feature is disabled if hive.stats.autogather is set to false. 
Since gathering these statistics is a relatively cheap operation (formats like 
ORC collect statistics by default), we can enable this feature by default as 
well.
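
To make the idea concrete, here is a rough, self-contained sketch of the 
pattern described above. It is not the code in the patch: every name in it 
(Stats, RecordWriter, StatsProvidingWriter, SelfTrackingWriter, PlainWriter, 
Sink) is a hypothetical stand-in, and the boolean flag only mimics the role of 
hive.stats.autogather. The sink asks the writer for statistics when the writer 
can provide them, and falls back to per-row counting otherwise.

```java
import java.util.Optional;

class StatsSketch {
    /** Holder for the two statistics discussed: row count and raw data size. */
    static final class Stats {
        final long rowCount;
        final long rawDataSize;
        Stats(long rowCount, long rawDataSize) {
            this.rowCount = rowCount;
            this.rawDataSize = rawDataSize;
        }
    }

    /** Hypothetical minimal record writer. */
    interface RecordWriter {
        void write(byte[] row);
        void close();
    }

    /** Hypothetical extension for writers that already track statistics. */
    interface StatsProvidingWriter extends RecordWriter {
        Stats getStats();
    }

    /** Stand-in for a format that keeps statistics internally (as ORC does). */
    static final class SelfTrackingWriter implements StatsProvidingWriter {
        private long rows;
        private long bytes;
        public void write(byte[] row) { rows++; bytes += row.length; }
        public void close() { }
        public Stats getStats() { return new Stats(rows, bytes); }
    }

    /** Stand-in for a format that gathers no statistics of its own. */
    static final class PlainWriter implements RecordWriter {
        public void write(byte[] row) { }
        public void close() { }
    }

    /** Stand-in for the file sink; 'autogather' mimics hive.stats.autogather. */
    static final class Sink {
        private final RecordWriter writer;
        private final boolean autogather;
        private final boolean writerProvidesStats;
        private long rows;
        private long bytes;

        Sink(RecordWriter writer, boolean autogather) {
            this.writer = writer;
            this.autogather = autogather;
            this.writerProvidesStats = writer instanceof StatsProvidingWriter;
        }

        void process(byte[] row) {
            writer.write(row);
            // Count per row only when the writer cannot report stats itself.
            if (autogather && !writerProvidesStats) {
                rows++;
                bytes += row.length;
            }
        }

        /** Closes the writer and returns the gathered stats, if gathering is on. */
        Optional<Stats> close() {
            writer.close();
            if (!autogather) {
                return Optional.empty();
            }
            if (writerProvidesStats) {
                return Optional.of(((StatsProvidingWriter) writer).getStats());
            }
            return Optional.of(new Stats(rows, bytes));
        }
    }

    public static void main(String[] args) {
        Sink sink = new Sink(new SelfTrackingWriter(), true);
        sink.process(new byte[3]);
        sink.process(new byte[5]);
        sink.close().ifPresent(s ->
            System.out.println("rows=" + s.rowCount + " rawDataSize=" + s.rawDataSize));
    }
}
```

Whatever close() returns is what would be handed to the publishing step, so 
publishing does not need to know which of the two paths produced the numbers.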

> Extend record writer and ORC reader/writer interfaces to provide statistics
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-5324
>                 URL: https://issues.apache.org/jira/browse/HIVE-5324
>             Project: Hive
>          Issue Type: New Feature
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile, statistics
>             Fix For: 0.13.0
>
>         Attachments: HIVE-5324.1.patch.txt, HIVE-5324.2.patch.txt, 
> HIVE-5324.3.patch.txt, HIVE-5324.4.patch.txt
>
>
> The current implementation computes statistics (number of rows and raw 
> data size) for every single row processed. The processOp() method in 
> FileSinkOperator gets the raw data size for each row from the serde and 
> accumulates the size in a hashmap while counting the number of rows. These 
> accumulated statistics are then published to the metastore. 
> In the case of ORC, the format already stores enough statistics internally, 
> which can be used when publishing the stats to the metastore. This avoids the 
> duplication of work that happens in processOp(). Also, getting the 
> statistics directly from ORC is very cheap (they can be read directly from 
> the file footer).


