[ 
https://issues.apache.org/jira/browse/HIVE-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302899#comment-14302899
 ] 

Prasanth Jayachandran commented on HIVE-9560:
---------------------------------------------

Try using
{code}
analyze table TABLE_NAME compute statistics noscan
{code}
OR
{code}
analyze table TABLE_NAME compute statistics partialscan
{code}

This should compute the raw data size correctly. Plain 'analyze table 
TABLE_NAME compute statistics' does not work for ORC because ORC does not 
implement SerDeStats, which some other formats implement. Implementing 
SerDeStats the traditional way would require ORC to report the serialized data 
size for each row, which is inefficient: every row has to be scanned, its 
stats fetched, and the results aggregated. Since ORC already collects column 
statistics, we can use that information to compute the raw data size without 
scanning each row. That's why noscan/partialscan is needed at the end (both do 
the same thing).
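
To check the result, something like the following sequence should work. This is 
only a sketch: TABLE_NAME is the same placeholder as above, the 'set' line is 
taken from the issue description, and 'describe extended' is used just to 
inspect 'rawDataSize' afterwards.
{code}
-- setting quoted from the issue description
set hive.stats.collect.rawdatasize=true;

-- gather stats from the ORC metadata instead of scanning every row
analyze table TABLE_NAME compute statistics noscan;

-- inspect the table parameters; rawDataSize should no longer be 0
describe extended TABLE_NAME;
{code}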

> When hive.stats.collect.rawdatasize=true, 'rawDataSize' for an ORC table will 
> result in value '0' after running 'analyze table TABLE_NAME compute 
> statistics;'
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-9560
>                 URL: https://issues.apache.org/jira/browse/HIVE-9560
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Xin Hao
>
> When hive.stats.collect.rawdatasize=true, 'rawDataSize' for an ORC table will 
> result in value '0' after running 'analyze table TABLE_NAME compute 
> statistics;'
> Steps to reproduce:
> (1) set hive.stats.collect.rawdatasize=true;
> (2) Generate an ORC table in Hive whose 'rawDataSize' value is NOT zero.
> You can find the value of 'rawDataSize' (NOT zero) by executing 'describe 
> extended TABLE_NAME;'
> (4) Execute 'analyze table TABLE_NAME compute statistics;'
> (5) Execute 'describe extended TABLE_NAME;' again, and you will find that 
> the value of 'rawDataSize' has changed to '0'.



