[jira] [Commented] (ORC-305) Add column statistics for the size on disk

Xiening Dai (JIRA) Wed, 21 Feb 2018 13:01:17 -0800

    [ 
https://issues.apache.org/jira/browse/ORC-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372004#comment-16372004
 ]


Xiening Dai commented on ORC-305:
---------------------------------

Would you consider adding the column raw data size (the size before encoding
and compression) as well? This would be useful in a couple of scenarios. For
example, a query optimizer can use it to estimate the input data size and thus
decide on the degree of parallelism, join algorithms, etc. The column raw size
can be deduced from the number of values (or from the total length, for a
string column). We just need to expose a new interface from ColumnStatistics.
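
As a rough illustration of the deduction described above, here is a minimal
sketch. It assumes the per-column value count and, for strings, the total
length are already available. The ColumnStats class, the estimate_raw_size
function, and the fixed-width table are all hypothetical names made up for
this example; none of them are ORC's actual API or type widths.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed in-memory widths in bytes per value. Illustrative only,
# not taken from the ORC specification.
FIXED_WIDTH_BYTES = {
    "boolean": 1,
    "tinyint": 1,
    "smallint": 2,
    "int": 4,
    "bigint": 8,
    "float": 4,
    "double": 8,
}


@dataclass
class ColumnStats:
    """Hypothetical stand-in for the statistics kept per column."""
    type_name: str
    number_of_values: int
    total_length: Optional[int] = None  # sum of string lengths; strings only


def estimate_raw_size(stats: ColumnStats) -> int:
    """Estimate the size in bytes of a column before encoding and compression."""
    if stats.type_name == "string":
        # For strings the raw size is the total length of all values.
        if stats.total_length is None:
            raise ValueError("string column needs total_length")
        return stats.total_length
    width = FIXED_WIDTH_BYTES.get(stats.type_name)
    if width is None:
        raise ValueError(f"no assumed width for type {stats.type_name!r}")
    # For fixed-width types the raw size is count times width.
    return stats.number_of_values * width
```

An optimizer could sum such estimates over the columns a query reads to get
an input size, without touching the data streams themselves.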

> Add column statistics for the size on disk
> ------------------------------------------
>
>                 Key: ORC-305
>                 URL: https://issues.apache.org/jira/browse/ORC-305
>             Project: ORC
>          Issue Type: Test
>            Reporter: Owen O'Malley
>            Assignee: Sandeep More
>            Priority: Major
>
> It would be great to have the size on disk of each column.
> You can generate this by adding up the sizes of the dictionary and data 
> streams.
> It is only relevant at the stripe and file level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
