[
https://issues.apache.org/jira/browse/ORC-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372004#comment-16372004
]
Xiening Dai commented on ORC-305:
---------------------------------
Would you consider adding the column raw data size (the size before encoding
and compression) as well? This would be useful in a couple of scenarios. For
example, a query optimizer can use it to estimate the input data size and thus
decide the degree of parallelism, join algorithms, etc. The column raw size can
be deduced from the number of values (or from the total length for string
columns). We just need to expose a new interface from ColumnStatistics.
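
To make the idea concrete, here is a rough sketch of the estimate. It is only
an illustration: the ColumnStats class, the Kind enum and the method names are
hypothetical stand-ins and not the existing ORC API, and the fixed width of
8 bytes per numeric value is an assumption.

```java
import java.util.Arrays;
import java.util.List;

public class RawSizeSketch {
    // Hypothetical column kinds, for illustration only (not ORC's type system).
    public enum Kind { LONG, DOUBLE, STRING }

    // Hypothetical stand-in for per-column statistics; not the ORC ColumnStatistics interface.
    public static final class ColumnStats {
        final Kind kind;
        final long numberOfValues;
        final long totalStringLength; // only meaningful for STRING columns

        public ColumnStats(Kind kind, long numberOfValues, long totalStringLength) {
            this.kind = kind;
            this.numberOfValues = numberOfValues;
            this.totalStringLength = totalStringLength;
        }
    }

    // Estimate the size of one column before encoding and compression.
    public static long estimateRawSize(ColumnStats stats) {
        switch (stats.kind) {
            case LONG:
            case DOUBLE:
                // Assumption: fixed-width values of 8 bytes each.
                return stats.numberOfValues * 8L;
            case STRING:
                // For strings, use the total length recorded in the statistics.
                return stats.totalStringLength;
            default:
                throw new IllegalArgumentException("unsupported kind: " + stats.kind);
        }
    }

    // Sum the per-column estimates, e.g. as an input-size estimate for an optimizer.
    public static long estimateTotalRawSize(List<ColumnStats> columns) {
        long total = 0;
        for (ColumnStats c : columns) {
            total += estimateRawSize(c);
        }
        return total;
    }

    public static void main(String[] args) {
        List<ColumnStats> columns = Arrays.asList(
            new ColumnStats(Kind.LONG, 1000, 0),
            new ColumnStats(Kind.STRING, 1000, 12345));
        System.out.println("estimated raw bytes: " + estimateTotalRawSize(columns));
    }
}
```

Null values and per-value overhead are ignored here, so the result should be
read as a rough estimate rather than an exact raw size.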
> Add column statistics for the size on disk
> ------------------------------------------
>
> Key: ORC-305
> URL: https://issues.apache.org/jira/browse/ORC-305
> Project: ORC
> Issue Type: Test
> Reporter: Owen O'Malley
> Assignee: Sandeep More
> Priority: Major
>
> It would be great to have the size on disk of each column.
> You can generate this by adding up the sizes of the dictionary and data
> streams.
> It is only relevant at the stripe and file level.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)