[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705161#comment-17705161
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148748310
##########
src/main/thrift/parquet.thrift:
##########
@@ -223,6 +223,17 @@ struct Statistics {
*/
5: optional binary max_value;
6: optional binary min_value;
+ /** The number of bytes the row/group or page would take if encoded with
plain-encoding */
+ 7: optional i64 plain_encoded_bytes;
Review Comment:
> We need to look at different levels of metadata or even perform some
computation to gather the information required above. So my point is to write
the raw size info for every data type (with logical type considered) and
store/aggregate them into page and column-chunk levels (or even file level?).
That would make life easier as the time spent in the planning stage is critical
to some analytics use cases.
@wgtmac would the following changes suffice to address your concerns:
1. Change the name of the fields to `logical_stored_value_bytes` and define
the byte count for each logical type. For Decimal, I'd propose using the size
its underlying physical type would take with plain encoding (for example, when
it is stored as BYTE_ARRAY). For consistency, I think this means BYTE_ARRAY
should also use the amount of space PLAIN_ENCODING would take.
2. Extract the three fields into a new struct, something like
`SizeEstimationStatistics`.
3. In addition to placing this struct into Statistics (which takes care of
column-level and page-level stats), also put it onto RowGroup? I'd hesitate to
put it at the file level because this seems out of character with other
metadata, and summing across row groups should be lightweight compared to the
overhead of parsing the FileMetadata anyway.
4. (Optional) If we were really concerned about optimizations, we could
convert the histogram to a cumulative distribution function, which would avoid
summing to get leaf-nulls.
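
To make points 1, 3 and 4 concrete, here is a rough Python sketch. It is not
taken from the PR; all function names are hypothetical. It relies on the PLAIN
encoding of BYTE_ARRAY being a 4-byte length prefix followed by the raw bytes.
It also assumes the histogram is a list of counts indexed by definition level,
and that the count being summed is the number of entries below the maximum
level; the thread does not spell either of these out.

```python
from itertools import accumulate


def plain_byte_array_size(values):
    """Bytes PLAIN encoding would use for non-null BYTE_ARRAY values:
    a 4-byte length prefix plus the raw bytes of each value."""
    return sum(4 + len(v) for v in values)


def total_across_row_groups(per_row_group_sizes):
    """File-level total derived by summing per-row-group sizes (point 3)."""
    return sum(per_row_group_sizes)


def to_cumulative(histogram):
    """Turn per-level counts into running totals (point 4)."""
    return list(accumulate(histogram))


def count_below_max_level(cumulative):
    """Entries whose level is below the maximum: a single lookup in the
    cumulative form instead of a sum over the histogram (assumed meaning)."""
    return cumulative[-2] if len(cumulative) > 1 else 0


# Hypothetical data, for illustration only.
row_group_values = [[b"ab", b"", b"xyz"], [b"hello"]]
sizes = [plain_byte_array_size(vs) for vs in row_group_values]
file_total = total_across_row_groups(sizes)

level_histogram = [3, 1, 6]  # counts for definition levels 0, 1, 2
cumulative = to_cumulative(level_histogram)
below_max = count_below_max_level(cumulative)
```

With the cumulative form, the original per-level counts are still recoverable
by differencing adjacent entries, so no information is lost.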
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)