[
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705649#comment-17705649
]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------
emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1149682895
##########
src/main/thrift/parquet.thrift:
##########
@@ -190,6 +190,35 @@ enum FieldRepetitionType {
/** The field is repeated and can contain 0 or more values */
REPEATED = 2;
}
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ */
+struct SizeEstimationStatistics {
+ /**
+ * The number of logic bytes needed to store present/non-null values.
+ * Unless specified below, the computed size is the size it would take to plain-encode the underlying
+ * physical type.
+ * Special calculations:
+ * - Enum: plain-encoded BYTE_ARRAY size
+ * - Integers (same size used for signed and unsigned): int8 - 1 bytes, int16 - 2
+ * - Decimal - Each value is assumed to take the minimal number of bytes necessary to encode
Review Comment:
I originally had this. Given the two different opinions expressed, I'm going
to change this field to record only variable-width bytes, and say that all
other calculations can be performed by readers based on type and number of values.
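As a rough sketch of what "performed by readers" could look like: the Python below derives an unencoded size from the physical type and the non-null value count, and falls back to a recorded byte count only for BYTE_ARRAY. The function name, its parameters, and the `variable_width_bytes` input are hypothetical and not part of the proposed Thrift change. It also assumes the recorded count excludes the 4-byte length prefix that PLAIN encoding adds per BYTE_ARRAY value, which this thread does not settle.

```python
# Plain-encoded width in bytes of the fixed-width Parquet physical types.
FIXED_WIDTH_BYTES = {
    "INT32": 4,
    "INT64": 8,
    "INT96": 12,
    "FLOAT": 4,
    "DOUBLE": 8,
}


def estimate_unencoded_bytes(physical_type, num_non_null_values,
                             variable_width_bytes=None, type_length=None):
    """Estimate the plain-encoded size of the non-null values in a column.

    Hypothetical helper: only BYTE_ARRAY needs a statistic from the file;
    every other type is derived from the type and the value count.
    """
    if physical_type == "BYTE_ARRAY":
        if variable_width_bytes is None:
            raise ValueError("BYTE_ARRAY needs the recorded variable-width byte count")
        # Assumption: the recorded count covers the value bytes only.
        return variable_width_bytes
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        if type_length is None:
            raise ValueError("FIXED_LEN_BYTE_ARRAY needs type_length from the schema")
        return num_non_null_values * type_length
    if physical_type == "BOOLEAN":
        # PLAIN encoding bit-packs booleans; round up to whole bytes.
        return (num_non_null_values + 7) // 8
    return num_non_null_values * FIXED_WIDTH_BYTES[physical_type]
```

For example, `estimate_unencoded_bytes("INT64", 1000)` needs nothing beyond the schema and the value count, while a string column would pass the recorded byte count through `variable_width_bytes`.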
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
>