[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

via GitHub Thu, 07 Sep 2023 10:38:31 -0700


emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1318930925



##########
src/main/thrift/parquet.thrift:
##########
@@ -191,6 +191,73 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+  * A histogram of repetition and definition levels for either a page or column
+  * chunk.
+  *
+  * This is useful for:
+  *   1. Estimating the size of the data when materialized in memory
+  *
+  *   2. For filter push-down on nulls at various levels of nested
+  *   structures and list lengths.
+  */
+struct RepetitionDefinitionLevelHistogram {
+   /**
+    * When present, there is expected to be one element corresponding to each
+    * repetition (i.e. size=max repetition_level+1) where each element
+    * represents the number of times the repetition level was observed in the
+    * data.
+    *
+    * This field may be omitted if max_repetition_level is 0.
+    **/
+   1: optional list<i64> repetition_level_histogram;
+   /**
+    * Same as repetition_level_histogram except for definition levels.
+    *
+    * This field may be omitted if max_definition_level is 0 or 1.
+    **/
+   2: optional list<i64> definition_level_histogram;
+ }
+
+/**
+ * A structure for capturing metadata for estimating the unencoded,
+ * uncompressed size of data written. This is useful for readers to estimate
+ * how much memory is needed to reconstruct data in their memory model and for
+ * fine grained filter pushdown on nested structures (the histogram contained
+ * in this structure can help determine the number of nulls at a particular
+ * nesting level).
+ *
+ * Writers should populate all fields in this struct except for the exceptions
+ * listed per field.
+ */
+struct SizeStatistics {
+   /**
+    * The number of physical bytes stored for BYTE_ARRAY data values assuming
+    * no encoding. This is exclusive of the bytes needed to store the length of
+    * each byte array. In other words, this field is equivalent to the `(size
+    * of PLAIN-ENCODING the byte array values) - (4 bytes * number of values
+    * written)`. To determine unencoded sizes of other types readers can use
+    * schema information multiplied by the number of non-null and null values.
+    * The number of null/non-null values can be inferred from the histograms
+    * below.
+    *
+    * For example, if a column chunk is dictionary-encoded with dictionary
+    * ["a", "bc", "cde"], and a data page contains the indices [0, 0, 1, 2],
+    * then this value for that data page should be 7 (1 + 1 + 2 + 3).
+    *
+    * This field should only be set for types that use BYTE_ARRAY as their
+    * physical type.
+    */
+   1: optional i64 unencoded_byte_array_data_bytes;

Review Comment:
   I think the main point, touched on already by a few people but worth making
explicit, is that no reader of parquet data that I'm aware of keeps the
parquet encoded format entirely for its memory model.  Some may support a few
options here (e.g. as noted, arrow supports dictionary and run-end encoding),
but more generally there are many readers that transpose early to a row format
(avro or parquet), and in those contexts having a final memory estimate can
improve planning.  Even for readers that support dictionary encoding, it is
not guaranteed that all pages will be dictionary encoded (e.g. the dictionary
grows too large and there is a fallback to another encoding).  In these cases
being able to get a good estimate across all pages is useful.
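
   To make the quantity concrete, here is a minimal Python sketch of how the
proposed `unencoded_byte_array_data_bytes` value could be computed per page and
summed across a column chunk that mixes dictionary-encoded and fallback pages.
The helper names and the fallback page contents are hypothetical and only for
illustration; the dictionary and indices are the ones from the doc comment in
the diff.

```python
def unencoded_byte_array_data_bytes(values):
    """Sum of raw BYTE_ARRAY value lengths, excluding length prefixes and nulls."""
    return sum(len(v) for v in values if v is not None)


def plain_encoded_size(values):
    """Size of the same values under PLAIN encoding (4-byte length prefix each)."""
    non_null = [v for v in values if v is not None]
    return sum(len(v) for v in non_null) + 4 * len(non_null)


# Dictionary-encoded page from the doc comment: indices into the dictionary.
dictionary = [b"a", b"bc", b"cde"]
indices = [0, 0, 1, 2]
page_values = [dictionary[i] for i in indices]
page_bytes = unencoded_byte_array_data_bytes(page_values)  # 1 + 1 + 2 + 3

# Hypothetical later page written after a fallback from dictionary encoding.
fallback_page = [b"defg", b"hi"]
chunk_bytes = page_bytes + unencoded_byte_array_data_bytes(fallback_page)
```

   The per-page value does not depend on which encoding the page used, which
is what lets a reader add the values up across pages.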
   
   Given that there is no universal encoding, knowing an upper bound on total
size helps readers choose a batch size in the face of memory pressure by
calculating an estimated row size.
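
   As a rough sketch of that calculation (in Python, with hypothetical
function names and made-up numbers; a real reader's memory model would likely
add its own per-value overhead such as offsets or validity bits, which is not
modeled here):

```python
def estimate_row_bytes(unencoded_byte_array_bytes, fixed_width_bytes_per_row, num_rows):
    """Average unencoded bytes per row: variable-length data plus fixed-width columns."""
    if num_rows <= 0:
        raise ValueError("num_rows must be positive")
    return unencoded_byte_array_bytes / num_rows + fixed_width_bytes_per_row


def rows_per_batch(memory_budget_bytes, row_bytes):
    """Largest row count whose estimated size fits in the budget (at least 1)."""
    return max(1, int(memory_budget_bytes // row_bytes))


# Assumed inputs: 1M rows, 64 MB of byte array data, 16 bytes of fixed-width
# columns per row, and an 8 MiB budget per batch.
row_bytes = estimate_row_bytes(64_000_000, 16, 1_000_000)
batch_rows = rows_per_batch(8 * 1024 * 1024, row_bytes)
```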
   
   Even if there were a reader that kept all parquet encodings in memory
without any massaging, data is often transposed later in query processing to a
row-oriented format, which will often be "plain encoded", for example for
joins.  Having a good memory estimate here can help with join planning
(especially for distributed joins).
   
   
   
   



