wgtmac commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1151311588


##########
src/main/thrift/parquet.thrift:
##########
@@ -190,6 +190,41 @@ enum FieldRepetitionType {
   /** The field is repeated and can contain 0 or more values */
   REPEATED = 2;
 }
+/**
+ * A structure for capturing metadata for estimating the unencoded, uncompressed size
+ * of data.
+ *
+ * Writers should populate all fields in this struct except for the exceptions listed per field.
+ */ 
+struct SizeEstimationStatistics {
+   /** 
+    * The number of logical physical bytes stored for BYTE_ARRAY data values. Logical bytes refers to the number
+    * of bytes needed if no special encoding is used. This is exclusive of the bytes needed
+    * to store the length of each byte array. In other words, this field is equivalent to the (size of
+    * PLAIN-ENCODING the byte array values) - (4 bytes * number of values written). To determine logical sizes
+    * of other types readers can use schema information multiplied by the number of non-null values.
+    * The number of non-null values can be inferred from the histograms below.
+    *
+    * For example, if a column chunk is dictionary encoded with a dictionary ["a", "bc", "cde"] and a data page
+    * has indexes [0, 0, 1, 2], this value is expected to be 7 (1 + 1 + 2 + 3).
+    *
+    * This option should only be set for physical and logical types that would use BYTE_ARRAY when encoded with PLAIN encoding.
+    */
+   1: optional i64 logical_variable_width_stored_bytes;
+   /** 
+     * When present there is expected to be one element corresponding to each repetition level (i.e. size=max repetition_level+1)
+     * where each element represents the number of times the repetition level was observed in the data.
+     *
+     * This value is optional if max_repetition_level is 0.
+     */
+   2: optional list<i64> repetition_level_histogram;
+   /**
+    * Same as repetition_level_histogram except for definition levels.
+    *
+    * This value is optional when max_definition_level is 0. 
+    */ 
+   3: optional list<i64> definition_level_histogram;

Review Comment:
   I am thinking of supporting pushdown of filters like `IS_NULL` or 
`IS_NOT_NULL` to nested fields, so I want to make sure this can satisfy that 
use case. We may not need a precise null_count for each level, but it would be 
great to be able to answer yes or no to the filters above. 
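
As a rough illustration of the use case above, here is a minimal Python sketch of how a reader might answer such filters from `definition_level_histogram` alone. The function names and the `field_def_level` parameter are hypothetical and not part of the proposal. The sketch assumes that a field is defined exactly when an entry's definition level is at least the level at which that field becomes defined, and that any entry below that level counts as null for the filter (an engine could choose to treat a null or empty ancestor differently).

```python
from typing import List


def may_match_is_not_null(def_level_histogram: List[int], field_def_level: int) -> bool:
    """Return True if some entry may have the field defined (level >= field_def_level)."""
    return sum(def_level_histogram[field_def_level:]) > 0


def may_match_is_null(def_level_histogram: List[int], field_def_level: int) -> bool:
    """Return True if some entry may have the field undefined (level < field_def_level)."""
    return sum(def_level_histogram[:field_def_level]) > 0


# Hypothetical schema: optional group a { optional int32 b }, so max_definition_level = 2.
# Level 0: a is null; level 1: a is present but b is null; level 2: b is present.
histogram = [1, 2, 5]
print(may_match_is_null(histogram, 2))       # b IS NULL may match
print(may_match_is_not_null(histogram, 2))   # b IS NOT NULL may match
print(may_match_is_null([0, 0, 8], 2))       # every entry is at the max level
```

Under these assumptions only a yes/no answer is needed, so the exact per-level counts matter less than whether the relevant slice of the histogram is all zeros.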


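The dictionary example in the doc comment of `logical_variable_width_stored_bytes` can be checked with a short sketch as well. The helper names below are hypothetical and only mirror the field names from the diff; the level values passed to `level_histogram` are made up for illustration.

```python
from typing import List, Sequence


def logical_variable_width_stored_bytes(dictionary: Sequence[bytes], indexes: Sequence[int]) -> int:
    """Sum the byte lengths of the values the indexes refer to, excluding length prefixes."""
    return sum(len(dictionary[i]) for i in indexes)


def level_histogram(levels: Sequence[int], max_level: int) -> List[int]:
    """Count how often each level in 0..max_level occurs; the result has max_level + 1 elements."""
    histogram = [0] * (max_level + 1)
    for level in levels:
        histogram[level] += 1
    return histogram


dictionary = [b"a", b"bc", b"cde"]
indexes = [0, 0, 1, 2]
stored = logical_variable_width_stored_bytes(dictionary, indexes)
plain_size = stored + 4 * len(indexes)  # PLAIN adds a 4-byte length prefix per value
print(stored, plain_size)  # 7 23
print(level_histogram([0, 1, 1, 0, 1], max_level=1))  # [2, 3]
```
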

