Re: [PR] WIP Prototype DataPage extraction API [datafusion]

via GitHub Sun, 09 Jun 2024 21:25:22 -0700


marvinlanhenke commented on code in PR #10843:
URL: https://github.com/apache/datafusion/pull/10843#discussion_r1632556874



##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -766,10 +769,108 @@ impl<'a> StatisticsConverter<'a> {
         Ok(Arc::new(UInt64Array::from_iter(null_counts)))
     }
 
-    /// Returns a null array of data_type with one element per row group
-    fn make_null_array<I>(&self, data_type: &DataType, metadatas: I) -> ArrayRef
+    /// Extract the minimum values from Data page statistics
+    ///
+    /// In Parquet files, in addition to the Column Chunk level statistics
+    /// (stored for each column for each row group) there are also optional
+    /// statistics stored for each data page, part of the [Parquet Page Index].
+    /// Since a single Column Chunk is stored as one or more pages, page level statistics
+    /// can prune at a finer granularity.
+    ///
+    /// However since they are stored in a separate metadata structure
+    /// ([`Index`]) there is different code to extract them as arrow statistics
+    ///
+    /// Parameters:
+    ///
+    /// * `page_index`: The parquet page index, likely read from
+    /// [`ParquetMetadata::page_index()`]
+    ///
+    /// * row_group_indexes: The indexes of the row groups (indexes in
+    /// `page_index`) to extract the statistics from. This is an interator over `&usize` to
+    /// permit passing in  `&Vec<usize>` or similar
+    ///
+    /// # Return Value
+    ///
+    /// The returned array contains 1 value for each `NativeIndex` in the underlying
+    /// `Index`es, in the same order as they appear in `metadatas`.
+    ///
+    /// For example, if there are two `Index`es in `metadatas`:
+    /// 1. the first having `3` `PageIndex` entries
+    /// 2. the second having `2` `PageIndex` entries
+    ///
+    /// The returned array would have 5 rows
+    ///
+    /// Each value is either
+    /// * the minimum value for the page
+    /// * a null value, if the statistics can not be extracted
+    ///
+    /// Note that a null value does NOT mean the min value was actually
+    /// `null` it means it the requested statistic is unknown
+    ///
+    /// # Errors
+    ///
+    /// Reasons for not being able to extract the statistics include:
+    /// * the column is not present in the parquet file
+    /// * statistics for the pages are not present in the row group
+    /// * the stored statistic value can not be converted to the requested type
+    ///
+    /// # Example
+    /// ```no_run
+    /// tood
+    /// ```
+    pub fn data_page_mins<I>(
+        &self,
+        page_index: &ParquetColumnIndex,
+        row_group_indexes: I,

Review Comment:
   My guess is that one reason we want to pass in `row_group_indexes` is that the page filter iterates over the row group indexes from the access_plan [here](https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L168).
   
   We cannot assume we need all indices, since access_plan [filters](https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/access_plan.rs#L299-L307) them based on `should_scan()`.
   Is this correct? If it is, then this was the missing piece in my prototype.
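   
   To make the question concrete, here is a minimal sketch of the flow as I understand it. Everything in it is a simplified stand-in, not the real DataFusion or parquet API: `RowGroupAccess` only has `Skip` and `Scan`, the page index is modelled as one `Vec<Option<i32>>` of page minimums per row group, and `data_page_mins` is a toy free function rather than the method in this PR.

```rust
/// Hypothetical, simplified stand-in for the per-row-group decision
/// kept by the access plan (assumption: only two variants).
#[derive(Debug, Clone, Copy, PartialEq)]
enum RowGroupAccess {
    Skip,
    Scan,
}

impl RowGroupAccess {
    /// True if the row group still has to be read.
    fn should_scan(&self) -> bool {
        matches!(self, RowGroupAccess::Scan)
    }
}

/// Indexes of the row groups that survive the access plan,
/// i.e. those for which `should_scan()` returns true.
fn row_group_indexes(plan: &[RowGroupAccess]) -> Vec<usize> {
    plan.iter()
        .enumerate()
        .filter_map(|(idx, access)| access.should_scan().then_some(idx))
        .collect()
}

/// Toy version of `data_page_mins`: `page_index[rg]` holds one optional
/// minimum per data page of row group `rg` (assumed layout, for illustration).
/// Only the row groups named in `row_group_indexes` contribute rows.
fn data_page_mins<'a, I>(
    page_index: &[Vec<Option<i32>>],
    row_group_indexes: I,
) -> Vec<Option<i32>>
where
    I: IntoIterator<Item = &'a usize>,
{
    row_group_indexes
        .into_iter()
        .flat_map(|&rg| page_index[rg].iter().copied())
        .collect()
}
```

   With three row groups where the middle one is skipped, passing `&row_group_indexes(&plan)` produces one value per page of the first and third row groups only, so the output lines up with the pages that will actually be scanned.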


