Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

via GitHub Thu, 03 Apr 2025 21:43:27 -0700


adriangb commented on code in PR #15561:
URL: https://github.com/apache/datafusion/pull/15561#discussion_r2027668758



##########
datafusion/core/src/datasource/physical_plan/parquet.rs:
##########
@@ -1445,119 +1465,26 @@ mod tests {
         // batch1: c1(string)
         let batch1 = string_batch();
 
-        // c1 != 'bar'
-        let filter = col("c1").not_eq(lit("bar"));
+        // c1 == 'aaa', should prune via stats
+        let filter = col("c1").eq(lit("aaa"));
 
         let rt = RoundTrip::new()
             .with_predicate(filter)
             .with_pushdown_predicate()
             .round_trip(vec![batch1])
             .await;
 
-        // should have a pruning predicate
-        let pruning_predicate = rt.parquet_source.pruning_predicate();
-        assert!(pruning_predicate.is_some());
+        let explain = rt.explain.unwrap();
 
-        // convert to explain plan form
-        let display = displayable(rt.parquet_exec.as_ref())
-            .indent(true)
-            .to_string();
+        // check that there was a pruning predicate -> row groups got pruned
+        assert_contains!(&explain, "predicate=c1@0 = aaa");
 
-        assert_contains!(
-            &display,
-            "pruning_predicate=c1_null_count@2 != row_count@3 AND (c1_min@0 != 
bar OR bar != c1_max@1)"
-        );
-
-        assert_contains!(&display, r#"predicate=c1@0 != bar"#);
+        // there's a single row group, but we can check that it matched
+        // if no pruning was done this would be 0 instead of 1
+        assert_contains!(&explain, "row_groups_matched_statistics=1");
 
-        assert_contains!(&display, "projection=[c1]");
-    }
-
-    #[tokio::test]
-    async fn parquet_exec_display_deterministic() {

Review Comment:
   Right I guess the point is: now that they are no longer printed to the 
explain plan, is this test needed or can we bin it?



##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -295,3 +312,89 @@ fn create_initial_plan(
     // default to scanning all row groups
     Ok(ParquetAccessPlan::new_all(row_group_count))
 }
+
+/// Build a pruning predicate from an optional predicate expression.
+/// If the predicate is None or the predicate cannot be converted to a pruning
+/// predicate, return None.
+/// If there is an error creating the pruning predicate it is recorded by 
incrementing
+/// the `predicate_creation_errors` counter.
+pub fn build_pruning_predicate(

Review Comment:
   Not any more, good catch



##########
datafusion/sqllogictest/test_files/parquet.slt:
##########
@@ -625,7 +625,7 @@ physical_plan
 01)CoalesceBatchesExec: target_batch_size=8192
 02)--FilterExec: column1@0 LIKE f%
 03)----RepartitionExec: partitioning=RoundRobinBatch(2), input_partitions=1
-04)------DataSourceExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/foo.parquet]]},
 projection=[column1], file_type=parquet, predicate=column1@0 LIKE f%, 
pruning_predicate=column1_null_count@2 != row_count@3 AND column1_min@0 <= g 
AND f <= column1_max@1, required_guarantees=[]
+04)------DataSourceExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/foo.parquet]]},
 projection=[column1], file_type=parquet, predicate=column1@0 LIKE f%

Review Comment:
   Yes I think we can do that. I feared that it would be more confusing because 
the pruning predicate you see is not what you get in the end...
   
   Is there any way we can inject this information at runtime? Metrics already 
kind of do that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

Reply via email to