Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]

2025-06-25 Thread via GitHub


adriangb commented on PR #16424:
URL: https://github.com/apache/datafusion/pull/16424#issuecomment-3006625465

   @xudong963 @alamb I've re-organized this to incorporate 
https://github.com/apache/datafusion/pull/16549.
   
   Sadly I did not catch in that PR that we were putting everything in `lib.rs` 
which I felt now is too bloated if I put `FilePruner` in there. So I moved 
`PruningPredicate` & co to `pruning_predicate.rs` - hence the huge diff line 
count.
   
   I'll also point out that this now only does the extra work if it has either 
a dynamic filter OR the file has statistics already collected.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]

2025-06-25 Thread via GitHub


adriangb commented on PR #16424:
URL: https://github.com/apache/datafusion/pull/16424#issuecomment-3005207099

   @alamb I added test assertions to confirm the stats are working correctly 
which addresses https://github.com/apache/datafusion/issues/16402


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]

2025-06-24 Thread via GitHub


alamb commented on PR #16424:
URL: https://github.com/apache/datafusion/pull/16424#issuecomment-3001652418

   Also thank you @xudong963 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]

2025-06-24 Thread via GitHub


alamb commented on code in PR #16424:
URL: https://github.com/apache/datafusion/pull/16424#discussion_r2164757786


##
datafusion/datasource-parquet/src/opener.rs:
##
@@ -524,6 +512,91 @@ fn should_enable_page_index(
 .unwrap_or(false)
 }
 
+/// Prune based on partition values and file-level statistics.
+pub struct FilePruner {

Review Comment:
   - filed https://github.com/apache/datafusion/issues/16542



##
datafusion/physical-expr-common/src/physical_expr.rs:
##
@@ -345,6 +347,24 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + 
DynEq + DynHash {
 // This is a safe default behavior.
 Ok(None)
 }
+
+/// Returns the generation of this `PhysicalExpr` for snapshotting 
purposes.

Review Comment:
   I like this design. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]

2025-06-23 Thread via GitHub


adriangb commented on code in PR #16424:
URL: https://github.com/apache/datafusion/pull/16424#discussion_r2161929117


##
datafusion/core/tests/parquet/row_group_pruning.rs:
##
@@ -421,7 +421,7 @@ macro_rules! int_tests {
 .with_query(&format!("SELECT * FROM t where i{} in (100)", 
$bits))
 .with_expected_errors(Some(0))
 .with_matched_by_stats(Some(0))
-.with_pruned_by_stats(Some(0))
+.with_pruned_by_stats(Some(4))

Review Comment:
   This is basically reverting the changes made in 
https://github.com/apache/datafusion/pull/16447 because we are now not doing 
file level pruning if there are only static filters.
   
   @alamb maybe this is an argument to still do the file level pruning even if 
there are no dynamic filters. I guess this pruning is _not_ happening anywhere 
else. One could argue it should happen right after statistics are collected, in 
the TableProvider, but that doesn't seem to be the case and maybe it's better 
to just let it happen in `ParquetOpener` (or we could move this pruning up to 
the `DataSourceExec` or something so it works for all sources) so that 
TableProvider complexity is kept to a minimum.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]

2025-06-19 Thread via GitHub


adriangb commented on code in PR #16424:
URL: https://github.com/apache/datafusion/pull/16424#discussion_r2157631212


##
datafusion/datasource-parquet/src/opener.rs:
##
@@ -524,6 +498,99 @@ fn should_enable_page_index(
 .unwrap_or(false)
 }
 
+/// Prune based on partition values and file-level statistics.
+pub struct FilePruner {
+predicate_generation: u64,
+predicate: Arc,
+/// Schema used for pruning, which combines the file schema and partition 
fields.
+/// Partition fields are always at the end, as they are during scans.
+pruning_schema: Arc,
+file: PartitionedFile,
+partition_fields: Vec,
+predicate_creation_errors: Count,
+}
+
+impl FilePruner {
+pub fn new(
+predicate: Arc,
+logical_file_schema: &SchemaRef,
+partition_fields: Vec,
+file: PartitionedFile,
+predicate_creation_errors: Count,
+) -> Result {
+// Build a pruning schema that combines the file fields and partition 
fields.
+// Partition fileds are always at the end.
+let pruning_schema = Arc::new(
+Schema::new(
+logical_file_schema
+.fields()
+.iter()
+.cloned()
+.chain(partition_fields.iter().cloned())
+.collect_vec(),
+)
+.with_metadata(logical_file_schema.metadata().clone()),
+);
+Ok(Self {
+// Initialize the predicate generation to 0 so that the first time 
we call `should_prune` we actually check the predicate

Review Comment:
   ```suggestion
   // Initialize the predicate generation to 0 so that the first 
time we call `should_prune` we actually check the predicate
   // This also means that no pruning will happen unless there is a 
dynamic filter present.
   // See [`snapshot_generation`] for more info.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]

2025-06-19 Thread via GitHub


adriangb commented on code in PR #16424:
URL: https://github.com/apache/datafusion/pull/16424#discussion_r2157629376


##
datafusion/datasource-parquet/src/opener.rs:
##
@@ -524,6 +512,91 @@ fn should_enable_page_index(
 .unwrap_or(false)
 }
 
+/// Prune based on partition values and file-level statistics.
+pub struct FilePruner {

Review Comment:
   Shall I make a `datafusion-pruning` crate? I guess it will have all of the 
same deps as `datasource-parquet` sans the parquet specific bits.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]

2025-06-19 Thread via GitHub


adriangb commented on PR #16424:
URL: https://github.com/apache/datafusion/pull/16424#issuecomment-2989096200

   @alamb I reverted the filtering during the stream so this should now do 
strictly less work 😄 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]