Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]
adriangb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-3006625465 @xudong963 @alamb I've re-organized this to incorporate https://github.com/apache/datafusion/pull/16549. Sadly I did not catch in that PR that we were putting everything in `lib.rs` which I felt now is too bloated if I put `FilePruner` in there. So I moved `PruningPredicate` & co to `pruning_predicate.rs` - hence the huge diff line count. I'll also point out that this now only does the extra work if it has either a dynamic filter OR the file has statistics already collected. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]
adriangb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-3005207099 @alamb I added test assertions to confirm the stats are working correctly which addresses https://github.com/apache/datafusion/issues/16402 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]
alamb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-3001652418 Also thank you @xudong963 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]
alamb commented on code in PR #16424:
URL: https://github.com/apache/datafusion/pull/16424#discussion_r2164757786
##
datafusion/datasource-parquet/src/opener.rs:
##
@@ -524,6 +512,91 @@ fn should_enable_page_index(
.unwrap_or(false)
}
+/// Prune based on partition values and file-level statistics.
+pub struct FilePruner {
Review Comment:
- filed https://github.com/apache/datafusion/issues/16542
##
datafusion/physical-expr-common/src/physical_expr.rs:
##
@@ -345,6 +347,24 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug +
DynEq + DynHash {
// This is a safe default behavior.
Ok(None)
}
+
+/// Returns the generation of this `PhysicalExpr` for snapshotting
purposes.
Review Comment:
I like this design.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]
adriangb commented on code in PR #16424:
URL: https://github.com/apache/datafusion/pull/16424#discussion_r2161929117
##
datafusion/core/tests/parquet/row_group_pruning.rs:
##
@@ -421,7 +421,7 @@ macro_rules! int_tests {
.with_query(&format!("SELECT * FROM t where i{} in (100)",
$bits))
.with_expected_errors(Some(0))
.with_matched_by_stats(Some(0))
-.with_pruned_by_stats(Some(0))
+.with_pruned_by_stats(Some(4))
Review Comment:
This is basically reverting the changes made in
https://github.com/apache/datafusion/pull/16447 because we are now not doing
file level pruning if there are only static filters.
@alamb maybe this is an argument to still do the file level pruning even if
there are no dynamic filters. I guess this pruning is _not_ happening anywhere
else. One could argue it should happen right after statistics are collected, in
the TableProvider, but that doesn't seem to be the case and maybe it's better
to just let it happen in `ParquetOpener` (or we could move this pruning up to
the `DataSourceExec` or something so it works for all sources) so that
TableProvider complexity is kept to a minimum.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]
adriangb commented on code in PR #16424:
URL: https://github.com/apache/datafusion/pull/16424#discussion_r2157631212
##
datafusion/datasource-parquet/src/opener.rs:
##
@@ -524,6 +498,99 @@ fn should_enable_page_index(
.unwrap_or(false)
}
+/// Prune based on partition values and file-level statistics.
+pub struct FilePruner {
+predicate_generation: u64,
+predicate: Arc,
+/// Schema used for pruning, which combines the file schema and partition
fields.
+/// Partition fields are always at the end, as they are during scans.
+pruning_schema: Arc,
+file: PartitionedFile,
+partition_fields: Vec,
+predicate_creation_errors: Count,
+}
+
+impl FilePruner {
+pub fn new(
+predicate: Arc,
+logical_file_schema: &SchemaRef,
+partition_fields: Vec,
+file: PartitionedFile,
+predicate_creation_errors: Count,
+) -> Result {
+// Build a pruning schema that combines the file fields and partition
fields.
+// Partition fileds are always at the end.
+let pruning_schema = Arc::new(
+Schema::new(
+logical_file_schema
+.fields()
+.iter()
+.cloned()
+.chain(partition_fields.iter().cloned())
+.collect_vec(),
+)
+.with_metadata(logical_file_schema.metadata().clone()),
+);
+Ok(Self {
+// Initialize the predicate generation to 0 so that the first time
we call `should_prune` we actually check the predicate
Review Comment:
```suggestion
// Initialize the predicate generation to 0 so that the first
time we call `should_prune` we actually check the predicate
// This also means that no pruning will happen unless there is a
dynamic filter present.
// See [`snapshot_generation`] for more info.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]
adriangb commented on code in PR #16424:
URL: https://github.com/apache/datafusion/pull/16424#discussion_r2157629376
##
datafusion/datasource-parquet/src/opener.rs:
##
@@ -524,6 +512,91 @@ fn should_enable_page_index(
.unwrap_or(false)
}
+/// Prune based on partition values and file-level statistics.
+pub struct FilePruner {
Review Comment:
Shall I make a `datafusion-pruning` crate? I guess it will have all of the
same deps as `datasource-parquet` sans the parquet specific bits.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]
adriangb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-2989096200 @alamb I reverted the filtering during the stream so this should now do strictly less work 😄 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
