Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
theirix commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3633724002 Closing in favour of #17843 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
theirix closed pull request #16325: Support data source sampling with TABLESAMPLE URL: https://github.com/apache/datafusion/pull/16325 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
killme2008 commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3413687171 Great work! May I ask what's the status of this PR? Looking forward to using it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
theirix commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3415252972 > Great work! May I ask what's the status of this PR? Looking forward to using it. Thank you! The more modern and extensible approach is proposed in my PR here https://github.com/apache/datafusion/pull/17633 , and it should be a way to go with table sampling support -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
theirix commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3311180917 Unstale for a reference in the improved PR (sampling via extension) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
alamb commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3196876280 > We can consider implementing a `LIMIT` either for the approach in this PR, or for a possible physical plan operator. I could work on an operator implementation if this approach doesn't look flexible enough for maintainers. @2010YOUY01, @alamb , what do you think? Sorry for my late reply -- I wrote up some thoughts here: - https://github.com/apache/datafusion/issues/13563#issuecomment-3196874378 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
theirix commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3185015782 Thank you so much, @aditanase! Your `LIMIT` idea looks useful and should bring performance improvements. There are numerous advanced use cases and possible data source-level optimisations for table sampling. To fully support them, the executor should be provided with all the information about sampling. This PR focuses on a simple filter propagation. As @milenkovicm suggested, introducing a special operator is more flexible and preserves sampling semantics, which could help implement such features. We can consider implementing a `LIMIT` either for the approach in this PR, or for a possible physical plan operator. I could work on an operator implementation if this approach doesn't look flexible enough for maintainers. @2010YOUY01, @alamb , what do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
aditanase commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3184147237 @theirix many Thanks for this PR! The `SYSTEM` sampling strategy is really useful for large tables, and I've seen more interesting variants on this in the ray project, where shuffling input data is a key requirement for training robust models: https://docs.ray.io/en/latest/data/shuffling-data.html At our company we're heavy users of delta table format and we're using `delta-rs` to consume them. In that scenario, you can do file pruning (partitions, statistics) and limit pushdowns from the metadata layer, which happens as part of predicate pushdown and limit pushdown. I've recently authored a PR that enables LIMIT pushdown for partition predicates for delta tables: https://github.com/delta-io/delta-rs/pull/3436/commits/15f2ade11c8627173bfca6568c9f3f6f2dd6c619 Have you considered this use case? Wondering what we would need to do at this stage to pass in enough information in the datasource to turn the SYSTEM hint in a metadata operation and keep other optimizations alive. In the delta-rs case we could simply randomize the order of active files before pruning them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
theirix commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3137673146 > @2010YOUY01 I'd like to double-check if a volatile filter pushdown to a Parquet executor is expected. In the mentioned PR, I disabled optimisation in a logical plan optimiser to push down volatile predicates. But it seems like the physical optimiser still pushes this predicate to an executor. While it helps us with automatic sampling, the results could be wrong. How do you think – should we implement a similar mechanism to make volatile predicates as unsupported filters? This problem has been recently fixed in #16545 - I've added a slt testcase here to verify the correct expected behaviour. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
milenkovicm commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3021791270 I wonder would creating new physical plan operator to do per batch sampling avoid issues @theirix mentioned. Something similar to https://github.com/milenkovicm/ballista_extensions/blob/master/src/physical/sample_exec.rs ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
theirix commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3003725980 > I'd like to double-check if a volatile filter pushdown to a Parquet executor is expected. In the mentioned PR, I disabled optimisation in a logical plan optimiser to push down volatile predicates. But it seems like the physical optimiser still pushes this predicate to an executor. While it helps us with automatic sampling, the results could be wrong. How do you think – should we implement a similar mechanism to make volatile predicates as unsupported filters? Moved to a new discussion issue #16545, since it's not strictly related to this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
theirix commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-2985522134
> According to PostgreSQL's reference:
https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation#SYSTEM_Option I
believe `SYSTEM` option is equivalent to keep the entire `RecordBatch`
according to the specified probability, this rewrite rule implemented here is
sampling row by row, which follows the behavior of `BERNOULLI` option. Since df
has vectorized execution, evaluation a `random() < x` filter should be
efficient, I think we can apply this implementation on both `SYSTEM` and
`BERNOULLI` option to keep it simple.
@2010YOUY01 I'd like to double-check if a volatile filter pushdown to a
Parquet executor is expected. In the mentioned PR, I disabled optimisation in a
logical plan optimiser to push down volatile predicates. But it seems like the
physical optimiser still pushes this predicate to an executor. While it helps
us with automatic sampling, the results could be wrong. How do you think –
should we implement a similar mechanism to make volatile predicates as
unsupported filters?
Before:
```
[2025-06-18T18:20:07Z TRACE datafusion::physical_planner] Optimized physical
plan by LimitedDistinctAggregation:
OutputRequirementExec
ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
FilterExec: random() < 0.1
DataSourceExec: file_groups={1 group: [[sample.parquet]]},
file_type=parquet
```
After:
```
[2025-06-18T18:20:07Z TRACE datafusion::physical_planner] Optimized physical
plan by FilterPushdown:
OutputRequirementExec
ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
DataSourceExec: file_groups={1 group: [[sample.parquet]]},
file_type=parquet, predicate=random() < 0.1
```
Data:
set datafusion.execution.parquet.pushdown_filters=true;
create external table data stored as parquet location 'sample.parquet';
SELECT count(*) FROM data WHERE random() < 0.1;
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
theirix commented on code in PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#discussion_r2146926102
##
datafusion/sql/tests/sql_integration.rs:
##
@@ -4714,3 +4714,115 @@ fn test_using_join_wildcard_schema() {
]
);
}
+
+#[test]
Review Comment:
Reworked tests in slt files, covered extra cases.
I have included explanations in the code regarding the handling of system vs
row sampling.
The remaining part is handling joins (subqueries work fine, verified with
integration tests). I put a TODO about it to address in a future PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
theirix commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-2962290508 Thank you for the review and suggestions! I'll rework the testing approach and get back with the improved version. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
xudong963 commented on code in PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#discussion_r2135409445
##
datafusion/sql/src/select.rs:
##
@@ -77,11 +82,29 @@ impl SqlToRel<'_, S> {
}
// Process `from` clause
-let plan = self.plan_from_tables(select.from, planner_context)?;
+let plan = self.plan_from_tables(select.from.clone(),
planner_context)?;
let empty_from = matches!(plan, LogicalPlan::EmptyRelation(_));
// Process `where` clause
-let base_plan = self.plan_selection(select.selection, plan,
planner_context)?;
+let mut base_plan =
Review Comment:
"logical optimizer rule" mainly focuses on **Optimization**, I think it's
fair to rewrite during planning phase.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
2010YOUY01 commented on code in PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#discussion_r2135088072
##
datafusion/sql/src/select.rs:
##
@@ -77,11 +82,29 @@ impl SqlToRel<'_, S> {
}
// Process `from` clause
-let plan = self.plan_from_tables(select.from, planner_context)?;
+let plan = self.plan_from_tables(select.from.clone(),
planner_context)?;
let empty_from = matches!(plan, LogicalPlan::EmptyRelation(_));
// Process `where` clause
-let base_plan = self.plan_selection(select.selection, plan,
planner_context)?;
+let mut base_plan =
Review Comment:
I feel we'd better do this rewrite in a separate logical optimizer rule, to
keep the planning code clean. It can be done with a follow-up PR before adding
more functionality to scan sampling.
(Unless there's a specific reason to do this during the planning phase — I
did notice some rewrites happening during planning, but I'm not sure why.)
##
datafusion/sql/tests/sql_integration.rs:
##
@@ -4714,3 +4714,115 @@ fn test_using_join_wildcard_schema() {
]
);
}
+
+#[test]
Review Comment:
Regarding test structure, I suggest:
1. Move all of the `sql_integration` tests to `sqllogictest`, since `.slt`s
are easier to maintain.
To only show logical plans, you can use
```
set datafusion.explain.logical_plan_only = true;
# sqllogictest tests
# cleanup
set datafusion.explain.logical_plan_only = false;
```
2. Create a separate `.slt` file for all tests related to `TABLESAMPLE`
To improve test coverage, I recommend to add the following test cases
1. Select from multiple table, and test only some of table with sample / all
of the tables have sample.
2. Test all sample methods in
https://github.com/apache/datafusion-sqlparser-rs/blob/84c3a1b325c39c879b68ab712e3b9b3e3e40ed56/src/ast/query.rs#L1475
and expect error for unimplemented ones.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]
2010YOUY01 commented on code in PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#discussion_r2135088072
##
datafusion/sql/src/select.rs:
##
@@ -77,11 +82,29 @@ impl SqlToRel<'_, S> {
}
// Process `from` clause
-let plan = self.plan_from_tables(select.from, planner_context)?;
+let plan = self.plan_from_tables(select.from.clone(),
planner_context)?;
let empty_from = matches!(plan, LogicalPlan::EmptyRelation(_));
// Process `where` clause
-let base_plan = self.plan_selection(select.selection, plan,
planner_context)?;
+let mut base_plan =
Review Comment:
I feel we'd better do this rewrite in a separate logical optimizer rule, to
keep the planning code clean. It can be done with a follow-up PR before adding
more functionality to scan sampling.
(Unless there's a specific reason to do this during the planning phase — I
did notice some rewrites happening there, but I'm not sure why.)
##
datafusion/sql/tests/sql_integration.rs:
##
@@ -4714,3 +4714,115 @@ fn test_using_join_wildcard_schema() {
]
);
}
+
+#[test]
Review Comment:
Regarding test structure, I suggest:
1. Move all of the `sql_integration` tests to `sqllogictest`, since `.slt`s
are easier to maintain.
To only show logical plans, you can use
```
set datafusion.explain.logical_plan_only = true;
# sqllogictest tests
# cleanup
set datafusion.explain.logical_plan_only = true;
```
2. Create a separate `.slt` file for all tests related to `TABLESAMPLE`
To improve test coverage, I recommend to add the following test cases
1. Select from multiple table, and test only some of table with sample / all
of the tables have sample.
2. Test all sample methods in
https://github.com/apache/datafusion-sqlparser-rs/blob/84c3a1b325c39c879b68ab712e3b9b3e3e40ed56/src/ast/query.rs#L1475
and expect error for unimplemented ones.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
