Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-12-09 Thread via GitHub


theirix commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3633724002

   Closing in favour of #17843


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-12-09 Thread via GitHub


theirix closed pull request #16325: Support data source sampling with 
TABLESAMPLE
URL: https://github.com/apache/datafusion/pull/16325


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-10-18 Thread via GitHub


killme2008 commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3413687171

   Great work! May I ask what's the status of this PR? Looking forward to using 
it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-10-17 Thread via GitHub


theirix commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3415252972

   > Great work! May I ask what's the status of this PR? Looking forward to 
using it.
   
   Thank you! The more modern and extensible approach is proposed in my PR here 
https://github.com/apache/datafusion/pull/17633 , and it should be a way to go 
with table sampling support


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-09-19 Thread via GitHub


theirix commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3311180917

   Unstale for a reference in the improved PR (sampling via extension)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-08-18 Thread via GitHub


alamb commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3196876280

   > We can consider implementing a `LIMIT` either for the approach in this PR, 
or for a possible physical plan operator. I could work on an operator 
implementation if this approach doesn't look flexible enough for maintainers. 
@2010YOUY01, @alamb , what do you think?
   
   Sorry for my late reply -- I wrote up some thoughts here:
   - https://github.com/apache/datafusion/issues/13563#issuecomment-3196874378


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-08-13 Thread via GitHub


theirix commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3185015782

   Thank you so much, @aditanase! Your `LIMIT` idea looks useful and should 
bring performance improvements.
   
   There are numerous advanced use cases and possible data source-level 
optimisations for table sampling. To fully support them, the executor should be 
provided with all the information about sampling. This PR focuses on a simple 
filter propagation. As @milenkovicm suggested, introducing a special operator 
is more flexible and preserves sampling semantics, which could help implement 
such features.
   
   We can consider implementing a `LIMIT` either for the approach in this PR, 
or for a possible physical plan operator. I could work on an operator 
implementation if this approach doesn't look flexible enough for maintainers. 
@2010YOUY01, @alamb , what do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-08-13 Thread via GitHub


aditanase commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3184147237

   @theirix many Thanks for this PR! The `SYSTEM` sampling strategy is really 
useful for large tables, and I've seen more interesting variants on this in the 
ray project, where shuffling input data is a key requirement for training 
robust models: 
   https://docs.ray.io/en/latest/data/shuffling-data.html
   
   At our company we're heavy users of delta table format and we're using 
`delta-rs` to consume them.
   In that scenario, you can do file pruning (partitions, statistics) and limit 
pushdowns from the metadata layer, which happens as part of predicate pushdown 
and limit pushdown.
   
   I've recently authored a PR that enables LIMIT pushdown for partition 
predicates for delta tables:
   
https://github.com/delta-io/delta-rs/pull/3436/commits/15f2ade11c8627173bfca6568c9f3f6f2dd6c619
   
   Have you considered this use case? Wondering what we would need to do at 
this stage to pass in enough information in the datasource to turn the SYSTEM 
hint in a metadata operation and keep other optimizations alive. In the 
delta-rs case we could simply randomize the order of active files before 
pruning them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-07-30 Thread via GitHub


theirix commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3137673146

   > @2010YOUY01 I'd like to double-check if a volatile filter pushdown to a 
Parquet executor is expected. In the mentioned PR, I disabled optimisation in a 
logical plan optimiser to push down volatile predicates. But it seems like the 
physical optimiser still pushes this predicate to an executor. While it helps 
us with automatic sampling, the results could be wrong. How do you think – 
should we implement a similar mechanism to make volatile predicates as 
unsupported filters?
   
   This problem has been recently fixed in #16545 - I've added a slt testcase 
here to verify the correct expected behaviour.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-06-30 Thread via GitHub


milenkovicm commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3021791270

   I wonder would creating new physical plan operator to do per batch sampling 
avoid issues @theirix mentioned. 
   
   Something similar to 
https://github.com/milenkovicm/ballista_extensions/blob/master/src/physical/sample_exec.rs
 ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-06-25 Thread via GitHub


theirix commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3003725980

   > I'd like to double-check if a volatile filter pushdown to a Parquet 
executor is expected. In the mentioned PR, I disabled optimisation in a logical 
plan optimiser to push down volatile predicates. But it seems like the physical 
optimiser still pushes this predicate to an executor. While it helps us with 
automatic sampling, the results could be wrong. How do you think – should we 
implement a similar mechanism to make volatile predicates as unsupported 
filters?
   
   Moved to a new discussion issue #16545, since it's not strictly related to 
this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-06-18 Thread via GitHub


theirix commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-2985522134

   > According to PostgreSQL's reference: 
https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation#SYSTEM_Option I 
believe `SYSTEM` option is equivalent to keep the entire `RecordBatch` 
according to the specified probability, this rewrite rule implemented here is 
sampling row by row, which follows the behavior of `BERNOULLI` option. Since df 
has vectorized execution, evaluation a `random() < x` filter should be 
efficient, I think we can apply this implementation on both `SYSTEM` and 
`BERNOULLI` option to keep it simple.
   
   @2010YOUY01 I'd like to double-check if a volatile filter pushdown to a 
Parquet executor is expected. In the mentioned PR, I disabled optimisation in a 
logical plan optimiser to push down volatile predicates. But it seems like the 
physical optimiser still pushes this predicate to an executor. While it helps 
us with automatic sampling, the results could be wrong. How do you think – 
should we implement a similar mechanism to make volatile predicates as 
unsupported filters?
   
   Before:
   ```
   [2025-06-18T18:20:07Z TRACE datafusion::physical_planner] Optimized physical 
plan by LimitedDistinctAggregation:
   OutputRequirementExec
 ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
   AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
 AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
   FilterExec: random() < 0.1
 DataSourceExec: file_groups={1 group: [[sample.parquet]]}, 
file_type=parquet
   ```
   
   After:
   ```
   [2025-06-18T18:20:07Z TRACE datafusion::physical_planner] Optimized physical 
plan by FilterPushdown:
   OutputRequirementExec
 ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
   AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
 AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
   DataSourceExec: file_groups={1 group: [[sample.parquet]]}, 
file_type=parquet, predicate=random() < 0.1
   ```
   
   Data:
   
   set datafusion.execution.parquet.pushdown_filters=true;
   create external table data stored as parquet location 'sample.parquet';
   SELECT count(*) FROM data WHERE random() < 0.1;
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-06-14 Thread via GitHub


theirix commented on code in PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#discussion_r2146926102


##
datafusion/sql/tests/sql_integration.rs:
##
@@ -4714,3 +4714,115 @@ fn test_using_join_wildcard_schema() {
 ]
 );
 }
+
+#[test]

Review Comment:
   Reworked tests in slt files, covered extra cases.
   
   I have included explanations in the code regarding the handling of system vs 
row sampling.
   
   The remaining part is handling joins (subqueries work fine, verified with 
integration tests). I put a TODO about it to address in a future PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-06-11 Thread via GitHub


theirix commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-2962290508

   Thank you for the review and suggestions! I'll rework the testing approach 
and get back with the improved version.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-06-09 Thread via GitHub


xudong963 commented on code in PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#discussion_r2135409445


##
datafusion/sql/src/select.rs:
##
@@ -77,11 +82,29 @@ impl SqlToRel<'_, S> {
 }
 
 // Process `from` clause
-let plan = self.plan_from_tables(select.from, planner_context)?;
+let plan = self.plan_from_tables(select.from.clone(), 
planner_context)?;
 let empty_from = matches!(plan, LogicalPlan::EmptyRelation(_));
 
 // Process `where` clause
-let base_plan = self.plan_selection(select.selection, plan, 
planner_context)?;
+let mut base_plan =

Review Comment:
   "logical optimizer rule" mainly focuses on **Optimization**, I think it's 
fair to rewrite during planning phase.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-06-08 Thread via GitHub


2010YOUY01 commented on code in PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#discussion_r2135088072


##
datafusion/sql/src/select.rs:
##
@@ -77,11 +82,29 @@ impl SqlToRel<'_, S> {
 }
 
 // Process `from` clause
-let plan = self.plan_from_tables(select.from, planner_context)?;
+let plan = self.plan_from_tables(select.from.clone(), 
planner_context)?;
 let empty_from = matches!(plan, LogicalPlan::EmptyRelation(_));
 
 // Process `where` clause
-let base_plan = self.plan_selection(select.selection, plan, 
planner_context)?;
+let mut base_plan =

Review Comment:
   I feel we'd better do this rewrite in a separate logical optimizer rule, to 
keep the planning code clean. It can be done with a follow-up PR before adding 
more functionality to scan sampling.
   (Unless there's a specific reason to do this during the planning phase — I 
did notice some rewrites happening during planning, but I'm not sure why.)



##
datafusion/sql/tests/sql_integration.rs:
##
@@ -4714,3 +4714,115 @@ fn test_using_join_wildcard_schema() {
 ]
 );
 }
+
+#[test]

Review Comment:
   Regarding test structure, I suggest:
   1. Move all of the `sql_integration` tests to `sqllogictest`, since `.slt`s 
are easier to maintain.
   To only show logical plans, you can use
   ```
   set datafusion.explain.logical_plan_only = true;
   
   # sqllogictest tests
   
   # cleanup
   set datafusion.explain.logical_plan_only = false;
   ```
   2. Create a separate `.slt` file for all tests related to `TABLESAMPLE`
   
   To improve test coverage, I recommend to add the following test cases
   1. Select from multiple table, and test only some of table with sample / all 
of the tables have sample.
   2. Test all sample methods in
   
https://github.com/apache/datafusion-sqlparser-rs/blob/84c3a1b325c39c879b68ab712e3b9b3e3e40ed56/src/ast/query.rs#L1475
   and expect error for unimplemented ones.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-06-08 Thread via GitHub


2010YOUY01 commented on code in PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#discussion_r2135088072


##
datafusion/sql/src/select.rs:
##
@@ -77,11 +82,29 @@ impl SqlToRel<'_, S> {
 }
 
 // Process `from` clause
-let plan = self.plan_from_tables(select.from, planner_context)?;
+let plan = self.plan_from_tables(select.from.clone(), 
planner_context)?;
 let empty_from = matches!(plan, LogicalPlan::EmptyRelation(_));
 
 // Process `where` clause
-let base_plan = self.plan_selection(select.selection, plan, 
planner_context)?;
+let mut base_plan =

Review Comment:
   I feel we'd better do this rewrite in a separate logical optimizer rule, to 
keep the planning code clean. It can be done with a follow-up PR before adding 
more functionality to scan sampling.
   (Unless there's a specific reason to do this during the planning phase — I 
did notice some rewrites happening there, but I'm not sure why.)



##
datafusion/sql/tests/sql_integration.rs:
##
@@ -4714,3 +4714,115 @@ fn test_using_join_wildcard_schema() {
 ]
 );
 }
+
+#[test]

Review Comment:
   Regarding test structure, I suggest:
   1. Move all of the `sql_integration` tests to `sqllogictest`, since `.slt`s 
are easier to maintain.
   To only show logical plans, you can use
   ```
   set datafusion.explain.logical_plan_only = true;
   
   # sqllogictest tests
   
   # cleanup
   set datafusion.explain.logical_plan_only = true;
   ```
   2. Create a separate `.slt` file for all tests related to `TABLESAMPLE`
   
   To improve test coverage, I recommend to add the following test cases
   1. Select from multiple table, and test only some of table with sample / all 
of the tables have sample.
   2. Test all sample methods in
   
https://github.com/apache/datafusion-sqlparser-rs/blob/84c3a1b325c39c879b68ab712e3b9b3e3e40ed56/src/ast/query.rs#L1475
   and expect error for unimplemented ones.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]