This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git


The following commit(s) were added to refs/heads/main by this push:
     new 0af9ff5998 Improve sqllogictest speed by creating only a single large file rather than 2 (#20586)
0af9ff5998 is described below

commit 0af9ff59986bee835915c818c7733599a1e52ba6
Author: Tim-53 <[email protected]>
AuthorDate: Mon Mar 2 17:15:28 2026 +0100

    Improve sqllogictest speed by creating only a single large file rather than 2 (#20586)
    
    Draft as it builds on https://github.com/apache/datafusion/pull/20576
    
    
    ## Which issue does this PR close?
    - Part of https://github.com/apache/datafusion/issues/20524
    - Follow-on to https://github.com/apache/datafusion/pull/20576 from @alamb
    
    ## Rationale for this change
    Execution time of the test is dominated by the time spent writing the
    parquet files. By reusing a single file we gain roughly a 30%
    improvement in execution time.
    
    ## What changes are included in this PR?
    
    Building on #20576, we reuse the parquet file the test needs instead
    of recreating it.
    
    ## Are these changes tested?
    Ran the test with the following results:
    
    | Metric | Baseline (2 files) | Optimized (1 file) |
    |---|---|---|
    | Min | 33.000s | 22.653s |
    | Max | 37.662s | 25.489s |
    | Avg | 34.427s | 24.092s |
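
    As a sanity check, the averages in the table above confirm the ~30%
    claim (a quick standalone calculation, not part of the PR):

```python
# Quick check of the claimed ~30% speedup using the benchmark
# averages reported in the table above.
baseline_avg = 34.427   # seconds, two-file version
optimized_avg = 24.092  # seconds, single-file version

improvement = (baseline_avg - optimized_avg) / baseline_avg
print(f"{improvement:.1%}")  # → 30.0%
```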
    
    One open question: does the correctness of this regression test rely on
    having two **physically separate** files? The race condition in #17197
    was in the execution layer, where both scans would still be independent
    `DataSourceExec` nodes with independent readers, so I believe the
    behavior is preserved. But if there is any concern, we could use `system
    cp` to copy the file and register two physical files while still only
    paying the `generate_series` cost once.
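
    If two physically separate files did turn out to matter, the `system cp`
    alternative mentioned above could look roughly like this (a hypothetical
    sketch, not tested; it assumes the sqllogictest harness supports a
    `system` directive, and the paths mirror the ones in this test):

```
# Write the data once, paying the generate_series cost a single time
query I
COPY (select i as k, i as v from generate_series(1, 10000000) as t(i))
TO 'test_files/scratch/push_down_filter_regression/t2.parquet'
STORED AS PARQUET;
----
10000000

# Hypothetical: duplicate the file at the OS level instead of regenerating it
system cp test_files/scratch/push_down_filter_regression/t2.parquet test_files/scratch/push_down_filter_regression/t1.parquet

statement ok
create external table t1 stored as parquet location 'test_files/scratch/push_down_filter_regression/t1.parquet';

statement ok
create external table t2 stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
```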
    
    ## Are there any user-facing changes?
---
 .../sqllogictest/test_files/push_down_filter_regression.slt   | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/datafusion/sqllogictest/test_files/push_down_filter_regression.slt b/datafusion/sqllogictest/test_files/push_down_filter_regression.slt
index ca4a30fa96..8459fcc682 100644
--- a/datafusion/sqllogictest/test_files/push_down_filter_regression.slt
+++ b/datafusion/sqllogictest/test_files/push_down_filter_regression.slt
@@ -18,13 +18,6 @@
 # Test push down filter
 
 # Regression test for https://github.com/apache/datafusion/issues/17188
-query I
-COPY (select i as k from generate_series(1, 10000000) as t(i))
-TO 'test_files/scratch/push_down_filter_regression/t1.parquet'
-STORED AS PARQUET;
-----
-10000000
-
 query I
 COPY (select i as k, i as v from generate_series(1, 10000000) as t(i))
 TO 'test_files/scratch/push_down_filter_regression/t2.parquet'
@@ -33,10 +26,10 @@ STORED AS PARQUET;
 10000000
 
 statement ok
-create external table t1 stored as parquet location 'test_files/scratch/push_down_filter_regression/t1.parquet';
+create external table t2 stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
 
 statement ok
-create external table t2 stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
+create external table t1 (k bigint not null) stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
 
 # The failure before https://github.com/apache/datafusion/pull/17197 was non-deterministic and random
 # So we'll run the same query a couple of times just to have more certainty it's fixed


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
