This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git
The following commit(s) were added to refs/heads/main by this push:
new 0af9ff5998 Improve sqllogictest speed by creating only a single large
file rather than 2 (#20586)
0af9ff5998 is described below
commit 0af9ff59986bee835915c818c7733599a1e52ba6
Author: Tim-53 <[email protected]>
AuthorDate: Mon Mar 2 17:15:28 2026 +0100
Improve sqllogictest speed by creating only a single large file rather
than 2 (#20586)
Draft as it builds on https://github.com/apache/datafusion/pull/20576
## Which issue does this PR close?
- Part of https://github.com/apache/datafusion/issues/20524
- Follow-on to https://github.com/apache/datafusion/pull/20576 from
@alamb
## Rationale for this change
Execution time of the test is dominated by writing the parquet files. By
reusing a single file we gain around a 30% improvement in execution time
here.
## What changes are included in this PR?
Building on #20576, we reuse the parquet file needed for the test
instead of recreating it.
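For clarity, the resulting setup in the .slt file (shown in full in the diff below) registers two external tables over the same physical parquet file, with `t1` declaring an explicit schema so it only exposes column `k`:

```sql
-- Generate one large parquet file once.
COPY (select i as k, i as v from generate_series(1, 10000000) as t(i))
TO 'test_files/scratch/push_down_filter_regression/t2.parquet'
STORED AS PARQUET;

-- t2 sees both columns of the shared file.
create external table t2 stored as parquet location
'test_files/scratch/push_down_filter_regression/t2.parquet';

-- t1 only needs column k, so it is declared with an explicit schema
-- over the same file.
create external table t1 (k bigint not null) stored as parquet location
'test_files/scratch/push_down_filter_regression/t2.parquet';
```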
## Are these changes tested?
Ran the test with the following results:
| | Baseline (2 files) | Optimized (1 file) |
|---|---|---|
| Min | 33.000s | 22.653s |
| Max | 37.662s | 25.489s |
| Avg | 34.427s | 24.092s |
One open question: does the correctness of this regression test rely on
having two **physically separate** files? The race condition in #17197
was in the execution layer — both scans would still be independent
`DataSourceExec` nodes with independent readers, so I believe the
behavior is preserved. But if there's any concern, we could use `system
cp` to copy the file and register two physical files while still only
paying the `generate_series` cost once.
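If two physically separate files turn out to matter, the fallback mentioned above could look roughly like this; a hypothetical sketch, assuming the sqllogictest runner's `system ok` shell record is available (syntax not verified against the runner):

```sql
-- Hypothetical fallback: pay the generate_series cost once, then copy
-- the file so each table scans a physically separate parquet file.
system ok
cp test_files/scratch/push_down_filter_regression/t2.parquet test_files/scratch/push_down_filter_regression/t1.parquet

statement ok
create external table t1 (k bigint not null) stored as parquet location
'test_files/scratch/push_down_filter_regression/t1.parquet';
```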
## Are there any user-facing changes?
---
.../sqllogictest/test_files/push_down_filter_regression.slt | 11 ++---------
1 file changed, 2 insertions(+), 9 deletions(-)
diff --git a/datafusion/sqllogictest/test_files/push_down_filter_regression.slt b/datafusion/sqllogictest/test_files/push_down_filter_regression.slt
index ca4a30fa96..8459fcc682 100644
--- a/datafusion/sqllogictest/test_files/push_down_filter_regression.slt
+++ b/datafusion/sqllogictest/test_files/push_down_filter_regression.slt
@@ -18,13 +18,6 @@
# Test push down filter
# Regression test for https://github.com/apache/datafusion/issues/17188
-query I
-COPY (select i as k from generate_series(1, 10000000) as t(i))
-TO 'test_files/scratch/push_down_filter_regression/t1.parquet'
-STORED AS PARQUET;
-----
-10000000
-
query I
COPY (select i as k, i as v from generate_series(1, 10000000) as t(i))
TO 'test_files/scratch/push_down_filter_regression/t2.parquet'
@@ -33,10 +26,10 @@ STORED AS PARQUET;
10000000
statement ok
-create external table t1 stored as parquet location 'test_files/scratch/push_down_filter_regression/t1.parquet';
+create external table t2 stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
statement ok
-create external table t2 stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
+create external table t1 (k bigint not null) stored as parquet location 'test_files/scratch/push_down_filter_regression/t2.parquet';
# The failure before https://github.com/apache/datafusion/pull/17197 was non-deterministic and random
# So we'll run the same query a couple of times just to have more certainty it's fixed
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]