Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-17 Thread via GitHub


adriangbot commented on PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#issuecomment-4269011242

   🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21674#issuecomment-4268806819)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:                            aarch64
   CPU op-mode(s):                          64-bit
   Byte Order:                              Little Endian
   CPU(s):                                  16
   On-line CPU(s) list:                     0-15
   Vendor ID:                               ARM
   Model name:                              Neoverse-V2
   Model:                                   1
   Thread(s) per core:                      1
   Core(s) per cluster:                     16
   Socket(s):                               -
   Cluster(s):                              1
   Stepping:                                r0p1
   BogoMIPS:                                2000.00
   Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
   L1d cache:                               1 MiB (16 instances)
   L1i cache:                               1 MiB (16 instances)
   L2 cache:                                32 MiB (16 instances)
   L3 cache:                                80 MiB (1 instance)
   NUMA node(s):                            1
   NUMA node0 CPU(s):                       0-15
   Vulnerability Gather data sampling:      Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit:             Not affected
   Vulnerability L1tf:                      Not affected
   Vulnerability Mds:                       Not affected
   Vulnerability Meltdown:                  Not affected
   Vulnerability Mmio stale data:           Not affected
   Vulnerability Reg file data sampling:    Not affected
   Vulnerability Retbleed:                  Not affected
   Vulnerability Spec rstack overflow:      Not affected
   Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
   Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
   Vulnerability Spectre v2:                Mitigation; CSV2, BHB
   Vulnerability Srbds:                     Not affected
   Vulnerability Tsa:                       Not affected
   Vulnerability Tsx async abort:           Not affected
   Vulnerability Vmscape:                   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and feat_sort-pushdown-inexact-benchmark

   Benchmark sort_pushdown_inexact_overlap.json

   ┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
   ┃ Query ┃                           HEAD ┃ feat_sort-pushdown-inexact-benchmark ┃        Change ┃
   ┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
   │ Q1    │    4.89 / 5.96 ±0.89 / 7.55 ms │          4.63 / 5.61 ±1.02 / 7.44 ms │ +1.06x faster │
   │ Q2    │    4.99 / 5.38 ±0.28 / 5.67 ms │          4.95 / 5.28 ±0.26 / 5.66 ms │     no change │
   │ Q3    │ 13.87 / 14.66 ±0.61 / 15.73 ms │       14.43 / 14.93 ±0.42 / 15.62 ms │     no change │
   │ Q4    │ 14.49 / 14.62 ±0.11 / 14.80 ms │       14.64 / 14.68 ±0.05 / 14.77 ms │     no change │
   └───────┴────────────────────────────────┴──────────────────────────────────────┴───────────────┘
   ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
   ┃ Benchmark Summary                                   ┃         ┃
   ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
   │ Total Time (HEAD)                                   │ 40.62ms │
   │ Total Time (feat_sort-pushdown-inexact-benchmark)   │ 40.50ms │
   │ Average Time (HEAD)                                 │ 10.16ms │
   │ Average Time (feat_sort-pushdown-inexact-benchmark) │ 10.12ms │
   │ Queries Faster                                      │       1 │
   │ Queries Slower                                      │       0 │
   │ Queries with No Change                              │       3 │
   │ Queries with Failure                                │       0 │
   └─────────────────────────────────────────────────────┴─────────┘
   ```
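
The summary figures can be cross-checked against the per-query rows. A minimal sketch, assuming the middle value of each `min / mean ±stddev / max` entry is the mean and that the Change column is the ratio of those means; the runner's exact formula is not shown in this thread, and `sum_ms` and `speedup` are hypothetical helper names:

```shell
#!/bin/sh
# Mean times (ms) for Q1..Q4, copied from the results table.
HEAD_MEANS="5.96 5.38 14.66 14.62"
BRANCH_MEANS="5.61 5.28 14.93 14.68"

# Sum a whitespace-separated list of numbers, printed with two decimals.
sum_ms() {
    echo "$1" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.2f\n", s }'
}

# Ratio baseline / candidate, printed with two decimals.
speedup() {
    awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.2f\n", base / cand }'
}

echo "HEAD total:   $(sum_ms "$HEAD_MEANS") ms"
echo "branch total: $(sum_ms "$BRANCH_MEANS") ms"
echo "Q1 speedup:   $(speedup 5.96 5.61)x"
```

Under these assumptions the sums reproduce the reported totals (40.62 ms and 40.50 ms) and the Q1 ratio rounds to the reported 1.06x.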
   
   
   
   
   Resource Usage
   
   **sort_pushdown_inexact_overlap - base (merge-base)**
   | Metric | Value |
   |---|---|
   | Wall time | 0.4s |
   | Peak memory | 4.7 GiB |
   | Avg memory | 4.7 GiB |
   | CPU user | 1.8s |
   | CPU sys | 0.2s |
   | Peak spill | 0 B |
   
   **sort_pushdown_inexact_overlap - branch**
   | Metric | Value |
   |---|---|
   | Wal

Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-17 Thread via GitHub


adriangbot commented on PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#issuecomment-4268812748

   🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21674#issuecomment-4268806819)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4268806819-1422-9cxfj 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 aarch64 GNU/Linux`
   
   
   
   Comparing feat/sort-pushdown-inexact-benchmark (f1215b9659947b3f1579b592dbff82e747fb3638) to 7bfa3fb (merge-base) [diff](https://github.com/apache/datafusion/compare/7bfa3fbdf8e0eedfacb844048b3ee764155a1670..f1215b9659947b3f1579b592dbff82e747fb3638) using: sort_pushdown_inexact_overlap
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-17 Thread via GitHub


adriangb commented on PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#issuecomment-4268806819

   run benchmark sort_pushdown_inexact_overlap





Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-17 Thread via GitHub


adriangb merged PR #21674:
URL: https://github.com/apache/datafusion/pull/21674





Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-17 Thread via GitHub


adriangbot commented on PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#issuecomment-4268410780

   Benchmark for [this request](https://github.com/apache/datafusion/pull/21674#issuecomment-4268402463) failed.
   
   Last 20 lines of output:
   
   ```
   struct_query_sql
   substr
   substr_index
   substring
   sum
   to_char
   to_hex
   to_local_time
   to_time
   to_timestamp
   topk_aggregate
   topk_repartition
   translate
   trim
   trunc
   unhex
   upper
   uuid
   window_query_sql
   with_hashes
   ```
   
   
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner





Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-17 Thread via GitHub


adriangbot commented on PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#issuecomment-4268410080

   🤖 Criterion benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21674#issuecomment-4268402463)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4268402463-1421-h7zl5 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 aarch64 GNU/Linux`
   
   
   
   Comparing feat/sort-pushdown-inexact-benchmark (f1215b9659947b3f1579b592dbff82e747fb3638) to 7bfa3fb (merge-base) [diff](https://github.com/apache/datafusion/compare/7bfa3fbdf8e0eedfacb844048b3ee764155a1670..f1215b9659947b3f1579b592dbff82e747fb3638)
   BENCH_NAME=sort_pushdown_inexact_overlap
   BENCH_COMMAND=cargo bench --features=parquet --bench sort_pushdown_inexact_overlap
   BENCH_FILTER=
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner





Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-17 Thread via GitHub


adriangb commented on PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#issuecomment-4268402463

   run benchmark sort_pushdown_inexact_overlap





Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-17 Thread via GitHub


zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3098830995


##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.

Review Comment:
   Added in the latest push: `sort_pushdown_inexact_overlap` generates a file with partially overlapping RGs (±2500 orderkey jitter between adjacent 100K-row chunks, simulating streaming data with network delays). It includes 4 DESC LIMIT queries to match your use case.
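
To make the overlap concrete, here is a toy model of the layout described above. It assumes one row per key, consecutive keys, and jitter that widens each 100K-row chunk's key range by up to 2500 on either side. The generator in the PR may work differently, and `chunk_range` and `overlap_with_next` are hypothetical helpers.

```shell
#!/bin/sh
CHUNK_ROWS=100000   # rows per chunk, from the comment above
JITTER=2500         # maximum key jitter, from the comment above

# Print "min max": the widest key range chunk $1 (0-based) can cover
# in this toy model.
chunk_range() {
    lo=$(( $1 * CHUNK_ROWS - JITTER ))
    hi=$(( ($1 + 1) * CHUNK_ROWS - 1 + JITTER ))
    if [ "$lo" -lt 0 ]; then
        lo=0
    fi
    echo "$lo $hi"
}

# Number of keys shared by the ranges of chunk $1 and chunk $1 + 1.
overlap_with_next() {
    # After set: $1 $2 = this chunk's min max, $3 $4 = next chunk's min max.
    set -- $(chunk_range "$1") $(chunk_range $(( $1 + 1 )))
    echo $(( $2 - $3 + 1 ))
}

chunk_range 0
chunk_range 1
overlap_with_next 0
```

In this model adjacent chunks can share up to 5000 keys, so min/max statistics alone cannot fully order neighbouring row groups.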






Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


adriangb commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3094162913


##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.

Review Comment:
   Amazing thanks so much






Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3094084842


##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.

Review Comment:
   Will add it in this PR. I'll create a third data variant with partially overlapping RGs (simulating streaming data with network delays) alongside the current shuffled data.
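
For reference, the "deterministic scrambler" in the quoted hunk is the expression `(l_orderkey * 1664525 + 1013904223) % 64`. The sketch below evaluates it in shell arithmetic for a few keys; `bucket` and `bucket_reduced` are hypothetical helper names. Because 1664525 mod 64 is 13 and 1013904223 mod 64 is 31, the bucket depends only on `l_orderkey % 64`, so keys that are 64 apart land in the same bucket.

```shell
#!/bin/sh
# Bucket assigned to an orderkey by the ORDER BY expression in the hunk.
bucket() {
    echo $(( ($1 * 1664525 + 1013904223) % 64 ))
}

# Equivalent reduced form: only the key's residue modulo 64 matters.
bucket_reduced() {
    echo $(( (13 * ($1 % 64) + 31) % 64 ))
}

for key in 1 2 3 65 66 67; do
    echo "key=$key bucket=$(bucket "$key") reduced=$(bucket_reduced "$key")"
done
```

How the resulting row-group key ranges relate to each other depends on the row-group size and the key distribution, which this sketch does not model.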






Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


adriangb commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3094036956


##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.

Review Comment:
   Did you plan to include it in this PR or a followup?






Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093910770


##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+CREATE EXTERNAL TABLE src
+STORED AS PARQUET
+LOCATION '${SRC_DIR}';
+
+COPY (
+SELECT * FROM src
+ORDER BY
+(l_orderkey * 1664525 + 1013904223) % 64,
+l_orderkey
+)
+TO '${INEXACT_DIR}/shuffled.parquet'
+STORED AS PARQUET
+OPTIONS ('format.max_row_group_size' '10');
+")
+
+echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+echo "Running sort pushdown Inexact benchmark (row group reorder by statistics)..."
+debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${INEXACT_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}

Review Comment:
   `--sorted` adds `WITH ORDER (l_orderkey ASC)`, which is needed to trigger the reverse scan path in `try_pushdown_sort`, currently the only path where `reorder_by_statistics` is called.
   
   After reviewing this more carefully, I realized we need two benchmark suites to cover the full optimization path:
   
   1. `sort_pushdown_inexact` (with `--sorted`, DESC queries): tests the reverse scan path where RG reorder is already supported
   2. `sort_pushdown_inexact_unsorted` (without `--sorted`, ASC+DESC queries): tests the Unsupported path where RG reorder will be supported in a follow-up PR (#21580)
   
   Updated in the latest push. This way each follow-up PR can run its corresponding benchmark to show the improvement.
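
As an illustration of what `--sorted` changes, the sketch below only builds and prints the kind of SQL involved; it does not run DataFusion. The table name, the location, and the `LIMIT 100` query are hypothetical, and the exact statements `dfbench` issues are not shown in this thread.

```shell
#!/bin/sh
# Build a CREATE EXTERNAL TABLE statement; pass "sorted" to declare the order.
# Table name and location are placeholders, not taken from the PR.
create_table_sql() {
    order_clause=""
    if [ "$1" = "sorted" ]; then
        order_clause=" WITH ORDER (l_orderkey ASC)"
    fi
    echo "CREATE EXTERNAL TABLE lineitem STORED AS PARQUET${order_clause} LOCATION '/path/to/lineitem';"
}

# A DESC LIMIT query of the kind the benchmark is described as running.
QUERY="SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 100;"

create_table_sql sorted
create_table_sql unsorted
echo "$QUERY"
```

The only difference between the two table definitions is the declared ordering, which per the comment above decides whether the reverse scan path is reached.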






Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093910770


##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+CREATE EXTERNAL TABLE src
+STORED AS PARQUET
+LOCATION '${SRC_DIR}';
+
+COPY (
+SELECT * FROM src
+ORDER BY
+(l_orderkey * 1664525 + 1013904223) % 64,
+l_orderkey
+)
+TO '${INEXACT_DIR}/shuffled.parquet'
+STORED AS PARQUET
+OPTIONS ('format.max_row_group_size' '10');
+")
+
+echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+echo "Running sort pushdown Inexact benchmark (row group reorder by statistics)..."
+debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${INEXACT_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}

Review Comment:
   `--sorted` adds `WITH ORDER (l_orderkey ASC)`, which is needed to trigger the reverse scan path in `try_pushdown_sort`, currently the only path where `reorder_by_statistics` is called.
   
   After reviewing this more carefully, I realized we need two benchmark suites to cover the full optimization path:
   
   1. `sort_pushdown_inexact` (with `--sorted`, DESC queries): tests the reverse scan path where RG reorder is already supported (#21580)
   2. `sort_pushdown_inexact_unsorted` (without `--sorted`, ASC+DESC queries): tests the Unsupported path where RG reorder will be supported in a follow-up PR
   
   Updated in the latest push. This way each follow-up PR can run its corresponding benchmark to show the improvement.






Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093847933


##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+CREATE EXTERNAL TABLE src
+STORED AS PARQUET
+LOCATION '${SRC_DIR}';
+
+COPY (
+SELECT * FROM src
+ORDER BY
+(l_orderkey * 1664525 + 1013904223) % 64,
+l_orderkey
+)
+TO '${INEXACT_DIR}/shuffled.parquet'
+STORED AS PARQUET
+OPTIONS ('format.max_row_group_size' '10');
+")
+
+echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+echo "Running sort pushdown Inexact benchmark (row group reorder by statistics)..."
+debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${INEXACT_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}

Review Comment:
   `--sorted` adds `WITH ORDER (l_orderkey ASC)` to the table, which is needed 
to trigger the reverse scan path in `try_pushdown_sort` β€” currently the only 
path where `reorder_by_statistics` is called. Without it, sort pushdown returns 
Unsupported and reorder is not triggered.
   
   I am planning follow-up PRs to:
   1. Support exact reverse scan (eliminate SortExec entirely for DESC on 
ASC-sorted files) 
   2. Enable RG reorder in the Unsupported path as well (no WITH ORDER needed)
   
   Once those land, I will update this benchmark PR to cover all paths with 
proper comparisons. Going to put this PR on hold for now.
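
The idea of reordering row groups by their statistics can be illustrated with a minimal, std-only Rust sketch. This is not the DataFusion implementation of `reorder_by_statistics`; the `RgStats` type and the `scan_order` function are hypothetical names introduced only for illustration, and the sketch assumes each row group exposes min/max values for the sort column.

```rust
/// Hypothetical per-row-group statistics for the sort column
/// (illustration only, not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
struct RgStats {
    index: usize, // position of the row group inside the file
    min: i64,
    max: i64,
}

/// Return row group indices in the order a scan could visit them so that
/// non-overlapping row groups come out in sorted order.
fn scan_order(stats: &[RgStats], descending: bool) -> Vec<usize> {
    let mut sorted: Vec<&RgStats> = stats.iter().collect();
    // Order by the minimum value, breaking ties on the maximum.
    sorted.sort_by_key(|rg| (rg.min, rg.max));
    if descending {
        // A DESC query visits the same row groups in the opposite order.
        sorted.reverse();
    }
    sorted.iter().map(|rg| rg.index).collect()
}

fn main() {
    // Non-overlapping ranges stored out of order, as in the benchmark data.
    let stats = vec![
        RgStats { index: 0, min: 200, max: 299 },
        RgStats { index: 1, min: 0, max: 99 },
        RgStats { index: 2, min: 100, max: 199 },
    ];
    println!("ASC  order: {:?}", scan_order(&stats, false));
    println!("DESC order: {:?}", scan_order(&stats, true));
}
```

With non-overlapping ranges, visiting row groups in this order lets a TopK operator see the smallest (or largest) keys first; rows inside each row group still need their own ordering guarantee, which is what `WITH ORDER` declares.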



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093849484


##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.

Review Comment:
   Great suggestion! Partially overlapping RGs from streaming data is a very 
realistic scenario. I will add a benchmark variant for this pattern when I 
update the PR β€” something like time-ordered chunks with small overlaps between 
adjacent chunks to simulate network delays / time skew.






Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093847933


##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+CREATE EXTERNAL TABLE src
+STORED AS PARQUET
+LOCATION '${SRC_DIR}';
+
+COPY (
+SELECT * FROM src
+ORDER BY
+(l_orderkey * 1664525 + 1013904223) % 64,
+l_orderkey
+)
+TO '${INEXACT_DIR}/shuffled.parquet'
+STORED AS PARQUET
+OPTIONS ('format.max_row_group_size' '10');
+")
+
+echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+echo "Running sort pushdown Inexact benchmark (row group reorder by statistics)..."
+debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${INEXACT_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}

Review Comment:
   `--sorted` adds `WITH ORDER (l_orderkey ASC)` to the table, which is needed 
to trigger the reverse scan path in `try_pushdown_sort`, currently the only 
path where `reorder_by_statistics` is called. Without it, sort pushdown returns 
Unsupported and reorder is not triggered.
   
   I am planning follow-up PRs to:
   1. Support exact reverse scan (eliminate SortExec entirely for DESC on 
ASC-sorted files) 
   2. Enable RG reorder in the Unsupported path as well (no WITH ORDER needed)
   
   Once those land, I will update this benchmark PR to cover all paths with 
proper comparisons. Going to put this PR on hold for now.






Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


adriangb commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093307815


##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+CREATE EXTERNAL TABLE src
+STORED AS PARQUET
+LOCATION '${SRC_DIR}';
+
+COPY (
+SELECT * FROM src
+ORDER BY
+(l_orderkey * 1664525 + 1013904223) % 64,
+l_orderkey
+)
+TO '${INEXACT_DIR}/shuffled.parquet'
+STORED AS PARQUET
+OPTIONS ('format.max_row_group_size' '10');
+")
+
+echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+echo "Running sort pushdown Inexact benchmark (row group reorder by statistics)..."
+debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${INEXACT_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}

Review Comment:
   I'm not sure what the `--sorted` flag is supposed to do but worth checking 
@zhuqi-lucas 



##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.

Review Comment:
   I think another interesting benchmark, which matches our case at least, is 
when the row groups overlap but still follow a general order. E.g.:
   
   ```
   rg,min,max
   0,0,5
   1,4,7
   2,5,32
   ```
   
   In our case, this happens because a stream of data comes in with 
timestamps. It should arrive at around the same time it was created, but there 
are always network delays, time skew, etc., so the ordering is not perfect. 
Still, data that arrived 30 minutes later is guaranteed to have timestamps in a 
different range than data that arrived 30 minutes before it.
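
As a rough illustration of the overlap pattern in the table above, the following std-only Rust sketch sorts row groups by their minimum and reports which neighbours overlap. The `RowGroupRange` type and `adjacent_overlaps` function are hypothetical names, not part of DataFusion, and the ranges are treated as inclusive, which is an assumption.

```rust
/// Hypothetical inclusive [min, max] range of the sort column for one row
/// group (illustration only).
#[derive(Debug, Clone, Copy)]
struct RowGroupRange {
    rg: usize,
    min: i64,
    max: i64,
}

/// Sort row groups by min value and return the pairs of neighbouring row
/// groups whose ranges overlap.
fn adjacent_overlaps(ranges: &[RowGroupRange]) -> Vec<(usize, usize)> {
    let mut sorted = ranges.to_vec();
    sorted.sort_by_key(|r| (r.min, r.max));
    sorted
        .windows(2)
        // Neighbours overlap when the later one starts at or before the
        // point where the earlier one ends.
        .filter(|w| w[1].min <= w[0].max)
        .map(|w| (w[0].rg, w[1].rg))
        .collect()
}

fn main() {
    // The rg,min,max rows from the example in the comment.
    let ranges = [
        RowGroupRange { rg: 0, min: 0, max: 5 },
        RowGroupRange { rg: 1, min: 4, max: 7 },
        RowGroupRange { rg: 2, min: 5, max: 32 },
    ];
    println!("overlapping neighbours: {:?}", adjacent_overlaps(&ranges));
}
```

A benchmark generator could use a check like this to confirm that the data it produced really has small overlaps between adjacent chunks rather than none at all.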




Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093277104


##
benchmarks/src/sort_pushdown.rs:
##
@@ -159,7 +159,11 @@ impl RunOpt {
 async fn benchmark_query(&self, query_id: usize) -> 
Result> {
 let sql = self.load_query(query_id)?;
 
-let config = self.common.config()?;
+let mut config = self.common.config()?;
+// Enable filter pushdown so TopK's dynamic filter is pushed to the
+// parquet reader for late materialization: only sort-column rows
+// pass the filter, non-sort columns are skipped for filtered rows.
+config.options_mut().execution.parquet.pushdown_filters = true;

Review Comment:
   Good catch @adriangb! Fixed: reverted the sort_pushdown.rs change and 
moved pushdown_filters to an env var only for the inexact benchmark run. Other 
sort_pushdown benchmarks are unaffected.
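
The exact variable used in the PR is not shown in this thread. As a hedged sketch, assuming the benchmark relies on DataFusion's convention of deriving environment variable names from config keys (upper-case the key and replace dots with underscores), the mapping looks like this in std-only Rust. The helper names `env_var_name` and `flag_enabled` are hypothetical.

```rust
use std::env;

/// Derive an environment variable name from a dotted config key by
/// upper-casing it and replacing dots with underscores (assumed convention).
fn env_var_name(config_key: &str) -> String {
    config_key.to_uppercase().replace('.', "_")
}

/// Read a boolean flag from the environment; unset or non-"true" values
/// count as disabled. Hypothetical helper for illustration only.
fn flag_enabled(config_key: &str) -> bool {
    env::var(env_var_name(config_key))
        .map(|v| v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}

fn main() {
    let key = "datafusion.execution.parquet.pushdown_filters";
    println!("{} enabled: {}", env_var_name(key), flag_enabled(key));
}
```

Setting the variable only in the shell function that runs the inexact benchmark keeps the option scoped to that run, which is the per-run control asked for in the review.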






Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093277104


##
benchmarks/src/sort_pushdown.rs:
##
@@ -159,7 +159,11 @@ impl RunOpt {
 async fn benchmark_query(&self, query_id: usize) -> 
Result> {
 let sql = self.load_query(query_id)?;
 
-let config = self.common.config()?;
+let mut config = self.common.config()?;
+// Enable filter pushdown so TopK's dynamic filter is pushed to the
+// parquet reader for late materialization: only sort-column rows
+// pass the filter, non-sort columns are skipped for filtered rows.
+config.options_mut().execution.parquet.pushdown_filters = true;

Review Comment:
   Good catch! Fixed: reverted the sort_pushdown.rs change and moved 
pushdown_filters to an env var only for the inexact benchmark run. Other 
sort_pushdown benchmarks are unaffected.






Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


adriangb commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093233786


##
benchmarks/src/sort_pushdown.rs:
##
@@ -159,7 +159,11 @@ impl RunOpt {
 async fn benchmark_query(&self, query_id: usize) -> 
Result> {
 let sql = self.load_query(query_id)?;
 
-let config = self.common.config()?;
+let mut config = self.common.config()?;
+// Enable filter pushdown so TopK's dynamic filter is pushed to the
+// parquet reader for late materialization: only sort-column rows
+// pass the filter, non-sort columns are skipped for filtered rows.
+config.options_mut().execution.parquet.pushdown_filters = true;

Review Comment:
   Does this enable it for all benchmarks? Generally we control this on a 
per-run basis.






Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]

2026-04-16 Thread via GitHub


Copilot commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093205802


##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "

Review Comment:
   This uses `cargo run --release -p datafusion-cli` directly, which bypasses 
the script's `CARGO_COMMAND` override (often used to add 
flags/features/profiles) and also skips the `debug_run` wrapper used elsewhere 
for reproducibility. Consider invoking datafusion-cli via the same configurable 
mechanism (e.g., reusing `CARGO_COMMAND` or building the CLI once and executing 
the binary) so benchmark data generation behaves consistently across 
environments.
   ```suggestion
   (cd "${SCRIPT_DIR}/.." && debug_run ${CARGO_COMMAND} -p datafusion-cli -- -c "
   ```



##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
 debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
 }
 
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+CREATE EXTERNAL TABLE src
+STORED AS PARQUET
+LOCATION '${SRC_DIR}';
+
+COPY (
+SELECT * FROM src
+ORDER BY
+(l_orderkey * 1664525 + 1013904223) % 64,
+l_orderkey
+)
+TO '${INEXACT_DIR}/shuffled.parquet'
+STORED AS PARQUET
+OPTIONS ('format.max_row_group_size' '10');
+")
+
+echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+echo "Running sort pushdown Inexact benchmark (row group reorder by statistics)..."
+debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${INEXACT_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}

Review Comment:
   `run_sort_pushdown_inexact` passes `--sorted`, which tells DataFusion