Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
adriangbot commented on PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#issuecomment-4269011242

🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21674#issuecomment-4268806819)

**Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)

CPU Details (lscpu)
```
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: ARM
Model name: Neoverse-V2
Model: 1
Thread(s) per core: 1
Core(s) per cluster: 16
Socket(s): -
Cluster(s): 1
Stepping: r0p1
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache: 1 MiB (16 instances)
L1i cache: 1 MiB (16 instances)
L2 cache: 32 MiB (16 instances)
L3 cache: 80 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsa: Not affected
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Not affected
```

```
Comparing HEAD and feat_sort-pushdown-inexact-benchmark
Benchmark sort_pushdown_inexact_overlap.json
┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ HEAD                           ┃ feat_sort-pushdown-inexact-benchmark ┃ Change        ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1    │ 4.89 / 5.96 ±0.89 / 7.55 ms    │ 4.63 / 5.61 ±1.02 / 7.44 ms          │ +1.06x faster │
│ Q2    │ 4.99 / 5.38 ±0.28 / 5.67 ms    │ 4.95 / 5.28 ±0.26 / 5.66 ms          │ no change     │
│ Q3    │ 13.87 / 14.66 ±0.61 / 15.73 ms │ 14.43 / 14.93 ±0.42 / 15.62 ms       │ no change     │
│ Q4    │ 14.49 / 14.62 ±0.11 / 14.80 ms │ 14.64 / 14.68 ±0.05 / 14.77 ms       │ no change     │
└───────┴────────────────────────────────┴──────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Benchmark Summary                                   ┃         ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ Total Time (HEAD)                                   │ 40.62ms │
│ Total Time (feat_sort-pushdown-inexact-benchmark)   │ 40.50ms │
│ Average Time (HEAD)                                 │ 10.16ms │
│ Average Time (feat_sort-pushdown-inexact-benchmark) │ 10.12ms │
│ Queries Faster                                      │ 1       │
│ Queries Slower                                      │ 0       │
│ Queries with No Change                              │ 3       │
│ Queries with Failure                                │ 0       │
└─────────────────────────────────────────────────────┴─────────┘
```

Resource Usage

**sort_pushdown_inexact_overlap: base (merge-base)**

| Metric | Value |
|--------|-------|
| Wall time | 0.4s |
| Peak memory | 4.7 GiB |
| Avg memory | 4.7 GiB |
| CPU user | 1.8s |
| CPU sys | 0.2s |
| Peak spill | 0 B |

**sort_pushdown_inexact_overlap: branch**

| Metric | Value |
|--------|-------|
| Wal
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
adriangbot commented on PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#issuecomment-4268812748

🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21674#issuecomment-4268806819)

**Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4268806819-1422-9cxfj 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux`

CPU Details (lscpu)
```
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: ARM
Model name: Neoverse-V2
Model: 1
Thread(s) per core: 1
Core(s) per cluster: 16
Socket(s): -
Cluster(s): 1
Stepping: r0p1
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache: 1 MiB (16 instances)
L1i cache: 1 MiB (16 instances)
L2 cache: 32 MiB (16 instances)
L3 cache: 80 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsa: Not affected
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Not affected
```

Comparing feat/sort-pushdown-inexact-benchmark (f1215b9659947b3f1579b592dbff82e747fb3638) to 7bfa3fb (merge-base) [diff](https://github.com/apache/datafusion/compare/7bfa3fbdf8e0eedfacb844048b3ee764155a1670..f1215b9659947b3f1579b592dbff82e747fb3638) using: sort_pushdown_inexact_overlap

Results will be posted here when complete

---
[File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
adriangb commented on PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#issuecomment-4268806819
run benchmark sort_pushdown_inexact_overlap
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
adriangb merged PR #21674:
URL: https://github.com/apache/datafusion/pull/21674
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
adriangbot commented on PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#issuecomment-4268410780

Benchmark for [this request](https://github.com/apache/datafusion/pull/21674#issuecomment-4268402463) failed.

Last 20 lines of output:

```
struct_query_sql
substr
substr_index
substring
sum
to_char
to_hex
to_local_time
to_time
to_timestamp
topk_aggregate
topk_repartition
translate
trim
trunc
unhex
upper
uuid
window_query_sql
with_hashes
```

---
[File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
adriangbot commented on PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#issuecomment-4268410080

🤖 Criterion benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21674#issuecomment-4268402463)

**Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4268402463-1421-h7zl5 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux`

CPU Details (lscpu)
```
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: ARM
Model name: Neoverse-V2
Model: 1
Thread(s) per core: 1
Core(s) per cluster: 16
Socket(s): -
Cluster(s): 1
Stepping: r0p1
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache: 1 MiB (16 instances)
L1i cache: 1 MiB (16 instances)
L2 cache: 32 MiB (16 instances)
L3 cache: 80 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsa: Not affected
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Not affected
```

Comparing feat/sort-pushdown-inexact-benchmark (f1215b9659947b3f1579b592dbff82e747fb3638) to 7bfa3fb (merge-base) [diff](https://github.com/apache/datafusion/compare/7bfa3fbdf8e0eedfacb844048b3ee764155a1670..f1215b9659947b3f1579b592dbff82e747fb3638)

BENCH_NAME=sort_pushdown_inexact_overlap
BENCH_COMMAND=cargo bench --features=parquet --bench sort_pushdown_inexact_overlap
BENCH_FILTER=

Results will be posted here when complete

---
[File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
adriangb commented on PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#issuecomment-4268402463
run benchmark sort_pushdown_inexact_overlap
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3098830995
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
Review Comment:
Added in the latest push: `sort_pushdown_inexact_overlap` generates a file
with partially overlapping RGs (±2500 orderkey jitter between adjacent 100K-row
chunks, simulating streaming data with network delays). It has 4 DESC LIMIT
queries to match your use case.
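
To make the overlap concrete, here is a toy model in shell. It is not the generator from the PR: it assumes one row per orderkey, that chunk `i` nominally covers keys `[i*W, (i+1)*W - 1]`, and that jitter lets each end drift by up to `J` keys. The names `W`, `J`, `chunk_min`, `chunk_max`, and `overlap_with_next` are made up for this sketch.

```shell
# Toy model (assumption): one row per orderkey, worst-case jitter at both ends.
W=100000   # rows per chunk, taken from the review comment
J=2500     # orderkey jitter, taken from the review comment

# Smallest key chunk $1 may contain (clamped at 0).
chunk_min() {
  lo=$(( $1 * W - J ))
  [ "$lo" -lt 0 ] && lo=0
  echo "$lo"
}

# Largest key chunk $1 may contain.
chunk_max() {
  echo $(( ($1 + 1) * W - 1 + J ))
}

# Number of keys shared by the ranges of chunk $1 and chunk $1+1.
overlap_with_next() {
  n=$(( $(chunk_max "$1") - $(chunk_min $(( $1 + 1 ))) + 1 ))
  [ "$n" -lt 0 ] && n=0
  echo "$n"
}

for i in 0 1 2; do
  echo "chunk ${i}: min=$(chunk_min "$i") max=$(chunk_max "$i") overlap_with_next=$(overlap_with_next "$i")"
done
```

Under these assumptions each adjacent pair of chunks can share up to `2 * J` keys, so their min/max ranges overlap and the row groups can no longer be treated as disjoint.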
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
adriangb commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3094162913
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
Review Comment:
Amazing thanks so much
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3094084842
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
Review Comment:
Will add it in this PR. I'll create a third data variant with partially
overlapping RGs (simulating streaming data with network delays) alongside the
current shuffled data.
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
adriangb commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3094036956
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
Review Comment:
Did you plan to include it in this PR or a followup?
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093910770
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+CREATE EXTERNAL TABLE src
+STORED AS PARQUET
+LOCATION '${SRC_DIR}';
+
+COPY (
+SELECT * FROM src
+ORDER BY
+(l_orderkey * 1664525 + 1013904223) % 64,
+l_orderkey
+)
+TO '${INEXACT_DIR}/shuffled.parquet'
+STORED AS PARQUET
+OPTIONS ('format.max_row_group_size' '10');
+")
+
+echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+echo "Running sort pushdown Inexact benchmark (row group reorder by statistics)..."
+debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${INEXACT_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
Review Comment:
`--sorted` adds `WITH ORDER (l_orderkey ASC)`, which is needed to trigger the
reverse scan path in `try_pushdown_sort`, currently the only path where
`reorder_by_statistics` is called.
After reviewing this more carefully, I realized we need two benchmark suites
to cover the full optimization path:
1. `sort_pushdown_inexact` (with `--sorted`, DESC queries): tests the
reverse scan path where RG reorder is already supported
2. `sort_pushdown_inexact_unsorted` (without `--sorted`, ASC+DESC queries):
tests the Unsupported path where RG reorder will be supported in a follow-up PR
(#21580)
Updated in the latest push. This way each follow-up PR can run its
corresponding benchmark to show the improvement.
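
The `ORDER BY` key used by the data generator quoted above can be tried outside SQL. This is a minimal sketch in shell arithmetic, assuming non-negative orderkeys that fit in 64-bit integers; the helper name `bucket_of` is made up here and is not part of `bench.sh`.

```shell
# Bucket expression copied from the ORDER BY in data_sort_pushdown_inexact.
# $1 = l_orderkey (assumed non-negative).
bucket_of() {
  echo $(( ($1 * 1664525 + 1013904223) % 64 ))
}

# Show where a few sample keys land.
for key in 1 2 3 4 65; do
  echo "l_orderkey=${key} bucket=$(bucket_of "$key")"
done
```

Because 64 is a power of two and the multiplier is odd, the bucket depends only on `l_orderkey % 64`, so keys 1 and 65 land in the same bucket.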
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093910770
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+CREATE EXTERNAL TABLE src
+STORED AS PARQUET
+LOCATION '${SRC_DIR}';
+
+COPY (
+SELECT * FROM src
+ORDER BY
+(l_orderkey * 1664525 + 1013904223) % 64,
+l_orderkey
+)
+TO '${INEXACT_DIR}/shuffled.parquet'
+STORED AS PARQUET
+OPTIONS ('format.max_row_group_size' '10');
+")
+
+echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+echo "Running sort pushdown Inexact benchmark (row group reorder by statistics)..."
+debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${INEXACT_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
Review Comment:
`--sorted` adds `WITH ORDER (l_orderkey ASC)`, which is needed to trigger the
reverse scan path in `try_pushdown_sort`, currently the only path where
`reorder_by_statistics` is called.
After reviewing this more carefully, I realized we need two benchmark suites
to cover the full optimization path:
1. `sort_pushdown_inexact` (with `--sorted`, DESC queries): tests the
reverse scan path where RG reorder is already supported (#21580)
2. `sort_pushdown_inexact_unsorted` (without `--sorted`, ASC+DESC queries):
tests the Unsupported path where RG reorder will be supported in a follow-up PR
Updated in the latest push. This way each follow-up PR can run its
corresponding benchmark to show the improvement.
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093847933
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file, shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+CREATE EXTERNAL TABLE src
+STORED AS PARQUET
+LOCATION '${SRC_DIR}';
+
+COPY (
+SELECT * FROM src
+ORDER BY
+(l_orderkey * 1664525 + 1013904223) % 64,
+l_orderkey
+)
+TO '${INEXACT_DIR}/shuffled.parquet'
+STORED AS PARQUET
+OPTIONS ('format.max_row_group_size' '10');
+")
+
+echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+echo "Running sort pushdown Inexact benchmark (row group reorder by statistics)..."
+debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${INEXACT_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
Review Comment:
`--sorted` adds `WITH ORDER (l_orderkey ASC)` to the table, which is needed
to trigger the reverse scan path in `try_pushdown_sort`, currently the only
path where `reorder_by_statistics` is called. Without it, sort pushdown returns
Unsupported and reorder is not triggered.
I am planning follow-up PRs to:
1. Support exact reverse scan (eliminate SortExec entirely for DESC on
ASC-sorted files)
2. Enable RG reorder in the Unsupported path as well (no WITH ORDER needed)
Once those land, I will update this benchmark PR to cover all paths with
proper comparisons. Going to put this PR on hold for now.
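
For readers unfamiliar with `WITH ORDER`, the sketch below prints the kind of DDL the comment describes. It is an illustration only: how `dfbench` registers the table internally is not shown in this thread, and the table name `lineitem`, the example path, and the helper `make_ddl` are assumptions.

```shell
# Print a CREATE EXTERNAL TABLE statement, optionally declaring a sort order.
# $1 = directory holding the parquet data (hypothetical path in the call below)
# $2 = "sorted" to add the WITH ORDER clause, anything else to omit it
make_ddl() {
  echo "CREATE EXTERNAL TABLE lineitem"
  echo "STORED AS PARQUET"
  if [ "${2:-}" = "sorted" ]; then
    echo "WITH ORDER (l_orderkey ASC)"
  fi
  echo "LOCATION '$1';"
}

make_ddl "/tmp/sort_pushdown_inexact/lineitem" sorted
```

Per the comment above, it is the declared ascending order that lets a `DESC` query reach the reverse scan path; calling `make_ddl` without `sorted` corresponds to the case where sort pushdown returns Unsupported.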
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093849484
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path "${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet 2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file,
shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
Review Comment:
Great suggestion! Partially overlapping RGs from streaming data is a very
realistic scenario. I will add a benchmark variant for this pattern when I
update the PR, something like time-ordered chunks with small overlaps between
adjacent chunks to simulate network delays / time skew.
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093847933
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted
--iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path
"${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG}
${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet
2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file,
shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+CREATE EXTERNAL TABLE src
+STORED AS PARQUET
+LOCATION '${SRC_DIR}';
+
+COPY (
+SELECT * FROM src
+ORDER BY
+(l_orderkey * 1664525 + 1013904223) % 64,
+l_orderkey
+)
+TO '${INEXACT_DIR}/shuffled.parquet'
+STORED AS PARQUET
+OPTIONS ('format.max_row_group_size' '10');
+")
+
+echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+echo "Running sort pushdown Inexact benchmark (row group reorder by
statistics)..."
+debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted
--iterations 5 --path "${INEXACT_DIR}" --queries-path
"${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG}
${LATENCY_ARG}
Review Comment:
`--sorted` adds `WITH ORDER (l_orderkey ASC)` to the table, which is needed
to trigger the reverse scan path in `try_pushdown_sort`, currently the only
path where `reorder_by_statistics` is called. Without it, sort pushdown returns
Unsupported and reorder is not triggered.
I am planning follow-up PRs to:
1. Support exact reverse scan (eliminate SortExec entirely for DESC on
ASC-sorted files)
2. Enable RG reorder in the Unsupported path as well (no WITH ORDER needed)
Once those land, I will update this benchmark PR to cover all paths with
proper comparisons. Going to put this PR on hold for now.
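For reference, the bucket expression from the COPY statement in the quoted diff can be evaluated by hand. The sketch below copies the expression into a standalone Rust function. The `bucket` name is made up for illustration, and it assumes non-negative `l_orderkey` values small enough that the multiplication does not overflow an `i64`.

```rust
/// Bucket expression copied from the ORDER BY clause in the diff:
/// (l_orderkey * 1664525 + 1013904223) % 64
/// Assumes a non-negative key and no i64 overflow.
fn bucket(l_orderkey: i64) -> i64 {
    (l_orderkey * 1664525 + 1013904223) % 64
}

fn main() {
    // Print the bucket for a few sample keys.
    for key in [1_i64, 2, 3, 4, 65] {
        println!("l_orderkey {key} -> bucket {}", bucket(key));
    }
}
```

Because the multiplier 1664525 is odd, the expression maps each residue of `l_orderkey` modulo 64 to a distinct bucket, so keys that are exactly 64 apart (1 and 65 in this example) share a bucket.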
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
adriangb commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093307815
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted
--iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path
"${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG}
${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet
2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file,
shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+CREATE EXTERNAL TABLE src
+STORED AS PARQUET
+LOCATION '${SRC_DIR}';
+
+COPY (
+SELECT * FROM src
+ORDER BY
+(l_orderkey * 1664525 + 1013904223) % 64,
+l_orderkey
+)
+TO '${INEXACT_DIR}/shuffled.parquet'
+STORED AS PARQUET
+OPTIONS ('format.max_row_group_size' '10');
+")
+
+echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+echo "Running sort pushdown Inexact benchmark (row group reorder by
statistics)..."
+debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted
--iterations 5 --path "${INEXACT_DIR}" --queries-path
"${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG}
${LATENCY_ARG}
Review Comment:
I'm not sure what the `--sorted` flag is supposed to do but worth checking
@zhuqi-lucas
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,62 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted
--iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path
"${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG}
${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet
2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file,
shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
Review Comment:
I think another interesting benchmark, which is our case at least, is when
there is overlap between the row groups but still some general ordering. E.g.:
```
rg,min,max
0,0,5
1,4,7
2,5,32
```
In our case, this happens because there's a stream of data coming in
with timestamps. It should arrive at around the same time it was created,
but there are always some network delays, time skew, etc., so the ordering is
not perfect. But data that arrived 30 minutes later is guaranteed to have
timestamps in a different range than data that arrived 30 minutes before it.
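A minimal sketch of how such statistics could be inspected, using the three row groups from the table above. This is not DataFusion's `reorder_by_statistics`. The `RgStats` struct and both helper functions are hypothetical names introduced for illustration. The sketch orders row groups by their minimum value and counts how many neighboring pairs still overlap after that ordering.

```rust
/// Min/max statistics for one row group (hypothetical struct for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
struct RgStats {
    rg: usize,
    min: i64,
    max: i64,
}

/// Return row-group ids ordered by ascending min value.
fn order_by_min(stats: &[RgStats]) -> Vec<usize> {
    let mut sorted = stats.to_vec();
    sorted.sort_by_key(|s| s.min);
    sorted.iter().map(|s| s.rg).collect()
}

/// Count neighboring pairs (after ordering by min) whose ranges overlap.
fn overlapping_neighbors(stats: &[RgStats]) -> usize {
    let mut sorted = stats.to_vec();
    sorted.sort_by_key(|s| s.min);
    sorted.windows(2).filter(|w| w[1].min <= w[0].max).count()
}

fn main() {
    // The three row groups from the table, deliberately listed out of order.
    let stats = [
        RgStats { rg: 2, min: 5, max: 32 },
        RgStats { rg: 0, min: 0, max: 5 },
        RgStats { rg: 1, min: 4, max: 7 },
    ];
    println!("scan order: {:?}", order_by_min(&stats));
    println!("overlapping neighbor pairs: {}", overlapping_neighbors(&stats));
}
```

With overlapping neighbors, concatenating the row groups in min order does not produce a fully sorted stream, so a sort or TopK step is still required. The ordering only makes the data closer to sorted.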
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
zhuqi-lucas commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093277104
##
benchmarks/src/sort_pushdown.rs:
##
@@ -159,7 +159,11 @@ impl RunOpt {
async fn benchmark_query(&self, query_id: usize) ->
Result> {
let sql = self.load_query(query_id)?;
-let config = self.common.config()?;
+let mut config = self.common.config()?;
+// Enable filter pushdown so TopK's dynamic filter is pushed to the
+// parquet reader for late materialization - only sort-column rows
+// pass the filter, non-sort columns are skipped for filtered rows.
+config.options_mut().execution.parquet.pushdown_filters = true;
Review Comment:
Good catch @adriangb! Fixed: reverted the sort_pushdown.rs change and
moved pushdown_filters to an env var only for the inexact benchmark run. Other
sort_pushdown benchmarks are unaffected.
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
adriangb commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093233786
##
benchmarks/src/sort_pushdown.rs:
##
@@ -159,7 +159,11 @@ impl RunOpt {
async fn benchmark_query(&self, query_id: usize) ->
Result> {
let sql = self.load_query(query_id)?;
-let config = self.common.config()?;
+let mut config = self.common.config()?;
+// Enable filter pushdown so TopK's dynamic filter is pushed to the
+// parquet reader for late materialization - only sort-column rows
+// pass the filter, non-sort columns are skipped for filtered rows.
+config.options_mut().execution.parquet.pushdown_filters = true;
Review Comment:
Does this enable it for all benchmarks? Generally we control this on a
per-run basis.
Re: [PR] feat: add sort_pushdown_inexact benchmark for RG reorder [datafusion]
Copilot commented on code in PR #21674:
URL: https://github.com/apache/datafusion/pull/21674#discussion_r3093205802
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted
--iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path
"${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG}
${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet
2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file,
shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
Review Comment:
This uses `cargo run --release -p datafusion-cli` directly, which bypasses
the script's `CARGO_COMMAND` override (often used to add
flags/features/profiles) and also skips the `debug_run` wrapper used elsewhere
for reproducibility. Consider invoking datafusion-cli via the same configurable
mechanism (e.g., reusing `CARGO_COMMAND` or building the CLI once and executing
the binary) so benchmark data generation behaves consistently across
environments.
```suggestion
(cd "${SCRIPT_DIR}/.." && debug_run ${CARGO_COMMAND} -p datafusion-cli
-- -c "
```
##
benchmarks/bench.sh:
##
@@ -1137,6 +1144,59 @@ run_sort_pushdown_sorted() {
debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted
--iterations 5 --path "${SORT_PUSHDOWN_DIR}" --queries-path
"${SCRIPT_DIR}/queries/sort_pushdown" -o "${RESULTS_FILE}" ${QUERY_ARG}
${LATENCY_ARG}
}
+# Generates data for sort pushdown Inexact benchmark.
+#
+# Produces a single large lineitem parquet file where row groups have
+# NON-OVERLAPPING but OUT-OF-ORDER l_orderkey ranges (each RG internally
+# sorted, RGs shuffled). This simulates append-heavy workloads where data
+# is written in batches at different times.
+data_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact/lineitem"
+if [ -d "${INEXACT_DIR}" ] && [ "$(ls -A ${INEXACT_DIR}/*.parquet
2>/dev/null)" ]; then
+echo "Sort pushdown Inexact data already exists at ${INEXACT_DIR}"
+return
+fi
+
+echo "Generating sort pushdown Inexact benchmark data (single file,
shuffled RGs)..."
+
+# Re-use the sort_pushdown data as the source (generate if missing)
+data_sort_pushdown
+
+mkdir -p "${INEXACT_DIR}"
+SRC_DIR="${DATA_DIR}/sort_pushdown/lineitem"
+
+# Use datafusion-cli to bucket rows into 64 groups by a deterministic
+# scrambler, then sort within each bucket by orderkey. This produces
+# ~64 RG-sized segments where each has a tight orderkey range but the
+# segments appear in scrambled (non-sorted) order in the file.
+(cd "${SCRIPT_DIR}/.." && cargo run --release -p datafusion-cli -- -c "
+CREATE EXTERNAL TABLE src
+STORED AS PARQUET
+LOCATION '${SRC_DIR}';
+
+COPY (
+SELECT * FROM src
+ORDER BY
+(l_orderkey * 1664525 + 1013904223) % 64,
+l_orderkey
+)
+TO '${INEXACT_DIR}/shuffled.parquet'
+STORED AS PARQUET
+OPTIONS ('format.max_row_group_size' '10');
+")
+
+echo "Sort pushdown Inexact data generated at ${INEXACT_DIR}"
+ls -la "${INEXACT_DIR}"
+}
+
+# Runs the sort pushdown Inexact benchmark (tests RG reorder by statistics).
+run_sort_pushdown_inexact() {
+INEXACT_DIR="${DATA_DIR}/sort_pushdown_inexact"
+RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_inexact.json"
+echo "Running sort pushdown Inexact benchmark (row group reorder by
statistics)..."
+debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted
--iterations 5 --path "${INEXACT_DIR}" --queries-path
"${SCRIPT_DIR}/queries/sort_pushdown_inexact" -o "${RESULTS_FILE}" ${QUERY_ARG}
${LATENCY_ARG}
Review Comment:
`run_sort_pushdown_inexact` passes `--sorted`, which tells DataFusion
