Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-05-12 Thread via GitHub


mbutrovich closed pull request #21629: perf: Coalesce batches before sorting in 
ExternalSorter to reduce merge fan-in
URL: https://github.com/apache/datafusion/pull/21629


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-16 Thread via GitHub


gratus00 commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3097558112


##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -402,381 +496,199 @@ impl ExternalSorter {
 self.metrics.spill_metrics.spill_file_count.value()
 }
 
-/// Appending globally sorted batches to the in-progress spill file, and 
clears
-/// the `globally_sorted_batches` (also its memory reservation) afterwards.
-async fn consume_and_spill_append(
-&mut self,
-globally_sorted_batches: &mut Vec,
-) -> Result<()> {
-if globally_sorted_batches.is_empty() {
-return Ok(());
-}
-
-// Lazily initialize the in-progress spill file
-if self.in_progress_spill_file.is_none() {
-self.in_progress_spill_file =
-Some((self.spill_manager.create_in_progress_file("Sorting")?, 
0));
-}
-
-Self::organize_stringview_arrays(globally_sorted_batches)?;
-
-debug!("Spilling sort data of ExternalSorter to disk whilst 
inserting");
-
-let batches_to_spill = std::mem::take(globally_sorted_batches);
-self.reservation.free();
-
-let (in_progress_file, max_record_batch_size) =
-self.in_progress_spill_file.as_mut().ok_or_else(|| {
-internal_datafusion_err!("In-progress spill file should be 
initialized")
-})?;
-
-for batch in batches_to_spill {
-in_progress_file.append_batch(&batch)?;
-
-*max_record_batch_size =
-(*max_record_batch_size).max(batch.get_sliced_size()?);
-}
-
-assert_or_internal_err!(
-globally_sorted_batches.is_empty(),
-"This function consumes globally_sorted_batches, so it should be 
empty after taking."
-);
-
-Ok(())
-}
-
-/// Finishes the in-progress spill file and moves it to the finished spill 
files.
-async fn spill_finish(&mut self) -> Result<()> {
-let (mut in_progress_file, max_record_batch_memory) =
-self.in_progress_spill_file.take().ok_or_else(|| {
-internal_datafusion_err!("Should be called after 
`spill_append`")
-})?;
-let spill_file = in_progress_file.finish()?;
-
-if let Some(spill_file) = spill_file {
-self.finished_spill_files.push(SortedSpillFile {
-file: spill_file,
-max_record_batch_memory,
-});
-}
-
-Ok(())
-}
-
-/// Reconstruct `globally_sorted_batches` to organize the payload buffers 
of each
-/// `StringViewArray` in sequential order by calling `gc()` on them.
-///
-/// Note this is a workaround until 
 is
-/// available
-///
-/// # Rationale
-/// After (merge-based) sorting, all batches will be sorted into a single 
run,
-/// but physically this sorted run is chunked into many small batches. For
-/// `StringViewArray`s inside each sorted run, their inner buffers are not
-/// re-constructed by default, leading to non-sequential payload locations
-/// (permutated by `interleave()` Arrow kernel). A single payload buffer 
might
-/// be shared by multiple `RecordBatch`es.
-/// When writing each batch to disk, the writer has to write all 
referenced buffers,
-/// because they have to be read back one by one to reduce memory usage. 
This
-/// causes extra disk reads and writes, and potentially execution failure.
+/// Spills sorted runs to disk.
 ///
-/// # Example
-/// Before sorting:
-/// batch1 -> buffer1
-/// batch2 -> buffer2
+/// Two strategies depending on available merge headroom:
 ///
-/// sorted_batch1 -> buffer1
-///   -> buffer2
-/// sorted_batch2 -> buffer1
-///   -> buffer2
+/// - **With headroom** (`merge_reservation > 0`): merge all runs into
+///   a single globally-sorted stream, then write to one spill file.
+///   Fewer spill files = lower fan-in for the final MultiLevelMerge.
 ///
-/// Then when spilling each batch, the writer has to write all referenced 
buffers
-/// repeatedly.
-fn organize_stringview_arrays(
-globally_sorted_batches: &mut Vec,
-) -> Result<()> {
-let mut organized_batches = 
Vec::with_capacity(globally_sorted_batches.len());
-
-for batch in globally_sorted_batches.drain(..) {
-let mut new_columns: Vec> =
-Vec::with_capacity(batch.num_columns());
+/// - **Without headroom** (`merge_reservation == 0`): spill each run
+///   as its own file. Avoids allocating merge cursor infrastructure
+///   when the pool has no room. MultiLevelMerge handles the higher
+///   fan-in with dynamic memory management.
+async fn spill_sorted_runs(&mut self) -> Result<()> {
+assert_or_internal_err!(
+s

Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-16 Thread via GitHub


mbutrovich commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4262519681

   More brainstorming with Claude:
   
   > ### Proposed fix
   > 
   > Don't coalesce schemas that have variable-length non-view columns 
(StringArray, BinaryArray, LargeStringArray, LargeBinaryArray, DictionaryArray 
with string/binary values). For these schemas, set the coalesce target to 
`batch_size` instead of `sort_coalesce_target_rows`:
   > 
   > - Full-size batches pass through uncoalesced and sort individually, same 
as `main` — no regression
   > - Small batches (from filters, joins, pushdown) coalesce up to 
`batch_size` only — some fan-in reduction without blowing cache
   > - StringView and fixed-width schemas keep the full 32K coalesce target and 
retain all speedups
   > 
   > Fan-in reduction is the price we pay for StringArray schemas. We can't 
optimize the merge if the sort itself is 2.5x slower getting there.
   > 
   > Comet does not support StringView — all string columns use StringArray, so 
this regression directly affects Comet workloads.
   > 
   > ### Why check all columns, not just sort keys?
   > 
   > `take` reorders all columns in the batch, not just sort keys. Wide schemas 
(e.g. small string key + 70KB string value) are especially vulnerable — 
coalescing to 32K means `take` scatter-gathers across ~2.2GB of value data.
   > 
   > ### Alternatives considered
   > 
   > **Smaller intermediate target (e.g. 16K):** Any coalescing beyond 
`batch_size` amplifies cache pressure for wide schemas proportionally. No clear 
sweet spot.
   > 
   > **Convert StringArray → StringView before take:** The gather (conversion) 
is cheap and `take` on views is cheap, but the resulting views point to the 
original unsorted buffer. No memory freed until the entire buffer drops — ~2x 
memory bloat, wrong tradeoff for a spilling sort.
   > 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-16 Thread via GitHub


mbutrovich commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4261197731

   Marking as draft while we explore performance trade-offs. I don't want to 
accidentally merge this yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-16 Thread via GitHub


mbutrovich commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4261195635

   ## Benchmark Analysis: `externalsorter` vs `main`
   
   Ran `cargo bench --bench sort -p datafusion` on both branches.
   
   ### Improvements (numeric, StringView, single-column string)
   
   | Benchmark | externalsorter | main | Speedup |
   |:---|---:|---:|---:|
   | sort i64 100k | 2.40 ms | 3.88 ms | 1.61x |
   | sort f64 100k | 2.61 ms | 3.94 ms | 1.51x |
   | sort merge i64 1M | 26.6 ms | 38.5 ms | 1.45x |
   | sort i64 1M | 33.8 ms | 45.5 ms | 1.35x |
   | sort f64 1M | 35.8 ms | 46.4 ms | 1.30x |
   | sort utf8 high cardinality 1M | 59.7 ms | 74.0 ms | 1.24x |
   | sort utf8 view tuple 1M | 8.1 ms | 9.1 ms | 1.13x |
   
   ### Regressions (multi-column StringArray and Dictionary)
   
   | Benchmark | externalsorter | main | Slowdown |
   |:---|---:|---:|---:|
   | sort utf8 tuple 100k (3x StringArray) | 23.7 ms | 9.5 ms | 2.5x |
   | sort utf8 tuple 1M (3x StringArray) | 272 ms | 111 ms | 2.5x |
   | sort mixed dictionary tuple 1M (3x dict + i64) | 283 ms | 108 ms | 2.6x |
   | sort utf8 dictionary tuple 1M (3x dict) | 92.4 ms | 78.5 ms | 1.18x |
   
   ### Root cause
   
   The coalesce-then-sort pipeline does: coalesce (copy all column data) -> 
`lexsort_to_indices` -> `take` (random-access scatter-gather) -> chunk back to 
`batch_size`. With multiple StringArray or Dictionary columns at 32K rows, the 
`take` scatter-gathers across ~1.9 MB of string heap data, exceeding L2 cache.
   
   Schemas that **don't** regress help pinpoint the cause:
   - **Single StringArray** (e.g. `sort utf8 high cardinality 1M`, +1.24x): one 
~640KB string buffer fits in L2 during `take`
   - **StringViewArray** (e.g. `sort utf8 view tuple 1M`, +1.13x): `take` 
copies fixed-size 16-byte view structs — no heap random access
   - **Fixed-width** (i64, f64): coalesce is memcpy, `take` is sequential — all 
cache-friendly
   
   ### Benchmark batch size
   
   The sort benchmark uses `BATCH_SIZE = 1024`, while DataFusion's default is 
8192. This makes the coalesce expand 32x (1024 -> 32768) instead of 4x (8192 -> 
32768), amplifying the copy cost. Small batches are a valid scenario though — 
filters, joins, and sources with pushdown can all produce sub-8192 batches.
   
   ### Proposed fix
   
   Arrow's `BatchCoalescer` has a 
`with_biggest_coalesce_batch_size(Some(limit))` option: batches larger than 
`limit` bypass coalescing and pass through directly. For schemas with non-view 
variable-length columns (StringArray, BinaryArray, DictionaryArray), set this 
to `batch_size`. This way:
   - Full-size batches (>= 8192 rows) pass through without string copying — 
sorted directly as individual runs
   - Small batches still coalesce to reduce fan-in
   - Fixed-width and StringView schemas keep the full 32K coalesce target 
unchanged
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-16 Thread via GitHub


mbutrovich commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4259841218

   > Would be nice to see results in `sort` bench / `sort_tpch` etc. to see if 
we didnt have any regressions.
   
   I’ll try to run locally with the scaled up `sort` benchmark. We may want to 
consider scaling some of the scenarios independently if 1M is too big to apply 
across the board for the bot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-15 Thread via GitHub


Dandandan commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4257501222

   Would be nice to see results in `sort` bench / `sort_tpch` etc. to see if we 
didnt have any regressions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-15 Thread via GitHub


Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3090863657


##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -95,116 +93,76 @@ impl ExternalSorterMetrics {
 ///
 /// # Algorithm
 ///
-/// 1. get a non-empty new batch from input
+/// Incoming batches are coalesced via [`BatchCoalescer`] to
+/// `sort_coalesce_target_rows` (default 32768) before sorting. This
+/// reduces merge fan-in by producing fewer, larger sorted runs.
 ///
-/// 2. check with the memory manager there is sufficient space to
-///buffer the batch in memory.
+/// Each coalesced batch is sorted immediately and stored as a
+/// pre-sorted run.
 ///
-/// 2.1 if memory is sufficient, buffer batch in memory, go to 1.
+/// 1. For each incoming batch:
+///- Reserve memory (2x batch size). If reservation fails, flush
+///  the coalescer, spill all sorted runs to disk, then retry.
+///- Push batch into the coalescer.
+///- If the coalescer reached its target: sort the coalesced batch
+///  and store as a new sorted run.
 ///
-/// 2.2 if no more memory is available, sort all buffered batches and
-/// spill to file.  buffer the next batch in memory, go to 1.
-///
-/// 3. when input is exhausted, merge all in memory batches and spills
-///to get a total order.
+/// 2. When input is exhausted, merge all sorted runs (and any spill
+///files) to produce a total order.
 ///
 /// # When data fits in available memory
 ///
-/// If there is sufficient memory, data is sorted in memory to produce the 
output
+/// Sorted runs are merged in memory using a loser-tree k-way merge
+/// (via [`StreamingMergeBuilder`]).
 ///
 /// ```text
-///┌─┐
-///│  2  │
-///│  3  │
-///│  1  │─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
-///│  4  │
-///│  2  │  │
-///└─┘  ▼
-///┌─┐
-///│  1  │  In memory
-///│  4  │─ ─ ─ ─ ─ ─▶ sort/merge  ─ ─ ─ ─ ─▶  total sorted output
-///│  1  │
-///└─┘  ▲
-///  ...│
-///
-///┌─┐  │
-///│  4  │
-///│  3  │─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
-///└─┘
-///
-/// in_mem_batches
+///   ┌──┐ ┌┐ ┌──┐ ┌┐
+///   │ Incoming │▶│  Batch │▶│ Sort │▶│ Sorted Run │
+///   │ Batches  │ │ Coalescer  │ │  │ │ (in memory)│
+///   └──┘ └┘ └──┘ └─┬──┘
+///  │
+///   ┌──┘
+///   ▼
+///k-way merge (loser tree)
+///   │
+///   ▼
+///total sorted output
 /// ```
 ///
 /// # When data does not fit in available memory
 ///
-///  When memory is exhausted, data is first sorted and written to one
-///  or more spill files on disk:
-///
-/// ```text
-///┌─┐   .─.
-///│  2  │  (   )
-///│  3  │  │`─'│
-///│  1  │─ ─ ─ ─ ─ ─ ─ │  ┌┐   │
-///│  4  │ ││  │ 1  │░  │
-///│  2  │  │  │... │░  │
-///└─┘ ▼│  │ 4  │░  ┌ ─ ─   │
-///┌─┐  │  └┘░1  │░ │
-///│  1  │ In memory│   ░░  │░░ │
-///│  4  │─ ─ ▶   sort/merge─ ─ ─ ─ ┼ ─ ─ ─ ─ ─▶ ... │░ │
-///│  1  │ and write to file│   │░░ │
-///└─┘  │ 4  │░ │
-///  ...   ▲│   └░─░─░░ │
-///││░░ │
-///┌─┐  │.─.│
-///│  4  │ │(   )
-///│  3  │─ ─ ─ ─ ─ ─ ─  `─'
-///└─┘
-///
-/// in_mem_batches  spills
-/// (file on disk in Arrow
-///   IPC format)
-/// ```
-///
-/// Once the input is completely read, the spill files are read and
-/// merged with any in memory batches to produce a single total sorted
-/// output:
+/// When memory is exhausted, sorted runs are spilled directly to disk
+/// (one spill file per run — no merge needed since runs are already
+/// sorted). `MultiLevelMergeBuilder` handles the final merge from disk
+/// with dynamic fan-in.
 ///
 /// ```text
-///   .─.
-///  (   )
-///  │`─'│
-///  │  ┌┐   │

Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-15 Thread via GitHub


adriangbot commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4255675609

   🤖 Criterion benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4255649635)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4255649635-1332-2jl79 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing externalsorter (01983025442464d834808a036564f709af58aaff) to 
5c653be (merge-base) 
[diff](https://github.com/apache/datafusion/compare/5c653bee5da64003915f6dfeb3da15759b091a8d..01983025442464d834808a036564f709af58aaff)
   BENCH_NAME=sort
   BENCH_COMMAND=cargo bench --features=parquet --bench sort
   BENCH_FILTER=
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-15 Thread via GitHub


mbutrovich commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4255649635

   run benchmark sort


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-15 Thread via GitHub


mbutrovich commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3088742930


##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -95,116 +93,76 @@ impl ExternalSorterMetrics {
 ///
 /// # Algorithm
 ///
-/// 1. get a non-empty new batch from input
+/// Incoming batches are coalesced via [`BatchCoalescer`] to
+/// `sort_coalesce_target_rows` (default 32768) before sorting. This
+/// reduces merge fan-in by producing fewer, larger sorted runs.
 ///
-/// 2. check with the memory manager there is sufficient space to
-///buffer the batch in memory.
+/// Each coalesced batch is sorted immediately and stored as a
+/// pre-sorted run.
 ///
-/// 2.1 if memory is sufficient, buffer batch in memory, go to 1.
+/// 1. For each incoming batch:
+///- Reserve memory (2x batch size). If reservation fails, flush
+///  the coalescer, spill all sorted runs to disk, then retry.
+///- Push batch into the coalescer.
+///- If the coalescer reached its target: sort the coalesced batch
+///  and store as a new sorted run.
 ///
-/// 2.2 if no more memory is available, sort all buffered batches and
-/// spill to file.  buffer the next batch in memory, go to 1.
-///
-/// 3. when input is exhausted, merge all in memory batches and spills
-///to get a total order.
+/// 2. When input is exhausted, merge all sorted runs (and any spill
+///files) to produce a total order.
 ///
 /// # When data fits in available memory
 ///
-/// If there is sufficient memory, data is sorted in memory to produce the 
output
+/// Sorted runs are merged in memory using a loser-tree k-way merge
+/// (via [`StreamingMergeBuilder`]).
 ///
 /// ```text
-///┌─┐
-///│  2  │
-///│  3  │
-///│  1  │─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
-///│  4  │
-///│  2  │  │
-///└─┘  ▼
-///┌─┐
-///│  1  │  In memory
-///│  4  │─ ─ ─ ─ ─ ─▶ sort/merge  ─ ─ ─ ─ ─▶  total sorted output
-///│  1  │
-///└─┘  ▲
-///  ...│
-///
-///┌─┐  │
-///│  4  │
-///│  3  │─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
-///└─┘
-///
-/// in_mem_batches
+///   ┌──┐ ┌┐ ┌──┐ ┌┐
+///   │ Incoming │▶│  Batch │▶│ Sort │▶│ Sorted Run │
+///   │ Batches  │ │ Coalescer  │ │  │ │ (in memory)│
+///   └──┘ └┘ └──┘ └─┬──┘
+///  │
+///   ┌──┘
+///   ▼
+///k-way merge (loser tree)
+///   │
+///   ▼
+///total sorted output
 /// ```
 ///
 /// # When data does not fit in available memory
 ///
-///  When memory is exhausted, data is first sorted and written to one
-///  or more spill files on disk:
-///
-/// ```text
-///┌─┐   .─.
-///│  2  │  (   )
-///│  3  │  │`─'│
-///│  1  │─ ─ ─ ─ ─ ─ ─ │  ┌┐   │
-///│  4  │ ││  │ 1  │░  │
-///│  2  │  │  │... │░  │
-///└─┘ ▼│  │ 4  │░  ┌ ─ ─   │
-///┌─┐  │  └┘░1  │░ │
-///│  1  │ In memory│   ░░  │░░ │
-///│  4  │─ ─ ▶   sort/merge─ ─ ─ ─ ┼ ─ ─ ─ ─ ─▶ ... │░ │
-///│  1  │ and write to file│   │░░ │
-///└─┘  │ 4  │░ │
-///  ...   ▲│   └░─░─░░ │
-///││░░ │
-///┌─┐  │.─.│
-///│  4  │ │(   )
-///│  3  │─ ─ ─ ─ ─ ─ ─  `─'
-///└─┘
-///
-/// in_mem_batches  spills
-/// (file on disk in Arrow
-///   IPC format)
-/// ```
-///
-/// Once the input is completely read, the spill files are read and
-/// merged with any in memory batches to produce a single total sorted
-/// output:
+/// When memory is exhausted, sorted runs are spilled directly to disk
+/// (one spill file per run — no merge needed since runs are already
+/// sorted). `MultiLevelMergeBuilder` handles the final merge from disk
+/// with dynamic fan-in.
 ///
 /// ```text
-///   .─.
-///  (   )
-///  │`─'│
-///  │  ┌┐   │

Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-15 Thread via GitHub


Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3088547619


##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -323,56 +285,188 @@ impl ExternalSorter {
 self.reserve_memory_for_batch_and_maybe_spill(&input)
 .await?;
 
-self.in_mem_batches.push(input);
+let coalescer = self
+.coalescer
+.as_mut()
+.expect("coalescer must exist during insert phase");
+coalescer
+.push_batch(input)
+.map_err(|e| DataFusionError::ArrowError(Box::new(e), None))?;
+
+self.drain_completed_batches()?;
+
+Ok(())
+}
+
+/// Drains completed (full) batches from the coalescer, sorts each,
+/// and appends the sorted chunks to `sorted_runs`.
+fn drain_completed_batches(&mut self) -> Result<()> {
+// Collect completed batches first to avoid borrow conflict
+let mut completed = vec![];
+if let Some(coalescer) = self.coalescer.as_mut() {
+while let Some(batch) = coalescer.next_completed_batch() {
+completed.push(batch);
+}
+}
+for batch in &completed {
+self.sort_and_store_run(batch)?;

Review Comment:
   Nice - I wonder if a part of the speedup  comes from this (sorting batch 
immediately (while in CPU cache) instead of waiting to the end.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-15 Thread via GitHub


Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3088547619


##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -323,56 +285,188 @@ impl ExternalSorter {
 self.reserve_memory_for_batch_and_maybe_spill(&input)
 .await?;
 
-self.in_mem_batches.push(input);
+let coalescer = self
+.coalescer
+.as_mut()
+.expect("coalescer must exist during insert phase");
+coalescer
+.push_batch(input)
+.map_err(|e| DataFusionError::ArrowError(Box::new(e), None))?;
+
+self.drain_completed_batches()?;
+
+Ok(())
+}
+
+/// Drains completed (full) batches from the coalescer, sorts each,
+/// and appends the sorted chunks to `sorted_runs`.
+fn drain_completed_batches(&mut self) -> Result<()> {
+// Collect completed batches first to avoid borrow conflict
+let mut completed = vec![];
+if let Some(coalescer) = self.coalescer.as_mut() {
+while let Some(batch) = coalescer.next_completed_batch() {
+completed.push(batch);
+}
+}
+for batch in &completed {
+self.sort_and_store_run(batch)?;

Review Comment:
   Nice - I wonder if a part of the speedup  comes from this, sorting batch 
immediately (while in CPU cache) instead of waiting to the end.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-15 Thread via GitHub


mbutrovich commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4254421073

   I fear my sort benchmark scale up might have killed the benchmark bot :(


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-15 Thread via GitHub


Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3088501037


##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -95,116 +93,76 @@ impl ExternalSorterMetrics {
 ///
 /// # Algorithm
 ///
-/// 1. get a non-empty new batch from input
+/// Incoming batches are coalesced via [`BatchCoalescer`] to
+/// `sort_coalesce_target_rows` (default 32768) before sorting. This
+/// reduces merge fan-in by producing fewer, larger sorted runs.
 ///
-/// 2. check with the memory manager there is sufficient space to
-///buffer the batch in memory.
+/// Each coalesced batch is sorted immediately and stored as a
+/// pre-sorted run.
 ///
-/// 2.1 if memory is sufficient, buffer batch in memory, go to 1.
+/// 1. For each incoming batch:
+///- Reserve memory (2x batch size). If reservation fails, flush
+///  the coalescer, spill all sorted runs to disk, then retry.
+///- Push batch into the coalescer.
+///- If the coalescer reached its target: sort the coalesced batch
+///  and store as a new sorted run.
 ///
-/// 2.2 if no more memory is available, sort all buffered batches and
-/// spill to file.  buffer the next batch in memory, go to 1.
-///
-/// 3. when input is exhausted, merge all in memory batches and spills
-///to get a total order.
+/// 2. When input is exhausted, merge all sorted runs (and any spill
+///files) to produce a total order.
 ///
 /// # When data fits in available memory
 ///
-/// If there is sufficient memory, data is sorted in memory to produce the 
output
+/// Sorted runs are merged in memory using a loser-tree k-way merge
+/// (via [`StreamingMergeBuilder`]).
 ///
 /// ```text
-///┌─┐
-///│  2  │
-///│  3  │
-///│  1  │─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
-///│  4  │
-///│  2  │  │
-///└─┘  ▼
-///┌─┐
-///│  1  │  In memory
-///│  4  │─ ─ ─ ─ ─ ─▶ sort/merge  ─ ─ ─ ─ ─▶  total sorted output
-///│  1  │
-///└─┘  ▲
-///  ...│
-///
-///┌─┐  │
-///│  4  │
-///│  3  │─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
-///└─┘
-///
-/// in_mem_batches
+///   ┌──┐ ┌┐ ┌──┐ ┌┐
+///   │ Incoming │▶│  Batch │▶│ Sort │▶│ Sorted Run │
+///   │ Batches  │ │ Coalescer  │ │  │ │ (in memory)│
+///   └──┘ └┘ └──┘ └─┬──┘
+///  │
+///   ┌──┘
+///   ▼
+///k-way merge (loser tree)
+///   │
+///   ▼
+///total sorted output
 /// ```
 ///
 /// # When data does not fit in available memory
 ///
-///  When memory is exhausted, data is first sorted and written to one
-///  or more spill files on disk:
-///
-/// ```text
-///┌─┐   .─.
-///│  2  │  (   )
-///│  3  │  │`─'│
-///│  1  │─ ─ ─ ─ ─ ─ ─ │  ┌┐   │
-///│  4  │ ││  │ 1  │░  │
-///│  2  │  │  │... │░  │
-///└─┘ ▼│  │ 4  │░  ┌ ─ ─   │
-///┌─┐  │  └┘░1  │░ │
-///│  1  │ In memory│   ░░  │░░ │
-///│  4  │─ ─ ▶   sort/merge─ ─ ─ ─ ┼ ─ ─ ─ ─ ─▶ ... │░ │
-///│  1  │ and write to file│   │░░ │
-///└─┘  │ 4  │░ │
-///  ...   ▲│   └░─░─░░ │
-///││░░ │
-///┌─┐  │.─.│
-///│  4  │ │(   )
-///│  3  │─ ─ ─ ─ ─ ─ ─  `─'
-///└─┘
-///
-/// in_mem_batches  spills
-/// (file on disk in Arrow
-///   IPC format)
-/// ```
-///
-/// Once the input is completely read, the spill files are read and
-/// merged with any in memory batches to produce a single total sorted
-/// output:
+/// When memory is exhausted, sorted runs are spilled directly to disk
+/// (one spill file per run — no merge needed since runs are already
+/// sorted). `MultiLevelMergeBuilder` handles the final merge from disk
+/// with dynamic fan-in.
 ///
 /// ```text
-///   .─.
-///  (   )
-///  │`─'│
-///  │  ┌┐   │

Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-15 Thread via GitHub


Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3088501037


##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -95,116 +93,76 @@ impl ExternalSorterMetrics {
 ///
 /// # Algorithm
 ///
-/// 1. get a non-empty new batch from input
+/// Incoming batches are coalesced via [`BatchCoalescer`] to
+/// `sort_coalesce_target_rows` (default 32768) before sorting. This
+/// reduces merge fan-in by producing fewer, larger sorted runs.
 ///
-/// 2. check with the memory manager there is sufficient space to
-///buffer the batch in memory.
+/// Each coalesced batch is sorted immediately and stored as a
+/// pre-sorted run.
 ///
-/// 2.1 if memory is sufficient, buffer batch in memory, go to 1.
+/// 1. For each incoming batch:
+///- Reserve memory (2x batch size). If reservation fails, flush
+///  the coalescer, spill all sorted runs to disk, then retry.
+///- Push batch into the coalescer.
+///- If the coalescer reached its target: sort the coalesced batch
+///  and store as a new sorted run.
 ///
-/// 2.2 if no more memory is available, sort all buffered batches and
-/// spill to file.  buffer the next batch in memory, go to 1.
-///
-/// 3. when input is exhausted, merge all in memory batches and spills
-///to get a total order.
+/// 2. When input is exhausted, merge all sorted runs (and any spill
+///files) to produce a total order.
 ///
 /// # When data fits in available memory
 ///
-/// If there is sufficient memory, data is sorted in memory to produce the 
output
+/// Sorted runs are merged in memory using a loser-tree k-way merge
+/// (via [`StreamingMergeBuilder`]).
 ///
 /// ```text
-///┌─┐
-///│  2  │
-///│  3  │
-///│  1  │─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
-///│  4  │
-///│  2  │  │
-///└─┘  ▼
-///┌─┐
-///│  1  │  In memory
-///│  4  │─ ─ ─ ─ ─ ─▶ sort/merge  ─ ─ ─ ─ ─▶  total sorted output
-///│  1  │
-///└─┘  ▲
-///  ...│
-///
-///┌─┐  │
-///│  4  │
-///│  3  │─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
-///└─┘
-///
-/// in_mem_batches
+///   ┌──┐ ┌┐ ┌──┐ ┌┐
+///   │ Incoming │▶│  Batch │▶│ Sort │▶│ Sorted Run │
+///   │ Batches  │ │ Coalescer  │ │  │ │ (in memory)│
+///   └──┘ └┘ └──┘ └─┬──┘
+///  │
+///   ┌──┘
+///   ▼
+///k-way merge (loser tree)
+///   │
+///   ▼
+///total sorted output
 /// ```
 ///
 /// # When data does not fit in available memory
 ///
-///  When memory is exhausted, data is first sorted and written to one
-///  or more spill files on disk:
-///
-/// ```text
-///┌─┐   .─.
-///│  2  │  (   )
-///│  3  │  │`─'│
-///│  1  │─ ─ ─ ─ ─ ─ ─ │  ┌┐   │
-///│  4  │ ││  │ 1  │░  │
-///│  2  │  │  │... │░  │
-///└─┘ ▼│  │ 4  │░  ┌ ─ ─   │
-///┌─┐  │  └┘░1  │░ │
-///│  1  │ In memory│   ░░  │░░ │
-///│  4  │─ ─ ▶   sort/merge─ ─ ─ ─ ┼ ─ ─ ─ ─ ─▶ ... │░ │
-///│  1  │ and write to file│   │░░ │
-///└─┘  │ 4  │░ │
-///  ...   ▲│   └░─░─░░ │
-///││░░ │
-///┌─┐  │.─.│
-///│  4  │ │(   )
-///│  3  │─ ─ ─ ─ ─ ─ ─  `─'
-///└─┘
-///
-/// in_mem_batches  spills
-/// (file on disk in Arrow
-///   IPC format)
-/// ```
-///
-/// Once the input is completely read, the spill files are read and
-/// merged with any in memory batches to produce a single total sorted
-/// output:
+/// When memory is exhausted, sorted runs are spilled directly to disk
+/// (one spill file per run — no merge needed since runs are already
+/// sorted). `MultiLevelMergeBuilder` handles the final merge from disk
+/// with dynamic fan-in.
 ///
 /// ```text
-///   .─.
-///  (   )
-///  │`─'│
-///  │  ┌┐   │

Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-15 Thread via GitHub


adriangbot commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4253363299

   🤖 Criterion benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4253348811)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4253348811-1301-2rfg4 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing externalsorter (689dfd2ad59416ddb75f8cc691ae787271da7805) to 
d0692b8 (merge-base) 
[diff](https://github.com/apache/datafusion/compare/d0692b8ee999ed72cd2a8f7000884618adfc06ee..689dfd2ad59416ddb75f8cc691ae787271da7805)
   BENCH_NAME=sort
   BENCH_COMMAND=cargo bench --features=parquet --bench sort
   BENCH_FILTER=
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-15 Thread via GitHub


mbutrovich commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4253348811

   run benchmark sort


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


adriangbot commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4247068843

   🤖 Benchmark completed (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246955834)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and externalsorter
   
   Benchmark tpcds_sf1.json
   
   
┏━━━┳┳┳━━━┓
   ┃ Query ┃   HEAD ┃   
  externalsorter ┃Change ┃
   
┡━━━╇╇╇━━━┩
   │ QQuery 1  │ 49.74 / 50.95 ±0.80 / 52.26 ms │ 49.77 / 50.30 
±0.55 / 51.36 ms │ no change │
   │ QQuery 2  │  143.00 / 143.81 ±0.83 / 145.35 ms │  138.66 / 139.17 
±0.36 / 139.58 ms │ no change │
   │ QQuery 3  │  162.98 / 163.26 ±0.21 / 163.53 ms │  157.53 / 157.95 
±0.50 / 158.63 ms │ no change │
   │ QQuery 4  │  1612.56 / 1660.28 ±27.23 / 1688.23 ms │  1600.19 / 1626.35 
±15.20 / 1645.67 ms │ no change │
   │ QQuery 5  │  269.80 / 279.75 ±5.67 / 286.05 ms │  260.95 / 267.50 
±4.74 / 275.13 ms │ no change │
   │ QQuery 6  │  225.06 / 231.01 ±5.44 / 239.71 ms │  213.47 / 216.62 
±3.33 / 220.82 ms │ +1.07x faster │
   │ QQuery 7  │  369.21 / 373.52 ±3.40 / 378.70 ms │  362.94 / 364.15 
±1.10 / 365.71 ms │ no change │
   │ QQuery 8  │  176.29 / 180.13 ±2.51 / 183.75 ms │  171.53 / 174.34 
±2.10 / 178.06 ms │ no change │
   │ QQuery 9  │  102.02 / 111.04 ±9.41 / 129.21 ms │  100.99 / 103.19 
±2.14 / 106.76 ms │ +1.08x faster │
   │ QQuery 10 │  245.07 / 248.04 ±2.67 / 252.93 ms │  233.82 / 238.41 
±2.95 / 242.19 ms │ no change │
   │ QQuery 11 │ 808.76 / 824.14 ±10.53 / 836.10 ms │  810.15 / 819.46 
±6.49 / 826.78 ms │ no change │
   │ QQuery 12 │ 50.88 / 52.58 ±1.41 / 55.01 ms │ 50.86 / 51.56 
±0.71 / 52.81 ms │ no change │
   │ QQuery 13 │  399.54 / 401.90 ±2.09 / 405.30 ms │  400.11 / 403.87 
±3.19 / 408.63 ms │ no change │
   │ QQuery 14 │  2028.96 / 2043.27 ±11.43 / 2062.97 ms │  1902.55 / 1914.86 
±10.85 / 1932.72 ms │ +1.07x faster │
   │ QQuery 15 │ 88.42 / 89.51 ±1.03 / 91.14 ms │ 87.27 / 87.73 
±0.45 / 88.54 ms │ no chang

Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


adriangbot commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246979942

   🤖 Benchmark completed (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246870829)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and externalsorter
   
   Benchmark tpcds_sf1.json
   
   
┏━━━┳━━┳━━┳━━━┓
   ┃ Query ┃ HEAD ┃ 
  externalsorter ┃Change ┃
   
┡━━━╇━━╇━━╇━━━┩
   │ QQuery 1  │  7.01 / 7.45 ±0.76 / 8.97 ms │  6.86 / 
7.33 ±0.76 / 8.86 ms │ no change │
   │ QQuery 2  │143.89 / 145.42 ±1.02 / 146.77 ms │144.14 / 
144.53 ±0.34 / 144.98 ms │ no change │
   │ QQuery 3  │113.87 / 114.65 ±0.88 / 116.36 ms │112.84 / 
113.44 ±0.33 / 113.81 ms │ no change │
   │ QQuery 4  │1401.22 / 1414.73 ±10.96 / 1426.97 ms │1375.61 / 
1415.42 ±34.40 / 1474.26 ms │ no change │
   │ QQuery 5  │172.81 / 173.85 ±0.86 / 175.03 ms │171.04 / 
173.32 ±1.69 / 176.20 ms │ no change │
   │ QQuery 6  │   849.72 / 867.23 ±21.92 / 910.12 ms │   844.09 / 
868.62 ±14.87 / 886.90 ms │ no change │
   │ QQuery 7  │342.84 / 346.08 ±2.21 / 348.99 ms │340.64 / 
346.49 ±3.18 / 349.42 ms │ no change │
   │ QQuery 8  │114.66 / 118.26 ±2.55 / 121.27 ms │115.19 / 
117.12 ±1.25 / 119.03 ms │ no change │
   │ QQuery 9  │102.77 / 106.97 ±2.19 / 108.84 ms │101.33 / 
104.25 ±2.34 / 107.23 ms │ no change │
   │ QQuery 10 │106.87 / 108.20 ±1.11 / 109.99 ms │106.29 / 
107.45 ±0.83 / 108.26 ms │ no change │
   │ QQuery 11 │  979.18 / 996.03 ±12.64 / 1016.91 ms │  978.22 / 
994.80 ±11.45 / 1005.19 ms │ no change │
   │ QQuery 12 │   45.78 / 46.90 ±0.77 / 47.81 ms │   46.11 / 
46.62 ±0.56 / 47.52 ms │ no change │
   │ QQuery 13 │400.14 / 406.86 ±4.06 / 411.44 ms │399.86 / 
401.85 ±1.71 / 404.03 ms │ no change │
   │ QQuery 14 │ 1005.02 / 1010.73 ±4.14 / 1016.44 ms │ 1006.11 / 
1017.87 ±5.97 / 1022.22 ms │ no change │
   │ QQuery 15 │   15.86 / 16.77 ±0.

Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


adriangbot commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246972434

   🤖 Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246955834)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4246955834-1261-fd4rq 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing externalsorter (32a688271e2c9cb67b47454a2bc72f3cb926cfee) to 
e6b32fe (merge-base) 
[diff](https://github.com/apache/datafusion/compare/e6b32fe66b4e8e313ff79a9c45585d1a4bc2a19e..32a688271e2c9cb67b47454a2bc72f3cb926cfee)
 using: tpcds
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


mbutrovich commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246955834

   run benchmark tpcds
   
   ```
   env:
  PREFER_HASH_JOIN: false
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


adriangbot commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246886420

   Benchmark for [this 
request](https://github.com/apache/datafusion/pull/21629#issuecomment-4246870829)
 failed.
   
   Last 20 lines of output:
   Click to expand
   
   ```
   struct_query_sql
   substr
   substr_index
   substring
   sum
   to_char
   to_hex
   to_local_time
   to_time
   to_timestamp
   topk_aggregate
   topk_repartition
   translate
   trim
   trunc
   unhex
   upper
   uuid
   window_query_sql
   with_hashes
   ```
   
   
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


adriangbot commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246885961

   🤖 Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246870829)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4246870829-1256-zqlsf 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing externalsorter (32a688271e2c9cb67b47454a2bc72f3cb926cfee) to 
e6b32fe (merge-base) 
[diff](https://github.com/apache/datafusion/compare/e6b32fe66b4e8e313ff79a9c45585d1a4bc2a19e..32a688271e2c9cb67b47454a2bc72f3cb926cfee)
 using: tpcds
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


adriangbot commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246885904

   🤖 Criterion benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246870829)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4246870829-1257-bwm8p 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing externalsorter (32a688271e2c9cb67b47454a2bc72f3cb926cfee) to 
e6b32fe (merge-base) 
[diff](https://github.com/apache/datafusion/compare/e6b32fe66b4e8e313ff79a9c45585d1a4bc2a19e..32a688271e2c9cb67b47454a2bc72f3cb926cfee)
   BENCH_NAME=tpch_sort
   BENCH_COMMAND=cargo bench --features=parquet --bench tpch_sort
   BENCH_FILTER=
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


Dandandan commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246870829

   run benchmark tpcds tpch_sort


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


adriangbot commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246806188

   🤖 Benchmark completed (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246734700)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and externalsorter
   
   Benchmark tpch_sf10.json
   
   
┏━━━┳┳┳━━┓
   ┃ Query ┃   HEAD ┃ 
externalsorter ┃   Change ┃
   
┡━━━╇╇╇━━┩
   │ QQuery 1  │  371.99 / 376.96 ±2.69 / 379.67 ms │  373.11 / 374.56 ±0.99 / 
376.21 ms │no change │
   │ QQuery 2  │  136.84 / 139.64 ±2.41 / 142.60 ms │  137.37 / 141.53 ±3.83 / 
148.61 ms │no change │
   │ QQuery 3  │  290.81 / 297.53 ±3.84 / 301.08 ms │  294.21 / 297.24 ±2.49 / 
301.82 ms │no change │
   │ QQuery 4  │  157.06 / 158.25 ±1.04 / 160.18 ms │  163.17 / 163.98 ±0.65 / 
164.88 ms │no change │
   │ QQuery 5  │  428.84 / 434.96 ±5.27 / 444.66 ms │  451.75 / 457.04 ±4.95 / 
464.64 ms │ 1.05x slower │
   │ QQuery 6  │  132.66 / 134.17 ±1.23 / 135.89 ms │  133.50 / 134.61 ±0.86 / 
135.48 ms │no change │
   │ QQuery 7  │ 563.02 / 587.09 ±15.45 / 606.59 ms │  587.86 / 594.62 ±4.33 / 
599.57 ms │no change │
   │ QQuery 8  │  483.62 / 492.86 ±7.21 / 503.79 ms │  478.84 / 491.04 ±9.87 / 
506.50 ms │no change │
   │ QQuery 9  │  707.85 / 714.91 ±6.70 / 724.34 ms │ 669.57 / 691.14 ±14.47 / 
708.41 ms │no change │
   │ QQuery 10 │  333.70 / 347.25 ±7.06 / 354.05 ms │ 340.97 / 354.39 ±10.48 / 
368.41 ms │no change │
   │ QQuery 11 │  104.30 / 112.04 ±5.42 / 119.20 ms │  110.99 / 115.85 ±3.85 / 
122.81 ms │no change │
   │ QQuery 12 │  202.29 / 204.71 ±2.28 / 208.32 ms │  208.39 / 209.76 ±1.29 / 
212.04 ms │no change │
   │ QQuery 13 │  305.63 / 319.39 ±7.94 / 328.85 ms │  309.99 / 315.73 ±6.53 / 
327.63 ms │no change │
   │ QQuery 14 │  180.75 / 182.10 ±0.99 / 183.64 ms │  181.13 / 182.52 ±1.44 / 
185.24 ms │no change │
   │ QQuery 15 │  338.13 / 342.58 ±2.38 / 345.31 ms │  331.27 / 333.51 ±2.16 / 
337.52 ms │no change │
   │ QQuery 16 │ 88.02 / 90.32 ±1.74 / 92.47 ms │ 84.29 / 86.72 ±2.30 / 
89.96 ms │no change │
   │ QQuery 17 │ 771.91 / 803.56 ±18.47 / 824.96 ms

Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


mbutrovich commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246757710

   I also opened a new PR that scales our existing sort benchmark to larger 
numbers: https://github.com/apache/datafusion/pull/21630
   
   In theory we should merge that first, then run the sort benchmark against 
`main` after that is in. In the meantime I can run that locally and update with 
results.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


adriangbot commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246749034

   🤖 Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246734700)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4246734700-1255-84ftb 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing externalsorter (32a688271e2c9cb67b47454a2bc72f3cb926cfee) to 
e6b32fe (merge-base) 
[diff](https://github.com/apache/datafusion/compare/e6b32fe66b4e8e313ff79a9c45585d1a4bc2a19e..32a688271e2c9cb67b47454a2bc72f3cb926cfee)
 using: tpch10
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


Dandandan commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246734700

   run benchmark tpch10


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


Dandandan commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246733187

   Lets compare to hash join
   
   run benchmark tpch10


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


Dandandan commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246726760

   Wooh, nice results


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


adriangbot commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246681907

   🤖 Benchmark completed (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246572576)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and externalsorter
   
   Benchmark tpch_sf10.json
   
   
┏━━━┳┳┳━━━┓
   ┃ Query ┃   HEAD ┃   
  externalsorter ┃Change ┃
   
┡━━━╇╇╇━━━┩
   │ QQuery 1  │  369.64 / 376.08 ±5.23 / 382.99 ms │  369.46 / 374.50 
±3.39 / 378.57 ms │ no change │
   │ QQuery 2  │ 485.14 / 497.33 ±11.38 / 515.80 ms │  438.26 / 446.15 
±6.06 / 456.09 ms │ +1.11x faster │
   │ QQuery 3  │ 589.99 / 660.25 ±37.61 / 696.76 ms │  514.23 / 531.74 
±9.61 / 542.05 ms │ +1.24x faster │
   │ QQuery 4  │ 407.91 / 473.96 ±36.86 / 509.58 ms │  334.86 / 341.61 
±6.01 / 352.42 ms │ +1.39x faster │
   │ QQuery 5  │  1118.92 / 1156.54 ±22.75 / 1181.28 ms │  1025.31 / 1066.24 
±22.28 / 1090.89 ms │ +1.08x faster │
   │ QQuery 6  │  133.59 / 136.32 ±2.93 / 141.42 ms │  134.30 / 140.19 
±6.63 / 152.36 ms │ no change │
   │ QQuery 7  │  1563.72 / 1581.82 ±16.09 / 1609.94 ms │  1361.03 / 1383.17 
±20.21 / 1417.51 ms │ +1.14x faster │
   │ QQuery 8  │ 1471.30 / 1899.23 ±344.46 / 2196.56 ms │ 1160.48 / 1254.40 
±138.94 / 1530.54 ms │ +1.51x faster │
   │ QQuery 9  │ 2125.14 / 2373.70 ±187.56 / 2699.90 ms │  1781.61 / 1916.18 
±81.59 / 2024.46 ms │ +1.24x faster │
   │ QQuery 10 │  544.27 / 550.30 ±3.86 / 555.77 ms │ 495.92 / 510.99 
±10.93 / 524.00 ms │ +1.08x faster │
   │ QQuery 11 │ 455.29 / 469.03 ±11.12 / 484.05 ms │  426.61 / 433.24 
±5.62 / 442.76 ms │ +1.08x faster │
   │ QQuery 12 │  295.00 / 301.36 ±4.84 / 308.51 ms │  279.36 / 282.79 
±2.83 / 286.06 ms │ +1.07x faster │
   │ QQuery 13 │  370.67 / 377.16 ±7.15 / 391.07 ms │  344.14 / 353.84 
±5.61 / 360.05 ms │ +1.07x faster │
   │ QQuery 14 │  193.64 / 199.80 ±5.22 / 207.45 ms │  191.75 / 199.09 
±6.65 / 209.34 ms │ no change │
   │ QQuery 15 │  322.12 / 329.85 ±6.15 / 339.31 ms │  321.21 / 329.66 
±7.24 / 342.24 ms │ no chang

Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


mbutrovich commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3081924897


##
datafusion/common/src/config.rs:
##
@@ -555,7 +555,17 @@ config_namespace! {
 /// When sorting, below what size should data be concatenated
 /// and sorted in a single RecordBatch rather than sorted in
 /// batches and merged.
-pub sort_in_place_threshold_bytes: usize, default = 1024 * 1024
+///
+/// Deprecated: this option is no longer used. The sort pipeline
+/// now always coalesces batches before sorting. Use
+/// `sort_coalesce_target_rows` instead.
+pub sort_in_place_threshold_bytes: usize, warn = 
"`sort_in_place_threshold_bytes` is deprecated and ignored. Use 
`sort_coalesce_target_rows` instead.", default = 1024 * 1024
+
+/// Target number of rows to coalesce before sorting in ExternalSorter.
+///
+/// Larger values reduce merge fan-in by producing fewer, larger
+/// sorted runs.
+pub sort_coalesce_target_rows: usize, default = 32768

Review Comment:
   Yeah I suspect we'd want to do a good sensitivity analysis on different 
types and batch sizes for `lexsort_to_indices` (and eventually the radix sort 
kernel). We might hit a point of diminishing returns/cache friendliness if our 
coalesced batches get too large.
   
   This design also first spills from the sorted runs, so holding more unsorted 
rows in the coalescer may make it more likely for us to trigger spilling.
   
   I'm definitely of the mind that we can and should tune this, but unclear 
what even a reasonable default right now would be. In Comet where we run TPC-H 
SF 1000, for example,  I suspect we'll want longer sorted runs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


adriangbot commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246583949

   🤖 Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246572576)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4246572576-1254-cr9dg 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing externalsorter (32a688271e2c9cb67b47454a2bc72f3cb926cfee) to 
e6b32fe (merge-base) 
[diff](https://github.com/apache/datafusion/compare/e6b32fe66b4e8e313ff79a9c45585d1a4bc2a19e..32a688271e2c9cb67b47454a2bc72f3cb926cfee)
 using: tpch10
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


mbutrovich commented on PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246572576

   run benchmark tpch10
   
   ```
   env:
  PREFER_HASH_JOIN: false
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]

2026-04-14 Thread via GitHub


Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3081762076


##
datafusion/common/src/config.rs:
##
@@ -555,7 +555,17 @@ config_namespace! {
 /// When sorting, below what size should data be concatenated
 /// and sorted in a single RecordBatch rather than sorted in
 /// batches and merged.
-pub sort_in_place_threshold_bytes: usize, default = 1024 * 1024
+///
+/// Deprecated: this option is no longer used. The sort pipeline
+/// now always coalesces batches before sorting. Use
+/// `sort_coalesce_target_rows` instead.
+pub sort_in_place_threshold_bytes: usize, warn = 
"`sort_in_place_threshold_bytes` is deprecated and ignored. Use 
`sort_coalesce_target_rows` instead.", default = 1024 * 1024
+
+/// Target number of rows to coalesce before sorting in ExternalSorter.
+///
+/// Larger values reduce merge fan-in by producing fewer, larger
+/// sorted runs.
+pub sort_coalesce_target_rows: usize, default = 32768

Review Comment:
   I wonder if we can make this somewhat adaptive: as we usually load 
everything in memory, it seems for very large sets larger batches would be even 
more favorable (e.g. use 10MiB "scratch space" for coalescing instead of 32KiB 
rows would make sense if our data is 1GiB and perhaps be even faster?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]