Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich closed pull request #21629: perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in URL: https://github.com/apache/datafusion/pull/21629 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
gratus00 commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3097558112
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -402,381 +496,199 @@ impl ExternalSorter {
self.metrics.spill_metrics.spill_file_count.value()
}
-/// Appending globally sorted batches to the in-progress spill file, and
clears
-/// the `globally_sorted_batches` (also its memory reservation) afterwards.
-async fn consume_and_spill_append(
-&mut self,
-globally_sorted_batches: &mut Vec,
-) -> Result<()> {
-if globally_sorted_batches.is_empty() {
-return Ok(());
-}
-
-// Lazily initialize the in-progress spill file
-if self.in_progress_spill_file.is_none() {
-self.in_progress_spill_file =
-Some((self.spill_manager.create_in_progress_file("Sorting")?,
0));
-}
-
-Self::organize_stringview_arrays(globally_sorted_batches)?;
-
-debug!("Spilling sort data of ExternalSorter to disk whilst
inserting");
-
-let batches_to_spill = std::mem::take(globally_sorted_batches);
-self.reservation.free();
-
-let (in_progress_file, max_record_batch_size) =
-self.in_progress_spill_file.as_mut().ok_or_else(|| {
-internal_datafusion_err!("In-progress spill file should be
initialized")
-})?;
-
-for batch in batches_to_spill {
-in_progress_file.append_batch(&batch)?;
-
-*max_record_batch_size =
-(*max_record_batch_size).max(batch.get_sliced_size()?);
-}
-
-assert_or_internal_err!(
-globally_sorted_batches.is_empty(),
-"This function consumes globally_sorted_batches, so it should be
empty after taking."
-);
-
-Ok(())
-}
-
-/// Finishes the in-progress spill file and moves it to the finished spill
files.
-async fn spill_finish(&mut self) -> Result<()> {
-let (mut in_progress_file, max_record_batch_memory) =
-self.in_progress_spill_file.take().ok_or_else(|| {
-internal_datafusion_err!("Should be called after
`spill_append`")
-})?;
-let spill_file = in_progress_file.finish()?;
-
-if let Some(spill_file) = spill_file {
-self.finished_spill_files.push(SortedSpillFile {
-file: spill_file,
-max_record_batch_memory,
-});
-}
-
-Ok(())
-}
-
-/// Reconstruct `globally_sorted_batches` to organize the payload buffers
of each
-/// `StringViewArray` in sequential order by calling `gc()` on them.
-///
-/// Note this is a workaround until
is
-/// available
-///
-/// # Rationale
-/// After (merge-based) sorting, all batches will be sorted into a single
run,
-/// but physically this sorted run is chunked into many small batches. For
-/// `StringViewArray`s inside each sorted run, their inner buffers are not
-/// re-constructed by default, leading to non-sequential payload locations
-/// (permutated by `interleave()` Arrow kernel). A single payload buffer
might
-/// be shared by multiple `RecordBatch`es.
-/// When writing each batch to disk, the writer has to write all
referenced buffers,
-/// because they have to be read back one by one to reduce memory usage.
This
-/// causes extra disk reads and writes, and potentially execution failure.
+/// Spills sorted runs to disk.
///
-/// # Example
-/// Before sorting:
-/// batch1 -> buffer1
-/// batch2 -> buffer2
+/// Two strategies depending on available merge headroom:
///
-/// sorted_batch1 -> buffer1
-/// -> buffer2
-/// sorted_batch2 -> buffer1
-/// -> buffer2
+/// - **With headroom** (`merge_reservation > 0`): merge all runs into
+/// a single globally-sorted stream, then write to one spill file.
+/// Fewer spill files = lower fan-in for the final MultiLevelMerge.
///
-/// Then when spilling each batch, the writer has to write all referenced
buffers
-/// repeatedly.
-fn organize_stringview_arrays(
-globally_sorted_batches: &mut Vec,
-) -> Result<()> {
-let mut organized_batches =
Vec::with_capacity(globally_sorted_batches.len());
-
-for batch in globally_sorted_batches.drain(..) {
-let mut new_columns: Vec> =
-Vec::with_capacity(batch.num_columns());
+/// - **Without headroom** (`merge_reservation == 0`): spill each run
+/// as its own file. Avoids allocating merge cursor infrastructure
+/// when the pool has no room. MultiLevelMerge handles the higher
+/// fan-in with dynamic memory management.
+async fn spill_sorted_runs(&mut self) -> Result<()> {
+assert_or_internal_err!(
+s
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4262519681 More brainstorming with Claude: > ### Proposed fix > > Don't coalesce schemas that have variable-length non-view columns (StringArray, BinaryArray, LargeStringArray, LargeBinaryArray, DictionaryArray with string/binary values). For these schemas, set the coalesce target to `batch_size` instead of `sort_coalesce_target_rows`: > > - Full-size batches pass through uncoalesced and sort individually, same as `main` — no regression > - Small batches (from filters, joins, pushdown) coalesce up to `batch_size` only — some fan-in reduction without blowing cache > - StringView and fixed-width schemas keep the full 32K coalesce target and retain all speedups > > Fan-in reduction is the price we pay for StringArray schemas. We can't optimize the merge if the sort itself is 2.5x slower getting there. > > Comet does not support StringView — all string columns use StringArray, so this regression directly affects Comet workloads. > > ### Why check all columns, not just sort keys? > > `take` reorders all columns in the batch, not just sort keys. Wide schemas (e.g. small string key + 70KB string value) are especially vulnerable — coalescing to 32K means `take` scatter-gathers across ~2.2GB of value data. > > ### Alternatives considered > > **Smaller intermediate target (e.g. 16K):** Any coalescing beyond `batch_size` amplifies cache pressure for wide schemas proportionally. No clear sweet spot. > > **Convert StringArray → StringView before take:** The gather (conversion) is cheap and `take` on views is cheap, but the resulting views point to the original unsorted buffer. No memory freed until the entire buffer drops — ~2x memory bloat, wrong tradeoff for a spilling sort. > -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4261197731 Marking as draft while we explore performance trade-offs. I don't want to accidentally merge this yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4261195635 ## Benchmark Analysis: `externalsorter` vs `main` Ran `cargo bench --bench sort -p datafusion` on both branches. ### Improvements (numeric, StringView, single-column string) | Benchmark | externalsorter | main | Speedup | |:---|---:|---:|---:| | sort i64 100k | 2.40 ms | 3.88 ms | 1.61x | | sort f64 100k | 2.61 ms | 3.94 ms | 1.51x | | sort merge i64 1M | 26.6 ms | 38.5 ms | 1.45x | | sort i64 1M | 33.8 ms | 45.5 ms | 1.35x | | sort f64 1M | 35.8 ms | 46.4 ms | 1.30x | | sort utf8 high cardinality 1M | 59.7 ms | 74.0 ms | 1.24x | | sort utf8 view tuple 1M | 8.1 ms | 9.1 ms | 1.13x | ### Regressions (multi-column StringArray and Dictionary) | Benchmark | externalsorter | main | Slowdown | |:---|---:|---:|---:| | sort utf8 tuple 100k (3x StringArray) | 23.7 ms | 9.5 ms | 2.5x | | sort utf8 tuple 1M (3x StringArray) | 272 ms | 111 ms | 2.5x | | sort mixed dictionary tuple 1M (3x dict + i64) | 283 ms | 108 ms | 2.6x | | sort utf8 dictionary tuple 1M (3x dict) | 92.4 ms | 78.5 ms | 1.18x | ### Root cause The coalesce-then-sort pipeline does: coalesce (copy all column data) -> `lexsort_to_indices` -> `take` (random-access scatter-gather) -> chunk back to `batch_size`. With multiple StringArray or Dictionary columns at 32K rows, the `take` scatter-gathers across ~1.9 MB of string heap data, exceeding L2 cache. Schemas that **don't** regress help pinpoint the cause: - **Single StringArray** (e.g. `sort utf8 high cardinality 1M`, +1.24x): one ~640KB string buffer fits in L2 during `take` - **StringViewArray** (e.g. `sort utf8 view tuple 1M`, +1.13x): `take` copies fixed-size 16-byte view structs — no heap random access - **Fixed-width** (i64, f64): coalesce is memcpy, `take` is sequential — all cache-friendly ### Benchmark batch size The sort benchmark uses `BATCH_SIZE = 1024`, while DataFusion's default is 8192. This makes the coalesce expand 32x (1024 -> 32768) instead of 4x (8192 -> 32768), amplifying the copy cost. Small batches are a valid scenario though — filters, joins, and sources with pushdown can all produce sub-8192 batches. ### Proposed fix Arrow's `BatchCoalescer` has a `with_biggest_coalesce_batch_size(Some(limit))` option: batches larger than `limit` bypass coalescing and pass through directly. For schemas with non-view variable-length columns (StringArray, BinaryArray, DictionaryArray), set this to `batch_size`. This way: - Full-size batches (>= 8192 rows) pass through without string copying — sorted directly as individual runs - Small batches still coalesce to reduce fan-in - Fixed-width and StringView schemas keep the full 32K coalesce target unchanged -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4259841218 > Would be nice to see results in `sort` bench / `sort_tpch` etc. to see if we didnt have any regressions. I’ll try to run locally with the scaled up `sort` benchmark. We may want to consider scaling some of the scenarios independently if 1M is too big to apply across the board for the bot. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
Dandandan commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4257501222 Would be nice to see results in `sort` bench / `sort_tpch` etc. to see if we didnt have any regressions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3090863657
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -95,116 +93,76 @@ impl ExternalSorterMetrics {
///
/// # Algorithm
///
-/// 1. get a non-empty new batch from input
+/// Incoming batches are coalesced via [`BatchCoalescer`] to
+/// `sort_coalesce_target_rows` (default 32768) before sorting. This
+/// reduces merge fan-in by producing fewer, larger sorted runs.
///
-/// 2. check with the memory manager there is sufficient space to
-///buffer the batch in memory.
+/// Each coalesced batch is sorted immediately and stored as a
+/// pre-sorted run.
///
-/// 2.1 if memory is sufficient, buffer batch in memory, go to 1.
+/// 1. For each incoming batch:
+///- Reserve memory (2x batch size). If reservation fails, flush
+/// the coalescer, spill all sorted runs to disk, then retry.
+///- Push batch into the coalescer.
+///- If the coalescer reached its target: sort the coalesced batch
+/// and store as a new sorted run.
///
-/// 2.2 if no more memory is available, sort all buffered batches and
-/// spill to file. buffer the next batch in memory, go to 1.
-///
-/// 3. when input is exhausted, merge all in memory batches and spills
-///to get a total order.
+/// 2. When input is exhausted, merge all sorted runs (and any spill
+///files) to produce a total order.
///
/// # When data fits in available memory
///
-/// If there is sufficient memory, data is sorted in memory to produce the
output
+/// Sorted runs are merged in memory using a loser-tree k-way merge
+/// (via [`StreamingMergeBuilder`]).
///
/// ```text
-///┌─┐
-///│ 2 │
-///│ 3 │
-///│ 1 │─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
-///│ 4 │
-///│ 2 │ │
-///└─┘ ▼
-///┌─┐
-///│ 1 │ In memory
-///│ 4 │─ ─ ─ ─ ─ ─▶ sort/merge ─ ─ ─ ─ ─▶ total sorted output
-///│ 1 │
-///└─┘ ▲
-/// ...│
-///
-///┌─┐ │
-///│ 4 │
-///│ 3 │─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
-///└─┘
-///
-/// in_mem_batches
+/// ┌──┐ ┌┐ ┌──┐ ┌┐
+/// │ Incoming │▶│ Batch │▶│ Sort │▶│ Sorted Run │
+/// │ Batches │ │ Coalescer │ │ │ │ (in memory)│
+/// └──┘ └┘ └──┘ └─┬──┘
+/// │
+/// ┌──┘
+/// ▼
+///k-way merge (loser tree)
+/// │
+/// ▼
+///total sorted output
/// ```
///
/// # When data does not fit in available memory
///
-/// When memory is exhausted, data is first sorted and written to one
-/// or more spill files on disk:
-///
-/// ```text
-///┌─┐ .─.
-///│ 2 │ ( )
-///│ 3 │ │`─'│
-///│ 1 │─ ─ ─ ─ ─ ─ ─ │ ┌┐ │
-///│ 4 │ ││ │ 1 │░ │
-///│ 2 │ │ │... │░ │
-///└─┘ ▼│ │ 4 │░ ┌ ─ ─ │
-///┌─┐ │ └┘░1 │░ │
-///│ 1 │ In memory│ ░░ │░░ │
-///│ 4 │─ ─ ▶ sort/merge─ ─ ─ ─ ┼ ─ ─ ─ ─ ─▶ ... │░ │
-///│ 1 │ and write to file│ │░░ │
-///└─┘ │ 4 │░ │
-/// ... ▲│ └░─░─░░ │
-///││░░ │
-///┌─┐ │.─.│
-///│ 4 │ │( )
-///│ 3 │─ ─ ─ ─ ─ ─ ─ `─'
-///└─┘
-///
-/// in_mem_batches spills
-/// (file on disk in Arrow
-/// IPC format)
-/// ```
-///
-/// Once the input is completely read, the spill files are read and
-/// merged with any in memory batches to produce a single total sorted
-/// output:
+/// When memory is exhausted, sorted runs are spilled directly to disk
+/// (one spill file per run — no merge needed since runs are already
+/// sorted). `MultiLevelMergeBuilder` handles the final merge from disk
+/// with dynamic fan-in.
///
/// ```text
-/// .─.
-/// ( )
-/// │`─'│
-/// │ ┌┐ │
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
adriangbot commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4255675609 🤖 Criterion benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4255649635) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4255649635-1332-2jl79 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing externalsorter (01983025442464d834808a036564f709af58aaff) to 5c653be (merge-base) [diff](https://github.com/apache/datafusion/compare/5c653bee5da64003915f6dfeb3da15759b091a8d..01983025442464d834808a036564f709af58aaff) BENCH_NAME=sort BENCH_COMMAND=cargo bench --features=parquet --bench sort BENCH_FILTER= Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4255649635 run benchmark sort -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3088742930
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -95,116 +93,76 @@ impl ExternalSorterMetrics {
///
/// # Algorithm
///
-/// 1. get a non-empty new batch from input
+/// Incoming batches are coalesced via [`BatchCoalescer`] to
+/// `sort_coalesce_target_rows` (default 32768) before sorting. This
+/// reduces merge fan-in by producing fewer, larger sorted runs.
///
-/// 2. check with the memory manager there is sufficient space to
-///buffer the batch in memory.
+/// Each coalesced batch is sorted immediately and stored as a
+/// pre-sorted run.
///
-/// 2.1 if memory is sufficient, buffer batch in memory, go to 1.
+/// 1. For each incoming batch:
+///- Reserve memory (2x batch size). If reservation fails, flush
+/// the coalescer, spill all sorted runs to disk, then retry.
+///- Push batch into the coalescer.
+///- If the coalescer reached its target: sort the coalesced batch
+/// and store as a new sorted run.
///
-/// 2.2 if no more memory is available, sort all buffered batches and
-/// spill to file. buffer the next batch in memory, go to 1.
-///
-/// 3. when input is exhausted, merge all in memory batches and spills
-///to get a total order.
+/// 2. When input is exhausted, merge all sorted runs (and any spill
+///files) to produce a total order.
///
/// # When data fits in available memory
///
-/// If there is sufficient memory, data is sorted in memory to produce the
output
+/// Sorted runs are merged in memory using a loser-tree k-way merge
+/// (via [`StreamingMergeBuilder`]).
///
/// ```text
-///┌─┐
-///│ 2 │
-///│ 3 │
-///│ 1 │─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
-///│ 4 │
-///│ 2 │ │
-///└─┘ ▼
-///┌─┐
-///│ 1 │ In memory
-///│ 4 │─ ─ ─ ─ ─ ─▶ sort/merge ─ ─ ─ ─ ─▶ total sorted output
-///│ 1 │
-///└─┘ ▲
-/// ...│
-///
-///┌─┐ │
-///│ 4 │
-///│ 3 │─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
-///└─┘
-///
-/// in_mem_batches
+/// ┌──┐ ┌┐ ┌──┐ ┌┐
+/// │ Incoming │▶│ Batch │▶│ Sort │▶│ Sorted Run │
+/// │ Batches │ │ Coalescer │ │ │ │ (in memory)│
+/// └──┘ └┘ └──┘ └─┬──┘
+/// │
+/// ┌──┘
+/// ▼
+///k-way merge (loser tree)
+/// │
+/// ▼
+///total sorted output
/// ```
///
/// # When data does not fit in available memory
///
-/// When memory is exhausted, data is first sorted and written to one
-/// or more spill files on disk:
-///
-/// ```text
-///┌─┐ .─.
-///│ 2 │ ( )
-///│ 3 │ │`─'│
-///│ 1 │─ ─ ─ ─ ─ ─ ─ │ ┌┐ │
-///│ 4 │ ││ │ 1 │░ │
-///│ 2 │ │ │... │░ │
-///└─┘ ▼│ │ 4 │░ ┌ ─ ─ │
-///┌─┐ │ └┘░1 │░ │
-///│ 1 │ In memory│ ░░ │░░ │
-///│ 4 │─ ─ ▶ sort/merge─ ─ ─ ─ ┼ ─ ─ ─ ─ ─▶ ... │░ │
-///│ 1 │ and write to file│ │░░ │
-///└─┘ │ 4 │░ │
-/// ... ▲│ └░─░─░░ │
-///││░░ │
-///┌─┐ │.─.│
-///│ 4 │ │( )
-///│ 3 │─ ─ ─ ─ ─ ─ ─ `─'
-///└─┘
-///
-/// in_mem_batches spills
-/// (file on disk in Arrow
-/// IPC format)
-/// ```
-///
-/// Once the input is completely read, the spill files are read and
-/// merged with any in memory batches to produce a single total sorted
-/// output:
+/// When memory is exhausted, sorted runs are spilled directly to disk
+/// (one spill file per run — no merge needed since runs are already
+/// sorted). `MultiLevelMergeBuilder` handles the final merge from disk
+/// with dynamic fan-in.
///
/// ```text
-/// .─.
-/// ( )
-/// │`─'│
-/// │ ┌┐ │
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3088547619
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -323,56 +285,188 @@ impl ExternalSorter {
self.reserve_memory_for_batch_and_maybe_spill(&input)
.await?;
-self.in_mem_batches.push(input);
+let coalescer = self
+.coalescer
+.as_mut()
+.expect("coalescer must exist during insert phase");
+coalescer
+.push_batch(input)
+.map_err(|e| DataFusionError::ArrowError(Box::new(e), None))?;
+
+self.drain_completed_batches()?;
+
+Ok(())
+}
+
+/// Drains completed (full) batches from the coalescer, sorts each,
+/// and appends the sorted chunks to `sorted_runs`.
+fn drain_completed_batches(&mut self) -> Result<()> {
+// Collect completed batches first to avoid borrow conflict
+let mut completed = vec![];
+if let Some(coalescer) = self.coalescer.as_mut() {
+while let Some(batch) = coalescer.next_completed_batch() {
+completed.push(batch);
+}
+}
+for batch in &completed {
+self.sort_and_store_run(batch)?;
Review Comment:
Nice - I wonder if a part of the speedup comes from this (sorting batch
immediately (while in CPU cache) instead of waiting to the end.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3088547619
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -323,56 +285,188 @@ impl ExternalSorter {
self.reserve_memory_for_batch_and_maybe_spill(&input)
.await?;
-self.in_mem_batches.push(input);
+let coalescer = self
+.coalescer
+.as_mut()
+.expect("coalescer must exist during insert phase");
+coalescer
+.push_batch(input)
+.map_err(|e| DataFusionError::ArrowError(Box::new(e), None))?;
+
+self.drain_completed_batches()?;
+
+Ok(())
+}
+
+/// Drains completed (full) batches from the coalescer, sorts each,
+/// and appends the sorted chunks to `sorted_runs`.
+fn drain_completed_batches(&mut self) -> Result<()> {
+// Collect completed batches first to avoid borrow conflict
+let mut completed = vec![];
+if let Some(coalescer) = self.coalescer.as_mut() {
+while let Some(batch) = coalescer.next_completed_batch() {
+completed.push(batch);
+}
+}
+for batch in &completed {
+self.sort_and_store_run(batch)?;
Review Comment:
Nice - I wonder if a part of the speedup comes from this, sorting batch
immediately (while in CPU cache) instead of waiting to the end.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4254421073 I fear my sort benchmark scale up might have killed the benchmark bot :( -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3088501037
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -95,116 +93,76 @@ impl ExternalSorterMetrics {
///
/// # Algorithm
///
-/// 1. get a non-empty new batch from input
+/// Incoming batches are coalesced via [`BatchCoalescer`] to
+/// `sort_coalesce_target_rows` (default 32768) before sorting. This
+/// reduces merge fan-in by producing fewer, larger sorted runs.
///
-/// 2. check with the memory manager there is sufficient space to
-///buffer the batch in memory.
+/// Each coalesced batch is sorted immediately and stored as a
+/// pre-sorted run.
///
-/// 2.1 if memory is sufficient, buffer batch in memory, go to 1.
+/// 1. For each incoming batch:
+///- Reserve memory (2x batch size). If reservation fails, flush
+/// the coalescer, spill all sorted runs to disk, then retry.
+///- Push batch into the coalescer.
+///- If the coalescer reached its target: sort the coalesced batch
+/// and store as a new sorted run.
///
-/// 2.2 if no more memory is available, sort all buffered batches and
-/// spill to file. buffer the next batch in memory, go to 1.
-///
-/// 3. when input is exhausted, merge all in memory batches and spills
-///to get a total order.
+/// 2. When input is exhausted, merge all sorted runs (and any spill
+///files) to produce a total order.
///
/// # When data fits in available memory
///
-/// If there is sufficient memory, data is sorted in memory to produce the
output
+/// Sorted runs are merged in memory using a loser-tree k-way merge
+/// (via [`StreamingMergeBuilder`]).
///
/// ```text
-///┌─┐
-///│ 2 │
-///│ 3 │
-///│ 1 │─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
-///│ 4 │
-///│ 2 │ │
-///└─┘ ▼
-///┌─┐
-///│ 1 │ In memory
-///│ 4 │─ ─ ─ ─ ─ ─▶ sort/merge ─ ─ ─ ─ ─▶ total sorted output
-///│ 1 │
-///└─┘ ▲
-/// ...│
-///
-///┌─┐ │
-///│ 4 │
-///│ 3 │─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
-///└─┘
-///
-/// in_mem_batches
+/// ┌──┐ ┌┐ ┌──┐ ┌┐
+/// │ Incoming │▶│ Batch │▶│ Sort │▶│ Sorted Run │
+/// │ Batches │ │ Coalescer │ │ │ │ (in memory)│
+/// └──┘ └┘ └──┘ └─┬──┘
+/// │
+/// ┌──┘
+/// ▼
+///k-way merge (loser tree)
+/// │
+/// ▼
+///total sorted output
/// ```
///
/// # When data does not fit in available memory
///
-/// When memory is exhausted, data is first sorted and written to one
-/// or more spill files on disk:
-///
-/// ```text
-///┌─┐ .─.
-///│ 2 │ ( )
-///│ 3 │ │`─'│
-///│ 1 │─ ─ ─ ─ ─ ─ ─ │ ┌┐ │
-///│ 4 │ ││ │ 1 │░ │
-///│ 2 │ │ │... │░ │
-///└─┘ ▼│ │ 4 │░ ┌ ─ ─ │
-///┌─┐ │ └┘░1 │░ │
-///│ 1 │ In memory│ ░░ │░░ │
-///│ 4 │─ ─ ▶ sort/merge─ ─ ─ ─ ┼ ─ ─ ─ ─ ─▶ ... │░ │
-///│ 1 │ and write to file│ │░░ │
-///└─┘ │ 4 │░ │
-/// ... ▲│ └░─░─░░ │
-///││░░ │
-///┌─┐ │.─.│
-///│ 4 │ │( )
-///│ 3 │─ ─ ─ ─ ─ ─ ─ `─'
-///└─┘
-///
-/// in_mem_batches spills
-/// (file on disk in Arrow
-/// IPC format)
-/// ```
-///
-/// Once the input is completely read, the spill files are read and
-/// merged with any in memory batches to produce a single total sorted
-/// output:
+/// When memory is exhausted, sorted runs are spilled directly to disk
+/// (one spill file per run — no merge needed since runs are already
+/// sorted). `MultiLevelMergeBuilder` handles the final merge from disk
+/// with dynamic fan-in.
///
/// ```text
-/// .─.
-/// ( )
-/// │`─'│
-/// │ ┌┐ │
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3088501037
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -95,116 +93,76 @@ impl ExternalSorterMetrics {
///
/// # Algorithm
///
-/// 1. get a non-empty new batch from input
+/// Incoming batches are coalesced via [`BatchCoalescer`] to
+/// `sort_coalesce_target_rows` (default 32768) before sorting. This
+/// reduces merge fan-in by producing fewer, larger sorted runs.
///
-/// 2. check with the memory manager there is sufficient space to
-///buffer the batch in memory.
+/// Each coalesced batch is sorted immediately and stored as a
+/// pre-sorted run.
///
-/// 2.1 if memory is sufficient, buffer batch in memory, go to 1.
+/// 1. For each incoming batch:
+///- Reserve memory (2x batch size). If reservation fails, flush
+/// the coalescer, spill all sorted runs to disk, then retry.
+///- Push batch into the coalescer.
+///- If the coalescer reached its target: sort the coalesced batch
+/// and store as a new sorted run.
///
-/// 2.2 if no more memory is available, sort all buffered batches and
-/// spill to file. buffer the next batch in memory, go to 1.
-///
-/// 3. when input is exhausted, merge all in memory batches and spills
-///to get a total order.
+/// 2. When input is exhausted, merge all sorted runs (and any spill
+///files) to produce a total order.
///
/// # When data fits in available memory
///
-/// If there is sufficient memory, data is sorted in memory to produce the
output
+/// Sorted runs are merged in memory using a loser-tree k-way merge
+/// (via [`StreamingMergeBuilder`]).
///
/// ```text
-///┌─┐
-///│ 2 │
-///│ 3 │
-///│ 1 │─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
-///│ 4 │
-///│ 2 │ │
-///└─┘ ▼
-///┌─┐
-///│ 1 │ In memory
-///│ 4 │─ ─ ─ ─ ─ ─▶ sort/merge ─ ─ ─ ─ ─▶ total sorted output
-///│ 1 │
-///└─┘ ▲
-/// ...│
-///
-///┌─┐ │
-///│ 4 │
-///│ 3 │─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
-///└─┘
-///
-/// in_mem_batches
+/// ┌──┐ ┌┐ ┌──┐ ┌┐
+/// │ Incoming │▶│ Batch │▶│ Sort │▶│ Sorted Run │
+/// │ Batches │ │ Coalescer │ │ │ │ (in memory)│
+/// └──┘ └┘ └──┘ └─┬──┘
+/// │
+/// ┌──┘
+/// ▼
+///k-way merge (loser tree)
+/// │
+/// ▼
+///total sorted output
/// ```
///
/// # When data does not fit in available memory
///
-/// When memory is exhausted, data is first sorted and written to one
-/// or more spill files on disk:
-///
-/// ```text
-///┌─┐ .─.
-///│ 2 │ ( )
-///│ 3 │ │`─'│
-///│ 1 │─ ─ ─ ─ ─ ─ ─ │ ┌┐ │
-///│ 4 │ ││ │ 1 │░ │
-///│ 2 │ │ │... │░ │
-///└─┘ ▼│ │ 4 │░ ┌ ─ ─ │
-///┌─┐ │ └┘░1 │░ │
-///│ 1 │ In memory│ ░░ │░░ │
-///│ 4 │─ ─ ▶ sort/merge─ ─ ─ ─ ┼ ─ ─ ─ ─ ─▶ ... │░ │
-///│ 1 │ and write to file│ │░░ │
-///└─┘ │ 4 │░ │
-/// ... ▲│ └░─░─░░ │
-///││░░ │
-///┌─┐ │.─.│
-///│ 4 │ │( )
-///│ 3 │─ ─ ─ ─ ─ ─ ─ `─'
-///└─┘
-///
-/// in_mem_batches spills
-/// (file on disk in Arrow
-/// IPC format)
-/// ```
-///
-/// Once the input is completely read, the spill files are read and
-/// merged with any in memory batches to produce a single total sorted
-/// output:
+/// When memory is exhausted, sorted runs are spilled directly to disk
+/// (one spill file per run — no merge needed since runs are already
+/// sorted). `MultiLevelMergeBuilder` handles the final merge from disk
+/// with dynamic fan-in.
///
/// ```text
-/// .─.
-/// ( )
-/// │`─'│
-/// │ ┌┐ │
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
adriangbot commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4253363299 🤖 Criterion benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4253348811) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4253348811-1301-2rfg4 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing externalsorter (689dfd2ad59416ddb75f8cc691ae787271da7805) to d0692b8 (merge-base) [diff](https://github.com/apache/datafusion/compare/d0692b8ee999ed72cd2a8f7000884618adfc06ee..689dfd2ad59416ddb75f8cc691ae787271da7805) BENCH_NAME=sort BENCH_COMMAND=cargo bench --features=parquet --bench sort BENCH_FILTER= Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4253348811 run benchmark sort -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
adriangbot commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4247068843 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246955834) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Details ``` Comparing HEAD and externalsorter Benchmark tpcds_sf1.json ┏━━━┳┳┳━━━┓ ┃ Query ┃ HEAD ┃ externalsorter ┃Change ┃ ┡━━━╇╇╇━━━┩ │ QQuery 1 │ 49.74 / 50.95 ±0.80 / 52.26 ms │ 49.77 / 50.30 ±0.55 / 51.36 ms │ no change │ │ QQuery 2 │ 143.00 / 143.81 ±0.83 / 145.35 ms │ 138.66 / 139.17 ±0.36 / 139.58 ms │ no change │ │ QQuery 3 │ 162.98 / 163.26 ±0.21 / 163.53 ms │ 157.53 / 157.95 ±0.50 / 158.63 ms │ no change │ │ QQuery 4 │ 1612.56 / 1660.28 ±27.23 / 1688.23 ms │ 1600.19 / 1626.35 ±15.20 / 1645.67 ms │ no change │ │ QQuery 5 │ 269.80 / 279.75 ±5.67 / 286.05 ms │ 260.95 / 267.50 ±4.74 / 275.13 ms │ no change │ │ QQuery 6 │ 225.06 / 231.01 ±5.44 / 239.71 ms │ 213.47 / 216.62 ±3.33 / 220.82 ms │ +1.07x faster │ │ QQuery 7 │ 369.21 / 373.52 ±3.40 / 378.70 ms │ 362.94 / 364.15 ±1.10 / 365.71 ms │ no change │ │ QQuery 8 │ 176.29 / 180.13 ±2.51 / 183.75 ms │ 171.53 / 174.34 ±2.10 / 178.06 ms │ no change │ │ QQuery 9 │ 102.02 / 111.04 ±9.41 / 129.21 ms │ 100.99 / 103.19 ±2.14 / 106.76 ms │ +1.08x faster │ │ QQuery 10 │ 245.07 / 248.04 ±2.67 / 252.93 ms │ 233.82 / 238.41 ±2.95 / 242.19 ms │ no change │ │ QQuery 11 │ 808.76 / 824.14 ±10.53 / 836.10 ms │ 810.15 / 819.46 ±6.49 / 826.78 ms │ no change │ │ QQuery 12 │ 50.88 / 52.58 ±1.41 / 55.01 ms │ 50.86 / 51.56 ±0.71 / 52.81 ms │ no change │ │ QQuery 13 │ 399.54 / 401.90 ±2.09 / 405.30 ms │ 400.11 / 403.87 ±3.19 / 408.63 ms │ no change │ │ QQuery 14 │ 2028.96 / 2043.27 ±11.43 / 2062.97 ms │ 1902.55 / 1914.86 ±10.85 / 1932.72 ms │ +1.07x faster │ │ QQuery 15 │ 88.42 / 89.51 ±1.03 / 91.14 ms │ 87.27 / 87.73 ±0.45 / 88.54 ms │ no chang
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
adriangbot commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246979942 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246870829) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Details ``` Comparing HEAD and externalsorter Benchmark tpcds_sf1.json ┏━━━┳━━┳━━┳━━━┓ ┃ Query ┃ HEAD ┃ externalsorter ┃Change ┃ ┡━━━╇━━╇━━╇━━━┩ │ QQuery 1 │ 7.01 / 7.45 ±0.76 / 8.97 ms │ 6.86 / 7.33 ±0.76 / 8.86 ms │ no change │ │ QQuery 2 │143.89 / 145.42 ±1.02 / 146.77 ms │144.14 / 144.53 ±0.34 / 144.98 ms │ no change │ │ QQuery 3 │113.87 / 114.65 ±0.88 / 116.36 ms │112.84 / 113.44 ±0.33 / 113.81 ms │ no change │ │ QQuery 4 │1401.22 / 1414.73 ±10.96 / 1426.97 ms │1375.61 / 1415.42 ±34.40 / 1474.26 ms │ no change │ │ QQuery 5 │172.81 / 173.85 ±0.86 / 175.03 ms │171.04 / 173.32 ±1.69 / 176.20 ms │ no change │ │ QQuery 6 │ 849.72 / 867.23 ±21.92 / 910.12 ms │ 844.09 / 868.62 ±14.87 / 886.90 ms │ no change │ │ QQuery 7 │342.84 / 346.08 ±2.21 / 348.99 ms │340.64 / 346.49 ±3.18 / 349.42 ms │ no change │ │ QQuery 8 │114.66 / 118.26 ±2.55 / 121.27 ms │115.19 / 117.12 ±1.25 / 119.03 ms │ no change │ │ QQuery 9 │102.77 / 106.97 ±2.19 / 108.84 ms │101.33 / 104.25 ±2.34 / 107.23 ms │ no change │ │ QQuery 10 │106.87 / 108.20 ±1.11 / 109.99 ms │106.29 / 107.45 ±0.83 / 108.26 ms │ no change │ │ QQuery 11 │ 979.18 / 996.03 ±12.64 / 1016.91 ms │ 978.22 / 994.80 ±11.45 / 1005.19 ms │ no change │ │ QQuery 12 │ 45.78 / 46.90 ±0.77 / 47.81 ms │ 46.11 / 46.62 ±0.56 / 47.52 ms │ no change │ │ QQuery 13 │400.14 / 406.86 ±4.06 / 411.44 ms │399.86 / 401.85 ±1.71 / 404.03 ms │ no change │ │ QQuery 14 │ 1005.02 / 1010.73 ±4.14 / 1016.44 ms │ 1006.11 / 1017.87 ±5.97 / 1022.22 ms │ no change │ │ QQuery 15 │ 15.86 / 16.77 ±0.
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
adriangbot commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246972434 🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246955834) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4246955834-1261-fd4rq 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing externalsorter (32a688271e2c9cb67b47454a2bc72f3cb926cfee) to e6b32fe (merge-base) [diff](https://github.com/apache/datafusion/compare/e6b32fe66b4e8e313ff79a9c45585d1a4bc2a19e..32a688271e2c9cb67b47454a2bc72f3cb926cfee) using: tpcds Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246955834 run benchmark tpcds ``` env: PREFER_HASH_JOIN: false ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
adriangbot commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246886420 Benchmark for [this request](https://github.com/apache/datafusion/pull/21629#issuecomment-4246870829) failed. Last 20 lines of output: Click to expand ``` struct_query_sql substr substr_index substring sum to_char to_hex to_local_time to_time to_timestamp topk_aggregate topk_repartition translate trim trunc unhex upper uuid window_query_sql with_hashes ``` --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
adriangbot commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246885961 🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246870829) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4246870829-1256-zqlsf 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing externalsorter (32a688271e2c9cb67b47454a2bc72f3cb926cfee) to e6b32fe (merge-base) [diff](https://github.com/apache/datafusion/compare/e6b32fe66b4e8e313ff79a9c45585d1a4bc2a19e..32a688271e2c9cb67b47454a2bc72f3cb926cfee) using: tpcds Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
adriangbot commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246885904 🤖 Criterion benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246870829) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4246870829-1257-bwm8p 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing externalsorter (32a688271e2c9cb67b47454a2bc72f3cb926cfee) to e6b32fe (merge-base) [diff](https://github.com/apache/datafusion/compare/e6b32fe66b4e8e313ff79a9c45585d1a4bc2a19e..32a688271e2c9cb67b47454a2bc72f3cb926cfee) BENCH_NAME=tpch_sort BENCH_COMMAND=cargo bench --features=parquet --bench tpch_sort BENCH_FILTER= Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
Dandandan commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246870829 run benchmark tpcds tpch_sort -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
adriangbot commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246806188 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246734700) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Details ``` Comparing HEAD and externalsorter Benchmark tpch_sf10.json ┏━━━┳┳┳━━┓ ┃ Query ┃ HEAD ┃ externalsorter ┃ Change ┃ ┡━━━╇╇╇━━┩ │ QQuery 1 │ 371.99 / 376.96 ±2.69 / 379.67 ms │ 373.11 / 374.56 ±0.99 / 376.21 ms │no change │ │ QQuery 2 │ 136.84 / 139.64 ±2.41 / 142.60 ms │ 137.37 / 141.53 ±3.83 / 148.61 ms │no change │ │ QQuery 3 │ 290.81 / 297.53 ±3.84 / 301.08 ms │ 294.21 / 297.24 ±2.49 / 301.82 ms │no change │ │ QQuery 4 │ 157.06 / 158.25 ±1.04 / 160.18 ms │ 163.17 / 163.98 ±0.65 / 164.88 ms │no change │ │ QQuery 5 │ 428.84 / 434.96 ±5.27 / 444.66 ms │ 451.75 / 457.04 ±4.95 / 464.64 ms │ 1.05x slower │ │ QQuery 6 │ 132.66 / 134.17 ±1.23 / 135.89 ms │ 133.50 / 134.61 ±0.86 / 135.48 ms │no change │ │ QQuery 7 │ 563.02 / 587.09 ±15.45 / 606.59 ms │ 587.86 / 594.62 ±4.33 / 599.57 ms │no change │ │ QQuery 8 │ 483.62 / 492.86 ±7.21 / 503.79 ms │ 478.84 / 491.04 ±9.87 / 506.50 ms │no change │ │ QQuery 9 │ 707.85 / 714.91 ±6.70 / 724.34 ms │ 669.57 / 691.14 ±14.47 / 708.41 ms │no change │ │ QQuery 10 │ 333.70 / 347.25 ±7.06 / 354.05 ms │ 340.97 / 354.39 ±10.48 / 368.41 ms │no change │ │ QQuery 11 │ 104.30 / 112.04 ±5.42 / 119.20 ms │ 110.99 / 115.85 ±3.85 / 122.81 ms │no change │ │ QQuery 12 │ 202.29 / 204.71 ±2.28 / 208.32 ms │ 208.39 / 209.76 ±1.29 / 212.04 ms │no change │ │ QQuery 13 │ 305.63 / 319.39 ±7.94 / 328.85 ms │ 309.99 / 315.73 ±6.53 / 327.63 ms │no change │ │ QQuery 14 │ 180.75 / 182.10 ±0.99 / 183.64 ms │ 181.13 / 182.52 ±1.44 / 185.24 ms │no change │ │ QQuery 15 │ 338.13 / 342.58 ±2.38 / 345.31 ms │ 331.27 / 333.51 ±2.16 / 337.52 ms │no change │ │ QQuery 16 │ 88.02 / 90.32 ±1.74 / 92.47 ms │ 84.29 / 86.72 ±2.30 / 89.96 ms │no change │ │ QQuery 17 │ 771.91 / 803.56 ±18.47 / 824.96 ms
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246757710 I also opened a new PR that scales our existing sort benchmark to larger numbers: https://github.com/apache/datafusion/pull/21630 In theory we should merge that first, then run the sort benchmark against `main` after that is in. In the meantime I can run that locally and update with results. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
adriangbot commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246749034 🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246734700) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4246734700-1255-84ftb 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing externalsorter (32a688271e2c9cb67b47454a2bc72f3cb926cfee) to e6b32fe (merge-base) [diff](https://github.com/apache/datafusion/compare/e6b32fe66b4e8e313ff79a9c45585d1a4bc2a19e..32a688271e2c9cb67b47454a2bc72f3cb926cfee) using: tpch10 Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
Dandandan commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246734700 run benchmark tpch10 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
Dandandan commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246733187 Lets compare to hash join run benchmark tpch10 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
Dandandan commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246726760 Wooh, nice results -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
adriangbot commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246681907 🤖 Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246572576) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Details ``` Comparing HEAD and externalsorter Benchmark tpch_sf10.json ┏━━━┳┳┳━━━┓ ┃ Query ┃ HEAD ┃ externalsorter ┃Change ┃ ┡━━━╇╇╇━━━┩ │ QQuery 1 │ 369.64 / 376.08 ±5.23 / 382.99 ms │ 369.46 / 374.50 ±3.39 / 378.57 ms │ no change │ │ QQuery 2 │ 485.14 / 497.33 ±11.38 / 515.80 ms │ 438.26 / 446.15 ±6.06 / 456.09 ms │ +1.11x faster │ │ QQuery 3 │ 589.99 / 660.25 ±37.61 / 696.76 ms │ 514.23 / 531.74 ±9.61 / 542.05 ms │ +1.24x faster │ │ QQuery 4 │ 407.91 / 473.96 ±36.86 / 509.58 ms │ 334.86 / 341.61 ±6.01 / 352.42 ms │ +1.39x faster │ │ QQuery 5 │ 1118.92 / 1156.54 ±22.75 / 1181.28 ms │ 1025.31 / 1066.24 ±22.28 / 1090.89 ms │ +1.08x faster │ │ QQuery 6 │ 133.59 / 136.32 ±2.93 / 141.42 ms │ 134.30 / 140.19 ±6.63 / 152.36 ms │ no change │ │ QQuery 7 │ 1563.72 / 1581.82 ±16.09 / 1609.94 ms │ 1361.03 / 1383.17 ±20.21 / 1417.51 ms │ +1.14x faster │ │ QQuery 8 │ 1471.30 / 1899.23 ±344.46 / 2196.56 ms │ 1160.48 / 1254.40 ±138.94 / 1530.54 ms │ +1.51x faster │ │ QQuery 9 │ 2125.14 / 2373.70 ±187.56 / 2699.90 ms │ 1781.61 / 1916.18 ±81.59 / 2024.46 ms │ +1.24x faster │ │ QQuery 10 │ 544.27 / 550.30 ±3.86 / 555.77 ms │ 495.92 / 510.99 ±10.93 / 524.00 ms │ +1.08x faster │ │ QQuery 11 │ 455.29 / 469.03 ±11.12 / 484.05 ms │ 426.61 / 433.24 ±5.62 / 442.76 ms │ +1.08x faster │ │ QQuery 12 │ 295.00 / 301.36 ±4.84 / 308.51 ms │ 279.36 / 282.79 ±2.83 / 286.06 ms │ +1.07x faster │ │ QQuery 13 │ 370.67 / 377.16 ±7.15 / 391.07 ms │ 344.14 / 353.84 ±5.61 / 360.05 ms │ +1.07x faster │ │ QQuery 14 │ 193.64 / 199.80 ±5.22 / 207.45 ms │ 191.75 / 199.09 ±6.65 / 209.34 ms │ no change │ │ QQuery 15 │ 322.12 / 329.85 ±6.15 / 339.31 ms │ 321.21 / 329.66 ±7.24 / 342.24 ms │ no chang
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3081924897
##
datafusion/common/src/config.rs:
##
@@ -555,7 +555,17 @@ config_namespace! {
/// When sorting, below what size should data be concatenated
/// and sorted in a single RecordBatch rather than sorted in
/// batches and merged.
-pub sort_in_place_threshold_bytes: usize, default = 1024 * 1024
+///
+/// Deprecated: this option is no longer used. The sort pipeline
+/// now always coalesces batches before sorting. Use
+/// `sort_coalesce_target_rows` instead.
+pub sort_in_place_threshold_bytes: usize, warn =
"`sort_in_place_threshold_bytes` is deprecated and ignored. Use
`sort_coalesce_target_rows` instead.", default = 1024 * 1024
+
+/// Target number of rows to coalesce before sorting in ExternalSorter.
+///
+/// Larger values reduce merge fan-in by producing fewer, larger
+/// sorted runs.
+pub sort_coalesce_target_rows: usize, default = 32768
Review Comment:
Yeah I suspect we'd want to do a good sensitivity analysis on different
types and batch sizes for `lexsort_to_indices` (and eventually the radix sort
kernel). We might hit a point of diminishing returns/cache friendliness if our
coalesced batches get too large.
This design also first spills from the sorted runs, so holding more unsorted
rows in the coalescer may make it more likely for us to trigger spilling.
I'm definitely of the mind that we can and should tune this, but unclear
what even a reasonable default right now would be. In Comet where we run TPC-H
SF 1000, for example, I suspect we'll want longer sorted runs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
adriangbot commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246583949 🤖 Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/21629#issuecomment-4246572576) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4246572576-1254-cr9dg 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing externalsorter (32a688271e2c9cb67b47454a2bc72f3cb926cfee) to e6b32fe (merge-base) [diff](https://github.com/apache/datafusion/compare/e6b32fe66b4e8e313ff79a9c45585d1a4bc2a19e..32a688271e2c9cb67b47454a2bc72f3cb926cfee) using: tpch10 Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
mbutrovich commented on PR #21629: URL: https://github.com/apache/datafusion/pull/21629#issuecomment-4246572576 run benchmark tpch10 ``` env: PREFER_HASH_JOIN: false ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] perf: Coalesce batches before sorting in ExternalSorter to reduce merge fan-in [datafusion]
Dandandan commented on code in PR #21629:
URL: https://github.com/apache/datafusion/pull/21629#discussion_r3081762076
##
datafusion/common/src/config.rs:
##
@@ -555,7 +555,17 @@ config_namespace! {
/// When sorting, below what size should data be concatenated
/// and sorted in a single RecordBatch rather than sorted in
/// batches and merged.
-pub sort_in_place_threshold_bytes: usize, default = 1024 * 1024
+///
+/// Deprecated: this option is no longer used. The sort pipeline
+/// now always coalesces batches before sorting. Use
+/// `sort_coalesce_target_rows` instead.
+pub sort_in_place_threshold_bytes: usize, warn =
"`sort_in_place_threshold_bytes` is deprecated and ignored. Use
`sort_coalesce_target_rows` instead.", default = 1024 * 1024
+
+/// Target number of rows to coalesce before sorting in ExternalSorter.
+///
+/// Larger values reduce merge fan-in by producing fewer, larger
+/// sorted runs.
+pub sort_coalesce_target_rows: usize, default = 32768
Review Comment:
I wonder if we can make this somewhat adaptive: as we usually load
everything in memory, it seems for very large sets larger batches would be even
more favorable (e.g. use 10MiB "scratch space" for coalescing instead of 32KiB
rows would make sense if our data is 1GiB and perhaps be even faster?)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
