yjshen commented on code in PR #2132:
URL: https://github.com/apache/arrow-datafusion/pull/2132#discussion_r841342087
##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
```diff
@@ -105,13 +107,21 @@ impl ExternalSorter {
         }
     }

-    async fn insert_batch(&self, input: RecordBatch) -> Result<()> {
+    async fn insert_batch(
+        &self,
+        input: RecordBatch,
+        tracking_metrics: &MemTrackingMetrics,
+    ) -> Result<()> {
         if input.num_rows() > 0 {
             let size = batch_byte_size(&input);
             self.try_grow(size).await?;
             self.metrics.mem_used().add(size);
             let mut in_mem_batches = self.in_mem_batches.lock().await;
-            in_mem_batches.push(input);
+            // NB timer records time taken on drop, so there are no
+            // calls to `timer.done()` below.
+            let _timer = tracking_metrics.elapsed_compute().timer();
+            let partial = sort_batch(input, self.schema.clone(), &self.expr)?;
```

Review Comment:
   Performance would deteriorate significantly without this change:
   ```
   Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet", file_format: "parquet", mem_table: false, output_path: None }
   Query 1 iteration 0 took 4619.9 ms and returned 6001214 rows
   Query 1 iteration 1 took 4561.0 ms and returned 6001214 rows
   Query 1 iteration 2 took 4527.7 ms and returned 6001214 rows
   ```
   The main cause, I believe, is random memory access while constructing output batches. Without this per-batch sort, collecting cells from unsorted batches makes memory access fully random. With the per-batch sort, each column in each batch is read linearly, which results in a much more predictable memory access pattern and benefits the CPU cache.
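   The access-pattern argument can be sketched in plain Rust (std only, hypothetical names; the real code sorts Arrow `RecordBatch`es via `sort_batch` and merges them downstream). Once every batch is sorted, the final output can be produced by a k-way merge that walks each batch front to back, so each source is read sequentially instead of being gathered at random offsets:

   ```rust
   use std::cmp::Reverse;
   use std::collections::BinaryHeap;

   /// Sort each batch in place, then k-way merge them with a min-heap.
   /// Each batch is consumed front-to-back, so per-batch reads are sequential.
   fn sort_and_merge(mut batches: Vec<Vec<u32>>) -> Vec<u32> {
       for b in batches.iter_mut() {
           b.sort_unstable(); // analogue of the per-batch `sort_batch` call
       }
       // Heap of (next value, batch index, offset within batch).
       let mut heap = BinaryHeap::new();
       for (i, b) in batches.iter().enumerate() {
           if let Some(&v) = b.first() {
               heap.push(Reverse((v, i, 0usize)));
           }
       }
       let mut out = Vec::new();
       while let Some(Reverse((v, batch, pos))) = heap.pop() {
           out.push(v);
           if let Some(&next) = batches[batch].get(pos + 1) {
               heap.push(Reverse((next, batch, pos + 1)));
           }
       }
       out
   }

   fn main() {
       let batches = vec![vec![5, 1, 9], vec![4, 2, 8], vec![7, 3, 6]];
       let merged = sort_and_merge(batches);
       assert_eq!(merged, (1..=9).collect::<Vec<u32>>());
   }
   ```

   Without the up-front per-batch sort, producing the same output would require jumping to arbitrary rows across all retained batches, which is the randomized access pattern the perf counters below reflect.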
   I think the perf counters confirm the above speculation:
   ```
   sudo perf stat -a -e cache-misses,cache-references,l3_cache_accesses,l3_misses,dTLB-load-misses,dTLB-loads target/release/tpch benchmark datafusion --iterations 3 --path /home/yijie/sort_test/tpch-parquet --format parquet --query 1 --batch-size 4096
   ```
   Without this per-batch sort:
   ```
   Performance counter stats for 'system wide':

     1,340,359,889      cache-misses              #   35.817 % of all cache refs
     3,742,289,458      cache-references
     1,984,089,839      l3_cache_accesses
       540,429,658      l3_misses
       303,508,234      dTLB-load-misses          #   49.51% of all dTLB cache accesses
       613,048,439      dTLB-loads

      14.222309739 seconds time elapsed
   ```
   With this per-batch sort:
   ```
   Performance counter stats for 'system wide':

     1,059,913,512      cache-misses              #   30.715 % of all cache refs
     3,450,839,405      cache-references
     1,388,975,765      l3_cache_accesses
       235,570,805      l3_misses
       239,390,511      dTLB-load-misses          #   51.36% of all dTLB cache accesses
       466,141,655      dTLB-loads

       8.675278258 seconds time elapsed
   ```