yjshen commented on code in PR #2132:
URL: https://github.com/apache/arrow-datafusion/pull/2132#discussion_r841342087


##########
datafusion/core/src/physical_plan/sorts/sort.rs:
##########
@@ -105,13 +107,21 @@ impl ExternalSorter {
         }
     }
 
-    async fn insert_batch(&self, input: RecordBatch) -> Result<()> {
+    async fn insert_batch(
+        &self,
+        input: RecordBatch,
+        tracking_metrics: &MemTrackingMetrics,
+    ) -> Result<()> {
         if input.num_rows() > 0 {
             let size = batch_byte_size(&input);
             self.try_grow(size).await?;
             self.metrics.mem_used().add(size);
             let mut in_mem_batches = self.in_mem_batches.lock().await;
-            in_mem_batches.push(input);
+            // NB timer records time taken on drop, so there are no
+            // calls to `timer.done()` below.
+            let _timer = tracking_metrics.elapsed_compute().timer();
+            let partial = sort_batch(input, self.schema.clone(), &self.expr)?;

Review Comment:
   Performance would deteriorate significantly without this change:
   
   ```
   Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet", file_format: "parquet", mem_table: false, output_path: None }
   Query 1 iteration 0 took 4619.9 ms and returned 6001214 rows
   Query 1 iteration 1 took 4561.0 ms and returned 6001214 rows
   Query 1 iteration 2 took 4527.7 ms and returned 6001214 rows
   ```
   The main reason, I think, is random memory access while constructing the output batches. Without this per-batch sort, collecting cells from unsorted batches makes the memory access pattern fully random. With the per-batch sort, we read each column of each batch linearly, which results in a much more predictable access pattern and makes better use of the CPU cache.
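   A minimal sketch of the two access patterns (plain Rust with hypothetical helper names, not the actual DataFusion/Arrow code): gathering cells through a sort permutation jumps around the source batch, while pre-sorting the batch once makes all later reads sequential.

   ```rust
   // Hypothetical helpers illustrating the access-pattern argument above.

   /// Gather `values[i]` for each index in `indices` (one "take" per cell).
   /// When `indices` is a sort permutation of an unsorted batch, these
   /// reads land at effectively random offsets.
   fn gather(values: &[i64], indices: &[usize]) -> Vec<i64> {
       indices.iter().map(|&i| values[i]).collect()
   }

   /// Without a per-batch sort: the globally sorted order visits the
   /// batch via a permutation that jumps around in memory.
   fn sort_indices(batch: &[i64]) -> Vec<usize> {
       let mut idx: Vec<usize> = (0..batch.len()).collect();
       idx.sort_by_key(|&i| batch[i]); // permutation, not sequential
       idx
   }

   /// With a per-batch sort: the batch is rewritten in sorted order once,
   /// so the merge phase reads it front-to-back (indices 0, 1, 2, ...).
   fn presort(batch: &mut Vec<i64>) {
       batch.sort_unstable();
   }
   ```

   Both paths produce the same sorted values; the difference is only in where the random access happens (spread across every output row vs. paid once per batch up front).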
   
   I think the perf counters confirm the above speculation:
   ```
   sudo perf stat -a -e cache-misses,cache-references,l3_cache_accesses,l3_misses,dTLB-load-misses,dTLB-loads target/release/tpch benchmark datafusion --iterations 3 --path /home/yijie/sort_test/tpch-parquet --format parquet --query 1 --batch-size 4096
   ```
   
   Without this per-batch sort:
   ```
    Performance counter stats for 'system wide':

        1,340,359,889      cache-misses              #   35.817 % of all cache refs
        3,742,289,458      cache-references
        1,984,089,839      l3_cache_accesses
          540,429,658      l3_misses
          303,508,234      dTLB-load-misses          #   49.51% of all dTLB cache accesses
          613,048,439      dTLB-loads

         14.222309739 seconds time elapsed
   ```
   
   With this per-batch sort:
   
   ```
    Performance counter stats for 'system wide':

        1,059,913,512      cache-misses              #   30.715 % of all cache refs
        3,450,839,405      cache-references
        1,388,975,765      l3_cache_accesses
          235,570,805      l3_misses
          239,390,511      dTLB-load-misses          #   51.36% of all dTLB cache accesses
          466,141,655      dTLB-loads

          8.675278258 seconds time elapsed
   ```
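   For reference, a quick computation of the relative improvements implied by the counters quoted above (roughly: l3_misses down ~56%, cache-misses down ~21%, wall time down ~39%):

   ```rust
   // Percentage drop from `before` to `after`, applied to the perf
   // counters quoted in this comment.
   fn pct_drop(before: f64, after: f64) -> f64 {
       (before - after) / before * 100.0
   }
   ```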
   


