pchintar opened a new issue, #9947:
URL: https://github.com/apache/arrow-rs/issues/9947

   ## Description
   
   Currently, `BatchCoalescer::push_batch_with_filter` materializes a filtered 
`RecordBatch` before coalescing it into output batches.
   
   This introduces unnecessary intermediate array allocations and duplicate 
value copies during filtered coalescing, especially for numeric and timestamp 
columns.
   
   ---
   
   ## Root Cause
   
   In `arrow-select/src/coalesce.rs`, filtered coalescing is structured as:
   
   ```text
   RecordBatch
     → filter_record_batch()
     → temporary filtered RecordBatch
     → push_batch()
     → coalesced output
   ```
   
   The current implementation is:
   
   ```rust
   pub fn push_batch_with_filter(
       &mut self,
       batch: RecordBatch,
       filter: &BooleanArray,
   ) -> Result<(), ArrowError> {
       let filtered_batch = filter_record_batch(&batch, filter)?;
       self.push_batch(filtered_batch)
   }
   ```
   
   This means selected values can be copied twice:
   
   ```text
   1. filter_record_batch() copies selected values into temporary filtered arrays
   2. push_batch() copies those values again into the coalescer output buffers
   ```
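
The double copy can be illustrated with a toy model. This is only a sketch: it uses a plain `Vec<i64>` and a `&[bool]` mask in place of Arrow arrays and `BooleanArray`, and the helper names (`filter_values`, `push_values`, `push_with_filter_two_step`) are hypothetical stand-ins, not functions from `arrow-select`.

```rust
/// Toy stand-in for `filter_record_batch`: copies the selected values into a
/// freshly allocated temporary vector (copy #1).
fn filter_values(values: &[i64], mask: &[bool]) -> Vec<i64> {
    assert_eq!(values.len(), mask.len(), "mask must match values length");
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}

/// Toy stand-in for `push_batch`: copies the already-filtered values into the
/// coalescer's output buffer (copy #2).
fn push_values(out: &mut Vec<i64>, filtered: &[i64]) {
    out.extend_from_slice(filtered);
}

/// Two-step path modelled on the structure described above.
/// Returns the number of element copies performed.
fn push_with_filter_two_step(out: &mut Vec<i64>, values: &[i64], mask: &[bool]) -> usize {
    let tmp = filter_values(values, mask); // temporary allocation, dropped on return
    push_values(out, &tmp);
    2 * tmp.len() // every selected value was copied twice
}
```

In this model each selected value is written once into `tmp` and once into `out`, and `tmp` is allocated and freed on every call.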
   
   ---
   
   ## Current Behavior
   
   For filtered batches:
   
   ```text
   1. Allocate temporary filtered arrays
   2. Copy selected values into the temporary arrays
   3. Build a temporary filtered RecordBatch from those arrays
   4. Copy the selected values again into the coalescer buffers
   5. Drop the temporary arrays and RecordBatch
   ```
   
   ### Implications
   
   * unnecessary temporary array allocations
   * duplicate value copies
   * extra null bitmap materialization
   * additional allocator and memory overhead
   * increased latency in filtered coalescing workloads
   
   ---
   
   ## Proposed Solution
   
   Filtered batch coalescing should ideally avoid materializing temporary 
filtered arrays for numeric and timestamp columns.
   
   Instead, selected values could be appended directly into the coalescer 
output buffers during filtered coalescing.
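
A minimal sketch of the direct-append idea, again as a toy model and not the actual `arrow-select` implementation: `OutputColumn` and `push_filtered_direct` are hypothetical names, values are a plain `&[i64]`, and validity and the filter are `&[bool]` slices instead of Arrow bitmaps. Null entries in the filter itself are not modelled.

```rust
/// Hypothetical in-progress output column: values plus one validity flag per slot.
#[derive(Debug, Default, PartialEq)]
struct OutputColumn {
    values: Vec<i64>,
    validity: Vec<bool>,
}

/// Appends the rows selected by `mask` straight into `out`, without building a
/// temporary filtered column first. Returns the number of rows appended.
fn push_filtered_direct(
    out: &mut OutputColumn,
    values: &[i64],
    validity: &[bool],
    mask: &[bool],
) -> usize {
    assert_eq!(values.len(), mask.len(), "mask must match values length");
    assert_eq!(validity.len(), mask.len(), "validity must match values length");

    // Count selected rows up front so each output buffer grows at most once.
    let selected = mask.iter().filter(|keep| **keep).count();
    out.values.reserve(selected);
    out.validity.reserve(selected);

    // Single pass: each selected value and validity flag is copied exactly once.
    for ((v, valid), keep) in values.iter().zip(validity).zip(mask) {
        if *keep {
            out.values.push(*v);
            out.validity.push(*valid);
        }
    }
    selected
}
```

Under these assumptions the selected values and their validity flags are written once, and no intermediate array or `RecordBatch` is allocated. A real implementation would also need to decide which column types take this path and which fall back to the existing filter-then-push route.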

