pchintar opened a new issue, #9835:
URL: https://github.com/apache/arrow-rs/issues/9835

   ## Description
   
   When writing IPC data using `StreamWriter` or `FileWriter`, the current 
implementation performs repeated heap allocations and full buffer copies for 
every record batch, even when writing batches with identical schema and 
structure.
   
   This adds unnecessary allocation and latency overhead, especially for 
high-frequency batch writes and streaming pipelines.
   
   ---
   
   ## Root Cause
   
   Currently in `arrow-ipc/src/writer.rs`, the writer path is structured as:
   
   ```text
   RecordBatch
     → encode() → EncodedData
     → write_message()
   ```
   
   The key issue is that `EncodedData` owns its buffers:
   
   ```rust
   pub struct EncodedData {
       pub ipc_message: Vec<u8>,
       pub arrow_data: Vec<u8>,
   }
   ```
   
   This forces:
   
   * allocation of new buffers per batch
   * copying of flatbuffer data into `Vec<u8>`
   * destruction of all intermediate buffers after each write
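   The ownership issue can be illustrated with a stdlib-only sketch. `EncodedData` mirrors the struct above; `encode_owned` and `encode_into` are hypothetical stand-ins, not the actual arrow-ipc functions. Returning owned `Vec<u8>`s forces a fresh allocation per call, whereas encoding into caller-supplied scratch buffers lets `clear()` retain capacity across batches:

   ```rust
   // Mirrors the struct from arrow-ipc; the functions below are illustrative.
   pub struct EncodedData {
       pub ipc_message: Vec<u8>,
       pub arrow_data: Vec<u8>,
   }

   // Current shape: every call allocates and copies into owned buffers.
   fn encode_owned(payload: &[u8]) -> EncodedData {
       EncodedData {
           ipc_message: payload.to_vec(), // fresh allocation + full copy per batch
           arrow_data: payload.to_vec(),  // another allocation + copy
       }
   }

   // Alternative shape: encode into caller-owned scratch buffers.
   fn encode_into(payload: &[u8], ipc_message: &mut Vec<u8>, arrow_data: &mut Vec<u8>) {
       ipc_message.clear(); // keeps capacity, so no realloc once warmed up
       arrow_data.clear();
       ipc_message.extend_from_slice(payload);
       arrow_data.extend_from_slice(payload);
   }

   fn main() {
       let batch = vec![0u8; 1024];

       let owned = encode_owned(&batch);
       assert_eq!(owned.ipc_message.len(), 1024);

       let mut ipc = Vec::new();
       let mut data = Vec::new();
       encode_into(&batch, &mut ipc, &mut data);
       let cap = ipc.capacity();
       encode_into(&batch, &mut ipc, &mut data);
       // Capacity is retained across calls: no reallocation on the second batch.
       assert_eq!(ipc.capacity(), cap);
   }
   ```

   The same-size second call hits the retained capacity and performs no heap work, which is the behavior the proposal below aims for.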
   
   ---
   
   ## Current Behavior
   
   For every batch, the following occurs:
   
   ```text
   1. Build FlatBuffer (fbb)
   2. Copy it → ipc_message.to_vec() (Full Copy)
   3. Allocate arrow_data Vec      
   4. Allocate metadata vectors  
   5. Return EncodedData (owned)
   6. write_message() writes data
   7. All buffers dropped
   ```
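   The per-batch cycle above can be sketched as a single function (stdlib only; `FlatBufferBuilder`, `EncodedData`, and the writing step are simplified stand-ins for the real flatbuffers/arrow-ipc types, not their actual APIs):

   ```rust
   use std::io::Write;

   // Illustrative stand-ins for the real types.
   struct FlatBufferBuilder { buf: Vec<u8> }
   struct EncodedData { ipc_message: Vec<u8>, arrow_data: Vec<u8> }

   fn write_batch<W: Write>(sink: &mut W, batch_bytes: &[u8]) -> std::io::Result<()> {
       // 1. Build the FlatBuffer message (fresh allocation).
       let fbb = FlatBufferBuilder { buf: batch_bytes.to_vec() };
       // 2.-4. Copy the finished flatbuffer and body into owned Vecs (full copies).
       let encoded = EncodedData {
           ipc_message: fbb.buf.clone(),
           arrow_data: batch_bytes.to_vec(),
       };
       // 5.-6. Return the owned EncodedData and write it out.
       sink.write_all(&encoded.ipc_message)?;
       sink.write_all(&encoded.arrow_data)?;
       Ok(())
       // 7. fbb and encoded drop here; every intermediate buffer is freed.
   }

   fn main() {
       let mut sink: Vec<u8> = Vec::new();
       write_batch(&mut sink, b"abcd").unwrap();
       // Message bytes plus body bytes reach the sink; all scratch is gone.
       assert_eq!(sink.len(), 8);
   }
   ```

   Every iteration repeats steps 1-4 from scratch, which is the churn described in the implications below.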
   
   ### Implications
   
   * repeated heap allocations
   * repeated memory growth/reallocation
   * full flatbuffer copy per batch
   * memory churn (alloc → free → alloc)
   
   ---
   
   ## Proposed Solution
   
   For repeated batch writes, the writer should ideally:
   
   ```text
   1. Reuse FlatBufferBuilder
   2. Reuse arrow_data buffer
   3. Reuse metadata vectors
   4. Avoid copying flatbuffer data
   5. Write directly from existing buffers
   ```
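   A minimal stdlib-only sketch of that reuse pattern, assuming a hypothetical `ReusingWriter` (not the actual arrow-ipc writer): the writer owns its scratch buffers and clears them between batches instead of reallocating, so steady-state writes touch the allocator not at all.

   ```rust
   use std::io::Write;

   // Hypothetical writer that keeps its scratch buffers across batches.
   struct ReusingWriter<W: Write> {
       sink: W,
       ipc_message: Vec<u8>, // reused flatbuffer scratch
       arrow_data: Vec<u8>,  // reused body scratch
   }

   impl<W: Write> ReusingWriter<W> {
       fn new(sink: W) -> Self {
           Self { sink, ipc_message: Vec::new(), arrow_data: Vec::new() }
       }

       fn write_batch(&mut self, batch_bytes: &[u8]) -> std::io::Result<()> {
           // clear() keeps capacity, so same-shaped batches allocate nothing.
           self.ipc_message.clear();
           self.arrow_data.clear();
           self.ipc_message.extend_from_slice(batch_bytes);
           self.arrow_data.extend_from_slice(batch_bytes);
           // Write directly from the retained buffers; no owned EncodedData.
           self.sink.write_all(&self.ipc_message)?;
           self.sink.write_all(&self.arrow_data)
       }
   }

   fn main() {
       let mut w = ReusingWriter::new(Vec::new());
       w.write_batch(b"batch-1").unwrap();
       let cap = w.ipc_message.capacity();
       w.write_batch(b"batch-2").unwrap();
       // Scratch capacity survives across batches: no reallocation.
       assert_eq!(w.ipc_message.capacity(), cap);
       // Both batches reached the sink (2 batches x 14 bytes each).
       assert_eq!(w.sink.len(), 28);
   }
   ```

   In the real writer the same idea would apply to the `FlatBufferBuilder` (which supports being reset) and the metadata vectors, amortizing all per-batch allocations after the first write.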

