pchintar commented on PR #9836:
URL: https://github.com/apache/arrow-rs/pull/9836#issuecomment-4330216954

   Hi @alamb,
   
   So, I took a closer look at the `ipc_writer` benchmark & zstd path and the 
main cost seems to come from repeated calls to:
   
   ```rust
   compress_to_vec(buffer, ...)
   ```
   
   Right now the flow is strictly serial:
   
   ```text
   write_array_data
     → write_buffer
         → compress_to_vec (zstd)
   ```
   
   i.e.
   
   ```text
   buffer1 → compress → write
   buffer2 → compress → write
   ...
   ```
   
   Since buffers are independent, I’m considering restructuring this to:
   
   ```text
   collect buffers → compress in parallel → write in order
   ```
   
   Conceptually:
   
   ```text
   [buffer1, buffer2, buffer3]
           ↓
   parallel compress
           ↓
   append results (same order)
   ```
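   
   The ordering part of this idea can be sketched with scoped threads from the 
standard library. This is not arrow-ipc's actual codec API; `compress` below is a 
stand-in placeholder transform so the ordering logic is self-contained:
   
    ```rust
    use std::thread;

    // Stand-in for the real codec call; the "compression" here is just a
    // placeholder transform so the example runs without arrow-ipc.
    fn compress(buf: &[u8]) -> Vec<u8> {
        buf.iter().rev().copied().collect()
    }

    /// Compress each buffer on its own scoped thread and return the
    /// results in the original buffer order.
    fn compress_all(buffers: &[Vec<u8>]) -> Vec<Vec<u8>> {
        thread::scope(|s| {
            let handles: Vec<_> = buffers
                .iter()
                .map(|b| s.spawn(move || compress(b)))
                .collect();
            // Joining in spawn order preserves the original ordering,
            // regardless of which thread finishes first.
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        })
    }
    ```
   
   Because the handles are joined in spawn order, the output vector matches the 
input order even when a later buffer finishes compressing first.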
   
   This would keep the IPC format and per-buffer compression behavior unchanged, 
so there is no output-size tradeoff; only the scheduling of the compression work 
changes.
   
   Implementation-wise, something like:
   
    ```rust
    use std::thread;

    // Cap the worker count at 4 to bound resource usage on large machines.
    let parallelism = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
        .min(4);
    ```
   
   Then process bounded chunks:
   
   ```rust
   for chunk in pending_buffers.chunks(parallelism) {
       let compressed = compress_chunk_in_parallel(chunk)?;
       append_in_original_order(compressed)?;
   }
   ```
   
   where each worker owns its compression context:
   
   ```rust
   let mut ctx = CompressionContext::default();
   codec.compress_to_vec(buffer.as_slice(), &mut out, &mut ctx)?;
   ```
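   
   Putting the chunking and per-worker state together, here is a hedged, 
self-contained sketch of the bounded loop. `Ctx` and `compress_with_ctx` are 
stand-ins for the real compression context and codec call, whose actual arrow-ipc 
signatures may differ:
   
    ```rust
    use std::thread;

    // Stand-in for the codec's reusable per-worker compression context.
    #[derive(Default)]
    struct Ctx {
        scratch: Vec<u8>,
    }

    // Stand-in compress call that reuses the worker-owned scratch buffer.
    fn compress_with_ctx(ctx: &mut Ctx, buf: &[u8]) -> Vec<u8> {
        ctx.scratch.clear();
        ctx.scratch.extend(buf.iter().rev());
        ctx.scratch.clone()
    }

    /// Process buffers in bounded chunks: at most `parallelism` threads
    /// are live at once, each owning its own context, and results are
    /// appended in the original order.
    fn compress_bounded(buffers: &[Vec<u8>], parallelism: usize) -> Vec<Vec<u8>> {
        let mut out = Vec::with_capacity(buffers.len());
        for chunk in buffers.chunks(parallelism.max(1)) {
            let compressed: Vec<Vec<u8>> = thread::scope(|s| {
                let handles: Vec<_> = chunk
                    .iter()
                    .map(|b| {
                        s.spawn(move || {
                            let mut ctx = Ctx::default();
                            compress_with_ctx(&mut ctx, b)
                        })
                    })
                    .collect();
                handles.into_iter().map(|h| h.join().unwrap()).collect()
            });
            out.extend(compressed);
        }
        out
    }
    ```
   
   The chunk boundary acts as a natural backpressure point: memory for compressed 
output is bounded by the chunk size rather than the whole batch.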
   
   Would this kind of bounded per-batch parallelism be acceptable in 
`arrow-ipc`, or would it introduce any new hidden costs?
   
   Thanks!
