Shailesh-Kumar-Singh opened a new issue, #9499:
URL: https://github.com/apache/arrow-rs/issues/9499

   **Which part is this question about**
   Library API, specifically the interaction between ArrowRowGroupWriterFactory 
/ ArrowColumnChunk (sync, parallel encoding) and AsyncArrowWriter (async, 
sequential encoding).
   
   
   
   **Describe your question**
   We're building a high-throughput streaming k-way merge for sorted Parquet 
files. The write pipeline looks like:
   read (rayon decode + channel prefetch) → merge sort → parallel encode 
(rayon) → write to disk
   We want both parallel column encoding and async disk writes. Currently the 
API only allows picking one.
   **Path A: Parallel encode, sync write**
   
   
   ```
   let col_writers = rg_writer_factory.create_column_writers(rg_index)?;
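   // `pool` is assumed to be the shared rayon::ThreadPool; rayon has no free `install` function.
   let chunks: Vec<ArrowColumnChunk> = pool.install(|| {
       leaves_and_writers
           .into_par_iter()
           .map(|(leaf, mut col_writer)| {
               col_writer.write(&leaf)?;
               col_writer.close()
           })
           // Collect into a Result so the first encode error short-circuits.
           .collect::<Result<Vec<_>>>()
   })?;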
   
   // append_to_row_group requires sync SerializedFileWriter
   let mut rg = writer.next_row_group()?;
   for chunk in chunks {
       chunk.append_to_row_group(&mut rg)?;
   }
   rg.close()?;
   ```
   
   **Path B: Async write, sequential encode**
   
   
   ```
   let mut writer = AsyncArrowWriter::try_new(file, schema, Some(props))?;
   writer.write(&batch).await?;
   writer.close().await?;
   ```
   
   **The gap:** ArrowColumnChunk (the output of parallel encoding) can only be 
appended through sync SerializedFileWriter. There's no async equivalent.
   
   
   **Question:**
   Is there a way to combine parallel encoding with async writes?
   
   **Additional context**
   Both read (decode) and write (encode) use a shared rayon pool for 
parallelism. The only synchronous bottleneck is the actual disk write inside 
append_to_row_group.
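
To make the bottleneck concrete, here is a minimal, std-only sketch of one possible workaround. It is not an arrow-rs API. It assumes the sync writer accepts any `std::io::Write` sink, so the sink can be a channel-backed adapter that hands bytes to a background thread (or, in an async setup, a task) that performs the real disk write. `ChannelWriter`, `spawn_sink`, and `queue_depth` are hypothetical names introduced only for illustration.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical adapter: a `Write` impl that forwards every buffer to a
/// background thread instead of touching the disk itself.
pub struct ChannelWriter {
    tx: SyncSender<Vec<u8>>,
}

impl Write for ChannelWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Copy the bytes and hand them off; this blocks only when the
        // bounded queue is full, which gives simple backpressure.
        self.tx
            .send(buf.to_vec())
            .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "sink thread stopped"))?;
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        // The real flush happens on the sink thread after the channel drains.
        Ok(())
    }
}

/// Spawns a thread that drains the channel into `sink`. Once the
/// `ChannelWriter` is dropped, the thread flushes and returns the sink
/// together with the total number of bytes written.
pub fn spawn_sink<W: Write + Send + 'static>(
    mut sink: W,
    queue_depth: usize,
) -> (ChannelWriter, JoinHandle<io::Result<(W, u64)>>) {
    let (tx, rx) = sync_channel::<Vec<u8>>(queue_depth);
    let handle = thread::spawn(move || -> io::Result<(W, u64)> {
        let mut written = 0u64;
        // Single producer, single consumer: chunks arrive in write order.
        for chunk in rx {
            sink.write_all(&chunk)?;
            written += chunk.len() as u64;
        }
        sink.flush()?;
        Ok((sink, written))
    });
    (ChannelWriter { tx }, handle)
}
```

In this sketch the `ChannelWriter` would be passed where the file is passed today, so append_to_row_group only copies bytes into a bounded queue while the sink thread does the blocking I/O. Dropping the writer and then joining the handle surfaces any write error. The cost is one extra copy per buffer, and the approach is untested against the actual parquet writer.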
   
   

