Re: [PR] feat: two-stage builder API for async Avro reader [arrow-rs]

via GitHub Tue, 03 Mar 2026 22:12:31 -0800


jecsand838 commented on PR #9462:
URL: https://github.com/apache/arrow-rs/pull/9462#issuecomment-3995541243


   @mzabaluev @EmilyMatt 
   
   Thinking about the long-term public surface a bit more, I’m a little wary of 
adding a second public builder stage / intermediate builder type in the async 
API.
   
   The reusable thing seems to be the parsed OCF header plus the discovered 
`header_len` / offset. Specifically #9460  just needs to enable callers to 
inspect the discovered writer schema before the reader is fully built.
   
   Because of that, I’d lean toward exposing a small reusable `HeaderInfo` 
value and keeping `ReaderBuilder` single-stage. Something roughly like:
   
   ```rust
   #[derive(Clone)]
   pub struct HeaderInfo(std::sync::Arc<HeaderInfoInner>);
   
   struct HeaderInfoInner {
       header: Header,
       header_len: u64,
   }
   
   impl HeaderInfo {
       /// load_async would get the logic currently in async 
`ReaderBuilder::read_header()`
       pub async fn load_async<R: AsyncFileReader>(
               in: &mut R,  
               file_size: u64, 
               size_hint: Option<u64>,
       ) -> Result<Self, AvroError>;
       /// load could replace the `read_header()` used by the sync 
`ReaderBuilder`
       pub fn load<R: Reader>(in: &mut R) -> Result<Self, AvroError>;    
       pub fn writer_schema(&self) -> Result<AvroSchema, AvroError>;
       pub fn compression(&self) -> Result<Option<CompressionCodec>, AvroError>;
       pub fn header_len(&self) -> u64 { self.0.header_len };
       pub fn sync(&self) -> [u8; 16];
   }
   
   impl<R: AsyncFileReader> ReaderBuilder<R> {
       pub fn with_header_info(self, header_info: HeaderInfo) -> Self;
   }
   ```
   
   and then usage along these lines:
   
   ```rust
   let header_info = HeaderInfo::load_async(&mut input, file_size, None).await?;
   
   // derive reader_schema from writer_schema
   let reader_schema = make_reader_from_writer(header_info.writer_schema()?)?;
   
   let reader = ReaderBuilder::new(input, file_size, batch_size)
       .with_reader_schema(reader_schema)
       .with_header_info(header_info) // Optional to prevent re-reading the 
`Header`
       .with_projection(vec![0, 2])
       .try_build()
       .await?;
   ```
   
   That feels closer to the Parquet `ArrowReaderMetadata` pattern: load once, 
inspect/reuse later, then hand the reusable metadata back into the builder.
   
   It also seems like a narrower semver surface with the added benefit of 
centrally organizing the crate's `Header` reading logic. If we later want to 
parse a single `HeaderInfo` to read many ranges (like was touch on above), the 
same artifact composes naturally:
   
   ```rust
   // Assuming this header is 100% compatible with both  `input_a` and 
`input_b`, caller is responsible for verifying
   let header_info = HeaderInfo::load_async(&mut input_a, size_a, None).await?; 
   
   let a = ReaderBuilder::new(input_a, size_a, batch_size)
       .with_header_info(header_info.clone())
       .try_build()
       .await?;
   
   let b = ReaderBuilder::new(input_b, size_b, batch_size)
       .with_header_info(header_info)
       .try_build()
       .await?;
   ```
   
   I’m not attached to the exact naming here (`HeaderInfo`, `ParsedHeader`, 
`HeaderMetadata`, etc.), or necessarily this exact approach. My main concern is 
just avoiding a second public builder if the durable information we want to 
expose is essentially the reusable header metadata + the discovered offset.
   
   Lastly, I'd consider implementing this behavior in the sync `ReaderBuilder` 
as well to maintain parity. While it's definitely not relevant for the issue's 
specific use-case, the general pattern of reading a file's writer schema to 
inform a reader schema would be applicable to both readers.
   
   Anyway, I'm definitely curious about what you all think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat: two-stage builder API for async Avro reader [arrow-rs]

Reply via email to