jecsand838 commented on PR #9462:
URL: https://github.com/apache/arrow-rs/pull/9462#issuecomment-4014842806

   > > ```rust
   > > let header_info = HeaderInfo::load_async(&mut input, file_size, None).await?;
   > > 
   > > // derive reader_schema from writer_schema
   > > let reader_schema = make_reader_from_writer(header_info.writer_schema()?)?;
   > > 
   > > let reader = ReaderBuilder::new(input, file_size, batch_size)
   > >     .with_reader_schema(reader_schema)
   > >     .with_header_info(header_info) // Optional to prevent re-reading the `Header`
   > >     .with_projection(vec![0, 2])
   > >     .try_build()
   > >     .await?;
   > > ```
   > 
   > Wouldn't this have the same problem of allowing an outside source of truth 
on the writer schema, header length, etc.?
   
   @mzabaluev 
   
   Oh 100%. After thinking about it some more, I've come around to what 
@EmilyMatt mentioned. In my opinion, the responsibility for providing a correct 
`HeaderInfo` / `Header` can be shifted to the caller. Meanwhile, if the caller 
doesn't provide `HeaderInfo`, then `ReaderBuilder` could just read the header 
like it currently does.
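
   To make that fallback concrete, here is a rough synchronous sketch. 
Everything in it is a hypothetical stand-in rather than the actual arrow-avro 
API: the toy header format (one length byte followed by the schema bytes), the 
fields on `HeaderInfo`, and the `Reader` / `body` names are assumptions made 
only for illustration.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical stand-in for the proposed `HeaderInfo` (not the arrow-avro type).
#[derive(Debug, Clone, PartialEq)]
struct HeaderInfo {
    writer_schema: String,
    header_len: u64,
}

impl HeaderInfo {
    /// Toy format assumed for this sketch: one length byte, then the schema bytes.
    fn load<R: Read>(input: &mut R) -> io::Result<Self> {
        let mut len = [0u8; 1];
        input.read_exact(&mut len)?;
        let mut schema = vec![0u8; len[0] as usize];
        input.read_exact(&mut schema)?;
        let writer_schema = String::from_utf8(schema)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        Ok(Self { writer_schema, header_len: 1 + u64::from(len[0]) })
    }
}

/// Hypothetical builder: the header is optional input from the caller.
struct ReaderBuilder<R> {
    input: R,
    header_info: Option<HeaderInfo>,
}

/// Hypothetical reader, positioned just past the header once built.
struct Reader<R> {
    input: R,
    header: HeaderInfo,
}

impl<R: Read + Seek> ReaderBuilder<R> {
    fn new(input: R) -> Self {
        Self { input, header_info: None }
    }

    /// Optional: the caller vouches for this header, so it is not re-read.
    fn with_header_info(mut self, info: HeaderInfo) -> Self {
        self.header_info = Some(info);
        self
    }

    fn try_build(self) -> io::Result<Reader<R>> {
        let ReaderBuilder { mut input, header_info } = self;
        let header = match header_info {
            // Caller-supplied header: trusted as-is, including `header_len`.
            Some(info) => info,
            // No header supplied: read it from the start of the input.
            None => {
                input.seek(SeekFrom::Start(0))?;
                HeaderInfo::load(&mut input)?
            }
        };
        // Either way, continue from the first byte after the header.
        input.seek(SeekFrom::Start(header.header_len))?;
        Ok(Reader { input, header })
    }
}

impl<R: Read> Reader<R> {
    /// Reads everything after the header (stands in for decoding blocks).
    fn body(&mut self) -> io::Result<Vec<u8>> {
        let mut out = Vec::new();
        self.input.read_to_end(&mut out)?;
        Ok(out)
    }
}
```

In this sketch both paths end up at the same offset as long as the supplied 
`HeaderInfo` is accurate; a wrong `header_len` from the caller would silently 
misplace the reader, which is exactly the responsibility being shifted.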
   
   I think the overall advantage of the larger approach, though, stems from 
centralizing the OCF header logic in a reusable manner while keeping it loosely 
coupled to the readers. I can foresee other future use cases, such as a caller 
only wanting to check `HeaderInfo` to validate a file's writer schema and/or 
other metadata, that would benefit from this. It will also make it easier to 
maintain parity between both `Readers`.

