jecsand838 commented on PR #9462: URL: https://github.com/apache/arrow-rs/pull/9462#issuecomment-4014842806
> > ```rust
> > let header_info = HeaderInfo::load_async(&mut input, file_size, None).await?;
> >
> > // derive reader_schema from writer_schema
> > let reader_schema = make_reader_from_writer(header_info.writer_schema()?)?;
> >
> > let reader = ReaderBuilder::new(input, file_size, batch_size)
> >     .with_reader_schema(reader_schema)
> >     .with_header_info(header_info) // Optional to prevent re-reading the `Header`
> >     .with_projection(vec![0, 2])
> >     .try_build()
> >     .await?;
> > ```
>
> Wouldn't this have the same problem of allowing an outside source of truth on the writer schema, header length, etc.?

@mzabaluev Oh 100%. After thinking about it some more, I began to align more with what @EmilyMatt mentioned. My opinion is that the responsibility for providing a correct `HeaderInfo` / `Header` can be shifted to the caller. If the caller doesn't provide `HeaderInfo`, then `ReaderBuilder` can just read the header as it currently does.

I think the overall advantage of the larger approach, though, stems from centralizing the OCF header logic in a reusable way while keeping it loosely coupled to the readers. I can foresee other use cases that would benefit from this, such as a caller only wanting to check `HeaderInfo` to validate a file's writer schema and/or other metadata. It will also make it easier to maintain parity between both `Readers`.
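To make the fallback behavior concrete, here is a minimal synchronous sketch of the "caller may supply the header, otherwise the builder reads it" idea. Everything in it is a hypothetical stand-in rather than the arrow-avro API: the `HeaderInfo`, `ReaderBuilder`, and `Reader` types are toy versions, and the "header" is assumed to be a single newline-terminated line instead of a real OCF header.

```rust
/// Hypothetical stand-in for the proposed `HeaderInfo` (not the arrow-avro type).
#[derive(Debug, Clone, PartialEq)]
pub struct HeaderInfo {
    writer_schema: String,
    header_len: usize,
}

impl HeaderInfo {
    /// Toy format (an assumption for illustration): the header is the
    /// first line of the input, terminated by `\n`.
    pub fn load(data: &[u8]) -> Result<Self, String> {
        let end = data
            .iter()
            .position(|&b| b == b'\n')
            .ok_or_else(|| "missing header terminator".to_string())?;
        let writer_schema = std::str::from_utf8(&data[..end])
            .map_err(|e| e.to_string())?
            .to_string();
        Ok(Self { writer_schema, header_len: end + 1 })
    }

    pub fn writer_schema(&self) -> &str {
        &self.writer_schema
    }

    pub fn header_len(&self) -> usize {
        self.header_len
    }
}

/// Hypothetical builder: the header is optional input.
pub struct ReaderBuilder<'a> {
    data: &'a [u8],
    header_info: Option<HeaderInfo>,
}

impl<'a> ReaderBuilder<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self { data, header_info: None }
    }

    /// The caller takes responsibility for the header being correct.
    pub fn with_header_info(mut self, header_info: HeaderInfo) -> Self {
        self.header_info = Some(header_info);
        self
    }

    pub fn try_build(self) -> Result<Reader<'a>, String> {
        // Use the caller-supplied header if present; otherwise read it here.
        let header = match self.header_info {
            Some(h) => h,
            None => HeaderInfo::load(self.data)?,
        };
        // A caller-supplied length is trusted, but still bounds-checked.
        let body = self
            .data
            .get(header.header_len..)
            .ok_or_else(|| "header length exceeds input".to_string())?;
        Ok(Reader { header, body })
    }
}

pub struct Reader<'a> {
    header: HeaderInfo,
    body: &'a [u8],
}

impl<'a> Reader<'a> {
    pub fn header(&self) -> &HeaderInfo {
        &self.header
    }

    pub fn body(&self) -> &'a [u8] {
        self.body
    }
}
```

In this sketch, `ReaderBuilder::new(data).try_build()` and `ReaderBuilder::new(data).with_header_info(HeaderInfo::load(data)?).try_build()` yield the same reader, and `HeaderInfo::load` can also be called on its own to inspect the writer schema without constructing a reader.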
