jecsand838 commented on PR #9462:
URL: https://github.com/apache/arrow-rs/pull/9462#issuecomment-3995541243
@mzabaluev @EmilyMatt
Thinking about the long-term public surface a bit more, I’m a little wary of
adding a second public builder stage / intermediate builder type in the async
API.
The reusable thing seems to be the parsed OCF header plus the discovered
`header_len` / offset. Specifically #9460 just needs to enable callers to
inspect the discovered writer schema before the reader is fully built.
Because of that, I’d lean toward exposing a small reusable `HeaderInfo`
value and keeping `ReaderBuilder` single-stage. Something roughly like:
```rust
#[derive(Clone)]
pub struct HeaderInfo(std::sync::Arc<HeaderInfoInner>);
struct HeaderInfoInner {
header: Header,
header_len: u64,
}
impl HeaderInfo {
/// load_async would get the logic currently in async
`ReaderBuilder::read_header()`
pub async fn load_async<R: AsyncFileReader>(
in: &mut R,
file_size: u64,
size_hint: Option<u64>,
) -> Result<Self, AvroError>;
/// load could replace the `read_header()` used by the sync
`ReaderBuilder`
pub fn load<R: Reader>(in: &mut R) -> Result<Self, AvroError>;
pub fn writer_schema(&self) -> Result<AvroSchema, AvroError>;
pub fn compression(&self) -> Result<Option<CompressionCodec>, AvroError>;
pub fn header_len(&self) -> u64 { self.0.header_len };
pub fn sync(&self) -> [u8; 16];
}
impl<R: AsyncFileReader> ReaderBuilder<R> {
pub fn with_header_info(self, header_info: HeaderInfo) -> Self;
}
```
and then usage along these lines:
```rust
let header_info = HeaderInfo::load_async(&mut input, file_size, None).await?;
// derive reader_schema from writer_schema
let reader_schema = make_reader_from_writer(header_info.writer_schema()?)?;
let reader = ReaderBuilder::new(input, file_size, batch_size)
.with_reader_schema(reader_schema)
.with_header_info(header_info) // Optional to prevent re-reading the
`Header`
.with_projection(vec![0, 2])
.try_build()
.await?;
```
That feels closer to the Parquet `ArrowReaderMetadata` pattern: load once,
inspect/reuse later, then hand the reusable metadata back into the builder.
It also seems like a narrower semver surface with the added benefit of
centrally organizing the crate's `Header` reading logic. If we later want to
parse a single `HeaderInfo` to read many ranges (like was touch on above), the
same artifact composes naturally:
```rust
// Assuming this header is 100% compatible with both `input_a` and
`input_b`, caller is responsible for verifying
let header_info = HeaderInfo::load_async(&mut input_a, size_a, None).await?;
let a = ReaderBuilder::new(input_a, size_a, batch_size)
.with_header_info(header_info.clone())
.try_build()
.await?;
let b = ReaderBuilder::new(input_b, size_b, batch_size)
.with_header_info(header_info)
.try_build()
.await?;
```
I’m not attached to the exact naming here (`HeaderInfo`, `ParsedHeader`,
`HeaderMetadata`, etc.), or necessarily this exact approach. My main concern is
just avoiding a second public builder if the durable information we want to
expose is essentially the reusable header metadata + the discovered offset.
Lastly, I'd consider implementing this behavior in the sync `ReaderBuilder`
as well to maintain parity. While it's definitely not relevant for the issue's
specific use-case, the general pattern of reading a file's writer schema to
inform a reader schema would be applicable to both readers.
Anyway, I'm definitely curious about what you all think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]