jecsand838 commented on code in PR #8316:
URL: https://github.com/apache/arrow-rs/pull/8316#discussion_r2345242356


##########
arrow-avro/src/reader/mod.rs:
##########
@@ -17,49 +17,86 @@
 
 //! Avro reader
 //!
-//! This module provides facilities to read Apache Avro-encoded files or streams
-//! into Arrow's `RecordBatch` format. In particular, it introduces:
+//! Facilities to read Apache Avro–encoded data into Arrow's `RecordBatch` format.
 //!
-//! * `ReaderBuilder`: Configures Avro reading, e.g., batch size
-//! * `Reader`: Yields `RecordBatch` values, implementing `Iterator`
-//! * `Decoder`: A low-level push-based decoder for Avro records
+//! This module exposes three layers of the API surface, from highest to lowest level:
 //!
-//! # Basic Usage
+//! * `ReaderBuilder`: configures how Avro is read (batch size, strict union handling,
+//!   string representation, reader schema, etc.) and produces either:
+//!   * a `Reader` for **Avro Object Container Files (OCF)** read from any `BufRead`, or
+//!   * a low-level `Decoder` for **single‑object encoded** Avro bytes and Confluent
+//!     **Schema Registry** framed messages.
+//! * `Reader`: a convenient, synchronous iterator over `RecordBatch` decoded from an OCF
+//!   input. Implements [`Iterator<Item = Result<RecordBatch, ArrowError>>`] and
+//!   `RecordBatchReader`.
+//! * `Decoder`: a push‑based row decoder that consumes raw Avro bytes and yields ready
+//!   `RecordBatch` values when batches fill. This is suitable for integrating with async
+//!   byte streams, network protocols, or other custom data sources.
 //!
-//! `Reader` can be used directly with synchronous data sources, such as [`std::fs::File`].
+//! ## Encodings and when to use which type
 //!
-//! ## Reading a Single Batch
+//! * **Object Container File (OCF)**: A self‑describing file format with a header containing
+//!   the writer schema, optional compression codec, and a sync marker, followed by one or
+//!   more data blocks. Use `Reader` for this format. See the Avro specification for the
+//!   structure of OCF headers and blocks. <https://avro.apache.org/docs/1.11.1/specification/>
+//! * **Single‑Object Encoding**: A stream‑friendly framing that prefixes each record body with
+//!   the 2‑byte magic `0xC3 0x01` followed by a schema fingerprint. Use `Decoder` with a
+//!   populated `SchemaStore` to resolve fingerprints to full
+//!   schemas. <https://avro.apache.org/docs/1.11.1/specification/>
+//! * **Confluent Schema Registry wire format**: A 1‑byte magic `0x00`, a 4‑byte big‑endian
+//!   schema ID, then the Avro‑encoded body. Use `Decoder` with a
+//!   `SchemaStore` configured for `FingerprintAlgorithm::None`
+//!   and entries keyed by `Fingerprint::Id`. Confluent docs
+//!   describe this framing.
+//!
+//! ## Basic file usage (OCF)
+//!
+//! Use `ReaderBuilder::build` to construct a `Reader` from any `BufRead`, such as a
+//! `BufReader<File>`. The reader yields `RecordBatch` values you can iterate over or collect.
+//!
+//! ```no_run

Review Comment:
   That's a solid call out. I'll get those changes up over the weekend.
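
   For anyone following along, the OCF example under that `no_run` fence will look roughly like the sketch below. It is untested; `ReaderBuilder::new()` and the exact error plumbing are assumptions on my part until the final docs land, but `build` taking a `BufRead` and the iterator of `Result<RecordBatch, ArrowError>` match what the new module docs describe.

   ```rust
   use std::fs::File;
   use std::io::BufReader;

   use arrow_avro::reader::ReaderBuilder;

   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Hypothetical file path; ReaderBuilder::new() is assumed here.
       let file = File::open("example.avro")?;
       let reader = ReaderBuilder::new().build(BufReader::new(file))?;
       // `Reader` implements Iterator<Item = Result<RecordBatch, ArrowError>>,
       // so each iteration yields one decoded batch.
       for batch in reader {
           let batch = batch?;
           println!("decoded {} rows", batch.num_rows());
       }
       Ok(())
   }
   ```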


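   Since the Confluent framing also came up in the docs, here is a tiny standalone illustration of that wire format. It uses plain Rust only, no arrow-avro API, and `split_confluent_frame` is just an illustrative helper name, not something in the crate.

   ```rust
   /// Splits a Confluent Schema Registry framed message into its schema id and
   /// Avro-encoded body: 1-byte magic `0x00`, 4-byte big-endian schema id, body.
   fn split_confluent_frame(msg: &[u8]) -> Option<(u32, &[u8])> {
       if msg.len() < 5 || msg[0] != 0x00 {
           return None;
       }
       let id = u32::from_be_bytes([msg[1], msg[2], msg[3], msg[4]]);
       Some((id, &msg[5..]))
   }

   fn main() {
       // 0x00 magic, schema id 7, then a (fake) Avro-encoded body.
       let framed = [0x00, 0, 0, 0, 7, 0x02, 0x41];
       let (schema_id, body) = split_confluent_frame(&framed).expect("valid frame");
       assert_eq!(schema_id, 7);
       assert_eq!(body, &[0x02, 0x41]);
       // In the reader, the schema id would key a `SchemaStore` entry
       // (`Fingerprint::Id`) so the push-based `Decoder` can resolve the
       // writer schema before decoding `body`.
   }
   ```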
