jecsand838 opened a new pull request, #8006: URL: https://github.com/apache/arrow-rs/pull/8006
# Which issue does this PR close?

- Part of https://github.com/apache/arrow-rs/issues/4886
- Follow up to https://github.com/apache/arrow-rs/pull/7834

# Rationale for this change

Apache Avro’s [single object encoding](https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding) prefixes every record with the two‑byte marker `0xC3 0x01` followed by a `Rabin` [schema fingerprint](https://avro.apache.org/docs/1.11.1/specification/#schema-fingerprints) so that readers can identify the correct writer schema without carrying the full definition in each message.

While the current `arrow-avro` implementation can read container files, it cannot ingest these framed messages or handle streams where the writer schema changes over time.

The Avro specification recommends computing a 64‑bit CRC‑64‑AVRO (Rabin) fingerprint of the [Parsing Canonical Form of a schema](https://avro.apache.org/docs/1.11.1/specification/#parsing-canonical-form-for-schemas) and using it to look up the `Schema` in a local schema store or registry.

This PR introduces **`SchemaStore`** and **fingerprinting** to enable:

* **Zero‑copy schema identification** for decoding streaming Avro messages published in single‑object format (e.g. Kafka, Pulsar) into Arrow.
* **Dynamic schema evolution** by laying the foundation to resolve writer/reader schema differences on the fly.

**NOTE:** Schema Resolution support in `Codec` and `RecordDecoder` is coming in the next PR.

# What changes are included in this PR?

| Area | Highlights |
| ------------------- | ---------- |
| **`schema.rs`** | *New* `Fingerprint`, `SchemaStore`, and `SINGLE_OBJECT_MAGIC`; canonical‑form generator; Rabin fingerprint calculator; `compare_schemas` helper. |
| **`reader/mod.rs`** | Decoder now detects the `C3 01` prefix, extracts the fingerprint, looks up the writer schema in a `SchemaStore`, and switches to an LRU‑cached `RecordDecoder` without interrupting streaming; supports `static_store_mode` to skip the 2‑byte peek for high‑throughput fixed‑schema pipelines. |
| **`ReaderBuilder`** | New builder configuration methods: `.with_writer_schema_store`, `.with_active_fingerprint`, `.with_static_store_mode`, `.with_reader_schema`, `.with_max_decoder_cache_size`, with validation that rejects inconsistent combinations. |
| **`codec.rs`** | Added `AvroFieldBuilder::with_reader_schema` and a stubbed `AvroField::resolve_from_writer_and_reader` entry point for full writer/reader schema resolution. |
| **Unit tests** | New tests covering fingerprint generation, store registration/lookup, schema switching, unknown‑fingerprint errors, and interaction with UTF8‑view decoding. |
| **Docs & Examples** | Extensive inline docs with examples on all new public methods / structs. |

---

# Are these changes tested?

Yes. New tests cover:

1. **Fingerprinting** against the canonical examples from the Avro spec.
2. **`SchemaStore` behavior**: deduplication, duplicate registration, and lookup.
3. **Decoder fast‑path** with `static_store_mode=true`, ensuring the prefix is treated as payload, the 2‑byte peek is skipped, and no schema switch is attempted.

# Are there any user-facing changes?

N/A

# Follow-Up PRs

1. Implement Schema Resolution functionality in `Codec` and `RecordDecoder`.
2. Improve `arrow-avro` errors and add more benchmarks and examples to prepare for public release.
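
For reviewers less familiar with the framing described in the rationale, the following is a minimal standalone sketch of the two mechanisms this PR builds on: the CRC‑64‑AVRO (Rabin) fingerprint as given in the Avro specification, and splitting a single‑object message into its fingerprint and body. It is not the code in this PR and does not use the `arrow-avro` API; the function names `rabin_fingerprint`, `frame_single_object`, and `split_single_object` are hypothetical, and the constant only mirrors the `SINGLE_OBJECT_MAGIC` name mentioned above.

```rust
/// Marker that starts every single-object encoded message (per the Avro spec).
const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Fingerprint of the empty byte string, also used as the CRC polynomial
/// (the `EMPTY` constant from the Avro specification).
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// Builds the 256-entry lookup table used by CRC-64-AVRO.
fn fp_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            // Shift right one bit; XOR in the polynomial if the dropped bit was 1.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *slot = fp;
    }
    table
}

/// Computes the 64-bit Rabin fingerprint of `bytes`.
/// For schema lookup, `bytes` should be the schema's Parsing Canonical Form.
fn rabin_fingerprint(bytes: &[u8]) -> u64 {
    let table = fp_table();
    let mut fp = EMPTY;
    for &b in bytes {
        fp = (fp >> 8) ^ table[((fp ^ b as u64) & 0xff) as usize];
    }
    fp
}

/// Hypothetical helper: wraps an Avro binary body in the single-object frame
/// (2-byte marker, 8-byte little-endian fingerprint, then the body).
fn frame_single_object(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(10 + body.len());
    out.extend_from_slice(&SINGLE_OBJECT_MAGIC);
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(body);
    out
}

/// Hypothetical helper: returns `(fingerprint, body)` if `msg` carries the
/// single-object prefix, or `None` if it is too short or lacks the marker.
/// The body is borrowed from `msg`, so no bytes are copied.
fn split_single_object(msg: &[u8]) -> Option<(u64, &[u8])> {
    if msg.len() < 10 || !msg.starts_with(&SINGLE_OBJECT_MAGIC) {
        return None;
    }
    let mut fp_bytes = [0u8; 8];
    fp_bytes.copy_from_slice(&msg[2..10]);
    Some((u64::from_le_bytes(fp_bytes), &msg[10..]))
}
```

A reader would use the fingerprint returned by `split_single_object` as the key into its schema store and hand the remaining bytes to the decoder for that writer schema. The table is rebuilt on every call here for brevity; a real implementation would presumably compute it once.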