jecsand838 opened a new pull request, #8006:
URL: https://github.com/apache/arrow-rs/pull/8006

   # Which issue does this PR close?
   
   - Part of https://github.com/apache/arrow-rs/issues/4886
   
   - Follow up to https://github.com/apache/arrow-rs/pull/7834
   
   # Rationale for this change
   
   Apache Avro's [single object encoding](https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding)
   prefixes every record with the two-byte marker `0xC3 0x01` followed by the 8-byte little-endian `Rabin`
   [schema fingerprint](https://avro.apache.org/docs/1.11.1/specification/#schema-fingerprints),
   so that readers can identify the correct writer schema without carrying the full definition in each message.
   The current `arrow-avro` implementation can read container files, but it cannot ingest these framed
   messages or handle streams where the writer schema changes over time.
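
The framing described above can be sketched as follows. This is a minimal illustration of the layout given in the Avro specification (marker, then 8-byte little-endian fingerprint, then the binary-encoded body); the names `MAGIC`, `PREFIX_LEN`, and `split_single_object` are hypothetical and are not the identifiers introduced by this PR.

```rust
/// Two-byte marker that opens every single-object encoded message (per the Avro spec).
/// Hypothetical name; the PR exposes its own `SINGLE_OBJECT_MAGIC`.
const MAGIC: [u8; 2] = [0xC3, 0x01];
/// Marker (2 bytes) plus the 8-byte little-endian fingerprint.
const PREFIX_LEN: usize = 10;

/// Splits a framed message into its schema fingerprint and the Avro binary body.
/// Returns `None` when the buffer is too short or does not start with the marker.
fn split_single_object(buf: &[u8]) -> Option<(u64, &[u8])> {
    if buf.len() < PREFIX_LEN || !buf.starts_with(&MAGIC) {
        return None;
    }
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&buf[2..PREFIX_LEN]);
    // The fingerprint is stored little-endian on the wire.
    Some((u64::from_le_bytes(raw), &buf[PREFIX_LEN..]))
}
```

The returned body borrows from the input buffer, so identifying the schema does not copy the payload.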
   
   The Avro specification recommends fingerprinting the
   [parsed canonical form of a schema](https://avro.apache.org/docs/1.11.1/specification/#parsing-canonical-form-for-schemas)
   with the 64-bit CRC-64-AVRO (Rabin) algorithm, and using that fingerprint to look up the `Schema`
   in a local schema store or registry.
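
For reference, the CRC-64-AVRO algorithm can be sketched in Rust as a direct port of the Java reference code in the Avro specification. The names `EMPTY`, `FP_TABLE`, and `fingerprint64` mirror that reference code, not this PR's implementation, and the input is assumed to already be the UTF-8 bytes of the schema's parsing canonical form.

```rust
/// Seed value from the Avro specification; also the fingerprint of empty input.
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// Builds the 256-entry lookup table at compile time.
const fn build_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut i = 0;
    while i < 256 {
        let mut fp = i as u64;
        let mut j = 0;
        while j < 8 {
            // Shift right by one bit; XOR in EMPTY when the dropped bit was set.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
            j += 1;
        }
        table[i] = fp;
        i += 1;
    }
    table
}

const FP_TABLE: [u64; 256] = build_table();

/// Computes the 64-bit Rabin fingerprint of `buf`.
/// `buf` is assumed to hold a schema already reduced to parsing canonical form.
fn fingerprint64(buf: &[u8]) -> u64 {
    let mut fp = EMPTY;
    for &b in buf {
        fp = (fp >> 8) ^ FP_TABLE[((fp ^ b as u64) & 0xff) as usize];
    }
    fp
}
```

Calling `fingerprint64(canonical.as_bytes())` on a canonical-form string yields the value that would be written little-endian after the `0xC3 0x01` marker.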
   
   This PR introduces **`SchemaStore`** and **fingerprinting** to enable:
   
   * **Zero-copy schema identification** for decoding streaming Avro messages
     published in single-object format (e.g. Kafka, Pulsar) into Arrow.
   * **Dynamic schema evolution** by laying the foundation to resolve
     writer/reader schema differences on the fly.

   **NOTE:** Schema resolution support in `Codec` and `RecordDecoder` is coming in the next PR.
   
   # What changes are included in this PR?
   
   | Area | Highlights |
   | ------------------- | ---------- |
   | **`schema.rs`** | *New* `Fingerprint`, `SchemaStore`, and `SINGLE_OBJECT_MAGIC`; canonical-form generator; Rabin fingerprint calculator; `compare_schemas` helper. |
   | **`reader/mod.rs`** | Decoder now detects the `C3 01` prefix, extracts the fingerprint, looks up the writer schema in a `SchemaStore`, and switches to an LRU-cached `RecordDecoder` without interrupting streaming; supports `static_store_mode` to skip the 2-byte peek for high-throughput fixed-schema pipelines. |
   | **`ReaderBuilder`** | New builder configuration methods: `.with_writer_schema_store`, `.with_active_fingerprint`, `.with_static_store_mode`, `.with_reader_schema`, `.with_max_decoder_cache_size`, with rigorous validation to prevent misconfiguration. |
   | **`codec.rs`** | Added `AvroFieldBuilder::with_reader_schema` and a stubbed `AvroField::resolve_from_writer_and_reader` entry point for full writer/reader schema resolution. |
   | **Unit tests** | New tests covering fingerprint generation, store registration/lookup, schema switching, unknown-fingerprint errors, and interaction with UTF8-view decoding. |
   | **Docs & Examples** | Extensive inline docs with examples on all new public methods / structs. |
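
The decoder flow summarized in the `reader/mod.rs` row (detect the prefix, extract the fingerprint, look up the writer schema, switch schemas) can be sketched roughly as below. This is an illustration only: `ToyStore`, `ToyDecoder`, `frame`, and `MAGIC` are hypothetical stand-ins rather than the PR's `SchemaStore` or `Decoder` APIs, the fingerprints are supplied by the caller, and the LRU cache of `RecordDecoder`s is left out.

```rust
use std::collections::HashMap;

const MAGIC: [u8; 2] = [0xC3, 0x01];

/// Hypothetical fingerprint-keyed store; not the `SchemaStore` API from this PR.
#[derive(Default)]
struct ToyStore {
    schemas: HashMap<u64, String>,
}

impl ToyStore {
    /// Registers a schema under a caller-supplied fingerprint.
    /// Registering the same fingerprint again keeps the first entry.
    fn register(&mut self, fingerprint: u64, schema_json: &str) {
        self.schemas
            .entry(fingerprint)
            .or_insert_with(|| schema_json.to_string());
    }

    fn lookup(&self, fingerprint: u64) -> Option<&str> {
        self.schemas.get(&fingerprint).map(String::as_str)
    }
}

/// Hypothetical decoder state: tracks the active writer schema and counts switches.
struct ToyDecoder {
    store: ToyStore,
    active: Option<u64>,
    switches: usize,
}

impl ToyDecoder {
    fn new(store: ToyStore) -> Self {
        Self { store, active: None, switches: 0 }
    }

    /// Checks the prefix, switches schema if the fingerprint changed,
    /// and returns the body that would be decoded with the active schema.
    fn accept<'a>(&mut self, msg: &'a [u8]) -> Result<&'a [u8], String> {
        if msg.len() < 10 || !msg.starts_with(&MAGIC) {
            return Err("missing single-object prefix".to_string());
        }
        let mut raw = [0u8; 8];
        raw.copy_from_slice(&msg[2..10]);
        let fingerprint = u64::from_le_bytes(raw);
        if self.active != Some(fingerprint) {
            if self.store.lookup(fingerprint).is_none() {
                return Err(format!("unknown fingerprint {fingerprint:#018x}"));
            }
            self.active = Some(fingerprint);
            self.switches += 1;
        }
        Ok(&msg[10..])
    }
}

/// Builds a framed message for experimentation.
fn frame(fingerprint: u64, body: &[u8]) -> Vec<u8> {
    let mut out = MAGIC.to_vec();
    out.extend_from_slice(&fingerprint.to_le_bytes());
    out.extend_from_slice(body);
    out
}
```

An unknown fingerprint leaves the active schema untouched in this sketch, so one bad message does not disturb decoding of later messages from a known writer.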
   
   ---
   
   # Are these changes tested?
   
   Yes.  New tests cover:
   
   1. **Fingerprinting** against the canonical examples from the Avro spec.
   2. **`SchemaStore` behavior**: deduplication, duplicate registration, and lookup.
   3. **Decoder fast-path** with `static_store_mode=true`, ensuring the prefix
      is treated as payload, the 2-byte peek is skipped, and no schema switch is
      attempted.
   
   # Are there any user-facing changes?
   
   N/A
   
   # Follow-Up PRs
   
   1. Implement schema resolution in `Codec` and `RecordDecoder`.
   2. Improve `arrow-avro` errors and add more benchmarks and examples to prepare for
      public release.
   

