jecsand838 commented on code in PR #8006: URL: https://github.com/apache/arrow-rs/pull/8006#discussion_r2248715880
########## arrow-avro/src/schema.rs: ########## @@ -15,12 +15,27 @@ // specific language governing permissions and limitations // under the License. +use arrow_schema::ArrowError; use serde::{Deserialize, Serialize}; +use serde_json::{json, Value}; +use std::collections::hash_map::Entry; use std::collections::HashMap; +use strum_macros::AsRefStr; /// The metadata key used for storing the JSON encoded [`Schema`] pub const SCHEMA_METADATA_KEY: &str = "avro.schema"; +/// The Avro single‑object encoding “magic” bytes (`0xC3 0x01`) +pub const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01]; Review Comment: @scovich From the research I was doing it seems what prevents false positives is having the magic bytes **located at the start** of the message. Here's a twist to consider though, the `Decoder::decoder` method was developed to be generic enough to handle both: 1. OCF block encodings (blocks of Avro records without fingerprints, writer schema is determined from OCF header which has it's own 4 magic byte sequence) 2. Message single object encodings (an encoding of a single Avro record starting with the 2 byte magic + 8 byte fingerprint OR 1 byte magic + 4 byte id fingerprint if Confluent variant). I'm beginning to wonder if we'd benefit from having two decoder methods. For instance both OCF and the message format support schema evolution via writer and reader schemas. In message format the writer schema is determined via a fingerprint located on **each Avro message** that may or may not change between messages. Meanwhile in OCF format the writer schema is provided in the header and will remain static for the full file which in turn is a sequence of block encodings comprised of individual Avro records. My original plan was to use the `Reader` for OCF and the stand-alone `Decoder` for Messages (while being re-used in the `Reader`). I'm beginning to question that approach. A second alternative is to define a `DecodeMode` enum for the `Decoder` to used for identifying which decode flow is required. Either way I need to make some modifications based on this new research. I'll have something pushed up by this evening. Ty for bringing this up! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org