jecsand838 commented on issue #8928:
URL: https://github.com/apache/arrow-rs/issues/8928#issuecomment-3586108062
> Honestly I think my main interest is the cases where the ArrowSchema does not contain any metadata. If I provide a valid arrow schema, do an AvroSchema::try_from on it, and provide it as the read_schema, I expected the two to be compatible. Currently it crashes, which is my main issue^^
That makes sense and I 100% understand where you're coming from.
Right now `AvroSchema::try_from(&ArrowSchema)` can only work with whatever
information is actually present in the `ArrowSchema`. When there’s no Avro
metadata at all, the Arrow side won't contain:
1. The original Avro record type names, namespaces, etc. (i.e.
`ns2.record2`), or
2. Any indication that a given `Struct` used to be a named Avro record.
Because `AvroSchema::try_from` was originally written for the `Writer`, missing
`Schema` metadata was interpreted to mean there was no source Avro schema at
all, so the conversion is free to generate one from whatever information is
available.
So for something like your nested `f1` struct, `AvroSchema::try_from` has to
invent a legal Avro record name. The current implementation does that by using
the field name (`"f1"`) when `avro.name` / `avro.namespace` aren’t present.
That means the generated reader schema ends up with:
```text
writer: nested record name = "record2" (namespace "ns2")
reader: nested record name = "f1" (no namespace)
```
When you then feed that generated schema into the resolver as `read_schema`,
Avro resolution correctly fails with a record‑name mismatch. From Avro’s
perspective those are two different types, even though the shape matches.
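To make that concrete, the two nested schemas the resolver compares look roughly like this (illustrative JSON, not the exact generated output):

```text
writer (nested field):
{"name": "f1", "type": {"type": "record", "name": "record2", "namespace": "ns2", "fields": [...]}}

reader (generated from Arrow with no metadata):
{"name": "f1", "type": {"type": "record", "name": "f1", "fields": [...]}}
```

Per the Avro spec's schema-resolution rules, record names must match (directly or via aliases) before field-by-field matching even starts, so the resolver rejects this pair regardless of the field layout.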
Essentially there are two slightly different expectations here:
1. **What the code currently guarantees:**
From an Arrow schema with no/missing Avro metadata, `arrow-avro` will
generate a valid Avro schema that correctly describes the Arrow layout, but we
can't promise it will be compatible with a pre‑existing Avro writer schema that
used specific named record types / namespaces, etc.
2. **What you'd reasonably expect (a more streamlined behavior):**
If the ArrowSchema is structurally compatible with the writer schema, then
`AvroSchema::try_from` + `read_schema` should just work, even without metadata.
For schemas that don’t rely on Avro named types, those two expectations line
up and things work fine. For named records (like your `record2`) they diverge,
because Arrow simply doesn’t carry the needed naming information unless we
stash it in field metadata.
Practically speaking, the patterns available are:
1. If you already have the Avro writer schema JSON, reuse that as the
`read_schema` (or store it in `SCHEMA_METADATA_KEY` on the Arrow `Schema`),
**or**
2. When you do want to round‑trip via Arrow, attach `avro.name` /
`avro.namespace` on the relevant struct/dictionary fields so the generated
reader schema can match the writer’s names.
I totally agree this is not obvious from the API, and from your perspective
it just looks like you passed a perfectly valid Arrow schema and the resolution
crashed. So at a minimum we should:
* Document this limitation more clearly, and
* Consider whether we can support a more permissive resolution mode for
reader schemas derived from Arrow (i.e. fall back to structural matching when
the reader has no `avro.name`/`avro.namespace`), without breaking
spec‑compliant cases.
* Extend the public API to hide as much of this metadata-related
complexity as possible. Perhaps new logic to make it easy to derive a
resolution-compatible reader schema from a specific writer schema?
If you’re interested, we could track that as a follow‑up enhancement:
something along the lines of "support using pure Arrow schemas as reader
schemas without requiring Avro naming metadata", and then discuss what
trade‑offs / flags that would need.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]