jecsand838 commented on issue #8928:
URL: https://github.com/apache/arrow-rs/issues/8928#issuecomment-3586108062
> Honestly I think my main interest is the cases where the ArrowSchema does not contain any metadata. If I provide a valid arrow schema, do an AvroSchema::try_from on it, and provide it as the read_schema, I expected the two to be compatible. Currently it crashes, which is my main issue^^
That makes sense and I 100% understand where you're coming from.
Right now `AvroSchema::try_from(&ArrowSchema)` can only work with whatever
information is actually present in the `ArrowSchema`. When there’s no Avro
metadata at all, the Arrow side won't contain:
1. The original Avro record type names, namespaces, etc. (i.e.
`ns2.record2`), or
2. Any indication that a given `Struct` used to be a named Avro record.
Because `AvroSchema::try_from` was originally written for the `Writer`, missing
`Schema` metadata was interpreted to mean there was no source Avro schema at
all, so the conversion is free to generate one from whatever information is
available.
So for something like your nested `f1` struct, `AvroSchema::try_from` has to
invent a legal Avro record name. The current implementation does that by using
the field name (`"f1"`) when `avro.name` / `avro.namespace` aren’t present.
That means the generated reader schema ends up with:
```text
writer: nested record name = "record2" (namespace "ns2")
reader: nested record name = "f1" (no namespace)
```
When you then feed that generated schema into the resolver as `read_schema`,
Avro resolution correctly fails with a record‑name mismatch. From Avro’s
perspective those are two different types, even though the shape matches.
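To make that concrete, the two nested schemas the resolver compares look roughly like this (illustrative JSON, not the exact generated output):

```text
writer (nested field):
{"name": "f1", "type": {"type": "record", "name": "record2", "namespace": "ns2", "fields": [...]}}

reader (generated from Arrow with no metadata):
{"name": "f1", "type": {"type": "record", "name": "f1", "fields": [...]}}
```

Per the Avro spec's schema-resolution rules, record names must match (directly or via aliases) before field-by-field matching even starts, so the resolver rejects this pair regardless of the field layout.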
Essentially there are two slightly different expectations here:
1. **What the code currently guarantees:**
From an Arrow schema with no/missing Avro metadata, `arrow-avro` will
generate a valid Avro schema that correctly describes the Arrow layout, but we
can't promise it will be compatible with a pre‑existing Avro writer schema that
used specific named record types / namespaces, etc.
2. **What you'd reasonably expect (a more streamlined behavior):**
If the ArrowSchema is structurally compatible with the writer schema, then
`AvroSchema::try_from` + `read_schema` should just work, even without metadata.
For schemas that don’t rely on Avro named types, those two expectations line
up and things work fine. For named records (like your `record2`) they diverge,
because Arrow simply doesn’t carry the needed naming information unless we
stash it in field metadata.
Practically speaking, the patterns available are:
1. If you already have the Avro writer schema JSON, reuse that as the
`read_schema` (or store it in `SCHEMA_METADATA_KEY` on the Arrow `Schema`),
**or**
2. When you do want to round‑trip via Arrow, attach `avro.name` /
`avro.namespace` on the relevant struct/dictionary fields so the generated
reader schema can match the writer’s names.
I totally agree this is not obvious from the API, and from your perspective
it just looks like you passed a perfectly valid Arrow schema and the resolution
crashed. So at a minimum we should:
* Document this limitation more clearly, and
* Consider whether we can support a more permissive resolution mode for
reader schemas derived from Arrow (i.e. fall back to structural matching when
the reader has no `avro.name`/`avro.namespace`), without breaking
spec‑compliant cases.
* Extend the public API to hide as much of this metadata-related
complexity as possible. Perhaps new logic to make it easy to derive a
resolution-compatible reader schema from a specific writer schema?
If you’re interested, we could track that as a follow‑up enhancement:
something along the lines of "support using pure Arrow schemas as reader
schemas without requiring Avro naming metadata", and then discuss what
trade‑offs / flags that would need.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]