jecsand838 commented on issue #9231: URL: https://github.com/apache/arrow-rs/issues/9231#issuecomment-3775330527
@mzabaluev Thanks for the clear reproducer and sample file. This looks like a schema-resolution bug in `arrow-avro`. Per the Avro spec, when the reader schema is a union but the writer's is not, the reader must select the first matching union branch and then recursively resolve against it (i.e., decode using the writer's encoding, with no union tag consumed). In this case, the writer has `array<int>` and the reader requests `array<union<null,int>>`, which should be compatible. The fix should be implementable in the `(writer_non_union, reader_union)` resolver arm in `codec.rs`. I'll work on getting a PR up for this tonight.

> Should the nullable reader schema extend the non-nullable writer schema?

Yes -- according to the Avro spec, this is explicitly supported:

- If the reader schema is a union but the writer's is not, the reader must pick the first union branch that matches the writer schema, and that branch is then recursively resolved against the writer schema.
- Because arrays are resolved recursively on their item schemas, this applies to array elements too.

So:

- writer: `array<int>`
- reader: `array<["null","int"]>`

should be compatible, and Spark successfully reading the file is consistent with this expected behavior.
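For reference, here is a minimal, self-contained sketch of that resolution rule. The `Schema` enum and the `first_matching_branch` / `schemas_match` functions are hypothetical illustrations of the spec's behavior, not arrow-avro's actual types or the real `codec.rs` resolver arm:

```rust
// Minimal sketch of the Avro resolution rule discussed above. All types and
// function names here are hypothetical illustrations, not arrow-avro's API.

#[derive(Debug)]
enum Schema {
    Null,
    Int,
    Array(Box<Schema>),
    Union(Vec<Schema>),
}

/// Spec rule: if the reader schema is a union but the writer's is not, pick
/// the first reader branch that matches the writer and resolve against it.
/// No union tag is consumed from the data, because the writer never wrote one.
fn first_matching_branch<'a>(writer: &Schema, reader: &'a Schema) -> Option<&'a Schema> {
    match (writer, reader) {
        (w, Schema::Union(branches)) if !matches!(w, Schema::Union(_)) => {
            branches.iter().find(|branch| schemas_match(w, branch))
        }
        _ => None,
    }
}

/// Simplified compatibility check; the real rules also cover promotions,
/// named types, maps, records, etc.
fn schemas_match(writer: &Schema, reader: &Schema) -> bool {
    match (writer, reader) {
        (Schema::Null, Schema::Null) | (Schema::Int, Schema::Int) => true,
        // Arrays resolve recursively on their item schemas, so the
        // non-union-writer / union-reader rule applies to elements too.
        (Schema::Array(w_items), Schema::Array(r_items)) => {
            schemas_match(w_items, r_items)
                || first_matching_branch(w_items, r_items).is_some()
        }
        _ => false,
    }
}

fn main() {
    // writer: array<int>, reader: array<["null","int"]>
    let writer = Schema::Array(Box::new(Schema::Int));
    let reader = Schema::Array(Box::new(Schema::Union(vec![Schema::Null, Schema::Int])));

    // The reader's item union resolves to its `int` branch, so the two
    // schemas are compatible.
    assert!(schemas_match(&writer, &reader));
    println!(
        "resolved item branch: {:?}",
        first_matching_branch(&Schema::Int, &Schema::Union(vec![Schema::Null, Schema::Int]))
    );
}
```

The key point for the decoder is in that last rule: since the writer never wrote a union branch index, the fix has to decode the writer's plain `int` items directly and only surface them under the reader's nullable type.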
