jecsand838 commented on issue #9231: URL: https://github.com/apache/arrow-rs/issues/9231#issuecomment-3775330527
@mzabaluev Thanks for the clear reproducer and sample file. This looks like a schema-resolution bug in `arrow-avro`. Per the Avro spec, when the reader schema is a union but the writer's is not, the reader must select the first matching union branch and then recursively resolve against it (i.e., decode using the writer's encoding, with no union tag consumed). In this case, the writer has `array<int>` and the reader requests `array<union<null,int>>`, which should be compatible. The fix should be implementable in the `(writer_non_union, reader_union)` resolver arm in `codec.rs`. I'll work on getting a PR up for this tonight.

> Should the nullable reader schema extend the non-nullable writer schema?

Yes -- according to the Avro spec, this is explicitly supported:

- If the reader schema is a union but the writer's is not, the reader must pick the first union branch that matches the writer schema, and that branch is then recursively resolved against the writer schema.
- Because arrays are resolved recursively on their item schemas, this applies to array elements too.

So:

- writer: `array<int>`
- reader: `array<["null","int"]>`

should be compatible, and Spark successfully reading the file is consistent with this expected behavior.
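For reference, here is a minimal, self-contained sketch of that resolution rule. The `Schema` enum and the `first_matching_branch` / `schemas_match` functions are hypothetical illustrations of the spec's behavior, not arrow-avro's actual types or the real `codec.rs` resolver arm:

```rust
// Minimal sketch of the Avro resolution rule discussed above. All types and
// function names here are hypothetical illustrations, not arrow-avro's API.

#[derive(Debug)]
enum Schema {
    Null,
    Int,
    Array(Box<Schema>),
    Union(Vec<Schema>),
}

/// Spec rule: if the reader schema is a union but the writer's is not, pick
/// the first reader branch that matches the writer and resolve against it.
/// No union tag is consumed from the data, because the writer never wrote one.
fn first_matching_branch<'a>(writer: &Schema, reader: &'a Schema) -> Option<&'a Schema> {
    match (writer, reader) {
        (w, Schema::Union(branches)) if !matches!(w, Schema::Union(_)) => {
            branches.iter().find(|branch| schemas_match(w, branch))
        }
        _ => None,
    }
}

/// Simplified compatibility check; the real rules also cover promotions,
/// named types, maps, records, etc.
fn schemas_match(writer: &Schema, reader: &Schema) -> bool {
    match (writer, reader) {
        (Schema::Null, Schema::Null) | (Schema::Int, Schema::Int) => true,
        // Arrays resolve recursively on their item schemas, so the
        // non-union-writer / union-reader rule applies to elements too.
        (Schema::Array(w_items), Schema::Array(r_items)) => {
            schemas_match(w_items, r_items)
                || first_matching_branch(w_items, r_items).is_some()
        }
        _ => false,
    }
}

fn main() {
    // writer: array<int>, reader: array<["null","int"]>
    let writer = Schema::Array(Box::new(Schema::Int));
    let reader = Schema::Array(Box::new(Schema::Union(vec![Schema::Null, Schema::Int])));

    // The reader's item union resolves to its `int` branch, so the two
    // schemas are compatible.
    assert!(schemas_match(&writer, &reader));
    println!(
        "resolved item branch: {:?}",
        first_matching_branch(&Schema::Int, &Schema::Union(vec![Schema::Null, Schema::Int]))
    );
}
```

The key point for the decoder is in that last rule: since the writer never wrote a union branch index, the fix has to decode the writer's plain `int` items directly and only surface them under the reader's nullable type.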
