jecsand838 opened a new pull request, #9237:
URL: https://github.com/apache/arrow-rs/pull/9237

   # Which issue does this PR close?
   
   - Closes #9231.
   
   # Rationale for this change
   
   Avro schema resolution allows a reader schema to represent “nullable” values 
using a two-branch union (`["null", T]` or `[T, "null"]`) while still reading 
data written with the non-union schema `T` (i.e. without union discriminants in 
the encoded data).
   
   In `arrow-avro`, resolving a non-union writer type against a reader union 
(notably for array/list item schemas like `items: ["null", "int"]`) could 
incorrectly treat the encoded stream as a union and attempt to decode a union 
discriminant that the writer never emitted. The decoder then consumed value 
bytes as branch indexes, fell out of step with the stream, and could fail with 
`ParseError("bad varint")` for certain files (see #9231).
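
The misalignment described above can be illustrated with a small, standalone sketch. It is not code from `arrow-avro`: the function names (`read_long`, `read_plain_items`, `read_items_as_union`) are hypothetical, and it handles only a single array block with a non-negative count. It relies only on the Avro binary encoding rules: `int`/`long` values are zigzag varints, and a union value is a branch index followed by the branch's value.

```rust
/// Decode one Avro zigzag varint (`int`/`long`) from `buf` at `*pos`.
/// Returns `None` on truncated or over-long input, which is the kind of
/// condition a real reader might report as a bad varint.
fn read_long(buf: &[u8], pos: &mut usize) -> Option<i64> {
    let mut value: u64 = 0;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(*pos)?; // running off the end is an error
        *pos += 1;
        if shift >= 64 {
            return None; // more continuation bytes than a 64-bit value allows
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    Some(((value >> 1) as i64) ^ -((value & 1) as i64))
}

/// Read an array whose items were written as plain `int` (no discriminant).
/// Sketch only: assumes one block with a non-negative count, then a 0 terminator.
fn read_plain_items(buf: &[u8]) -> Option<Vec<i64>> {
    let mut pos = 0;
    let count = read_long(buf, &mut pos)?;
    let mut items = Vec::new();
    for _ in 0..count {
        items.push(read_long(buf, &mut pos)?);
    }
    if read_long(buf, &mut pos)? != 0 {
        return None; // expected the end-of-array marker
    }
    Some(items)
}

/// Read the same bytes while wrongly expecting a `["null", "int"]`
/// discriminant before every item, as in the failure mode described above.
fn read_items_as_union(buf: &[u8]) -> Option<Vec<Option<i64>>> {
    let mut pos = 0;
    let count = read_long(buf, &mut pos)?;
    let mut items = Vec::new();
    for _ in 0..count {
        match read_long(buf, &mut pos)? {
            0 => items.push(None),                             // "null" branch
            1 => items.push(Some(read_long(buf, &mut pos)?)), // "int" branch
            _ => return None,                                  // no such branch
        }
    }
    Some(items)
}
```

For the bytes `[0x06, 0x02, 0x02, 0x02, 0x00]` (count 3, three items equal to 1, terminator), `read_plain_items` yields `[1, 1, 1]`, while `read_items_as_union` uses every other byte as a branch index and runs out of input before the third item.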
   
   # What changes are included in this PR?
   
   - Fix schema resolution when the *writer* schema is non-union and the 
*reader* schema is a union:
     - Special-case two-branch unions containing `null` to be treated as 
“nullable” (capturing whether `null` is first or second), and resolve against 
the non-null branch.
     - Improve matching for general reader unions by attempting to resolve 
against each union variant, preferring a direct match, and constructing the 
appropriate union resolution mapping for the selected branch.
     - Ensure promotions are represented at the union-resolution level 
(avoiding nested promotion resolution on the selected union child).
   
   - Add regression coverage for the bug and the fixed behavior:
     - `test_resolve_array_writer_nonunion_items_reader_nullable_items` (schema 
resolution / codec)
     - `test_array_decoding_writer_nonunion_items_reader_nullable_items` 
(record decoding; ensures correct byte consumption and decoded values)
     - `test_bad_varint_bug_nullable_array_items` (end-to-end reader regression 
using a small Avro fixture)
   
   - Add a small compressed Avro fixture under 
`arrow-avro/test/data/bad-varint-bug.avro.gz` used by the regression test.
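
The resolution rules in the first bullet above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: the `Schema` and `Resolution` types and the `resolve`/`resolve_leaf` functions are hypothetical, the writer is assumed to be non-union, `int` to `long` stands in for all promotions, and a `null` writer is assumed to skip the nullable special case.

```rust
// Hypothetical, heavily reduced schema model for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Null,
    Int,
    Long,
    String,
    Union(Vec<Schema>),
}

// Hypothetical description of how writer data maps onto the reader schema.
#[derive(Debug, PartialEq)]
enum Resolution {
    /// Writer and reader types are identical.
    Direct,
    /// Writer value is widened to the reader type (here: int -> long).
    Promote,
    /// Reader is `["null", T]` or `[T, "null"]`; data carries no discriminant.
    Nullable { null_first: bool, inner: Box<Resolution> },
    /// Reader is a general union; the promotion flag lives at the union level.
    UnionBranch { index: usize, promoted: bool },
}

/// Resolve a non-union writer type against a non-union reader type.
fn resolve_leaf(writer: &Schema, reader: &Schema) -> Option<Resolution> {
    match (writer, reader) {
        (w, r) if w == r => Some(Resolution::Direct),
        (Schema::Int, Schema::Long) => Some(Resolution::Promote),
        _ => None,
    }
}

/// Resolve a non-union writer type against any reader type.
fn resolve(writer: &Schema, reader: &Schema) -> Option<Resolution> {
    let branches = match reader {
        Schema::Union(branches) => branches,
        _ => return resolve_leaf(writer, reader),
    };

    // Special case: two-branch union with exactly one `null` is "nullable".
    // Assumption of this sketch: a `null` writer takes the general path below.
    if let [a, b] = branches.as_slice() {
        let null_first = *a == Schema::Null;
        let null_second = *b == Schema::Null;
        if null_first != null_second && *writer != Schema::Null {
            let non_null = if null_first { b } else { a };
            let inner = resolve_leaf(writer, non_null)?;
            return Some(Resolution::Nullable { null_first, inner: Box::new(inner) });
        }
    }

    // General union: prefer a direct match over any promotable branch.
    if let Some(index) = branches.iter().position(|b| b == writer) {
        return Some(Resolution::UnionBranch { index, promoted: false });
    }
    branches
        .iter()
        .position(|b| resolve_leaf(writer, b).is_some())
        .map(|index| Resolution::UnionBranch { index, promoted: true })
}
```

In this sketch, resolving writer `int` against reader `["null", "int"]` gives `Nullable { null_first: true, .. }`, so a decoder built from it would read each item as a bare `int`. Against `["string", "long", "int"]` the direct `int` branch (index 2) wins over the promotable `long` branch.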
   
   # Are these changes tested?
   
   Yes. This PR adds targeted unit/integration tests that reproduce the prior 
failure mode and validate correct schema resolution and decoding for 
nullable-union array items.
   
   # Are there any user-facing changes?
   
   Yes (bug fix): reading Avro files with arrays whose element type is 
represented as a nullable union in the reader schema (e.g. `items: ["null", 
"int"]`) now succeeds instead of failing with `ParseError("bad varint")`. No 
public API changes are intended.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
