[ 
https://issues.apache.org/jira/browse/AVRO-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18073408#comment-18073408
 ] 

Cedric Holzer commented on AVRO-4238:
-------------------------------------

Created a PR addressing the issue: [https://github.com/apache/avro/pull/3730]

> FastReader fails to unbox nested type when defaulting a union<array<>> field
> ----------------------------------------------------------------------------
>
>                 Key: AVRO-4238
>                 URL: https://issues.apache.org/jira/browse/AVRO-4238
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.12.1
>         Environment: Java 21, Gradle 8.11, org.apache.avro:avro:1.12.1
>            Reporter: Cedric Holzer
>            Priority: Minor
>
> h2. Description
> When using the FastReader, schema evolution fails with an 
> AvroRuntimeException when the reader schema adds a new field of type 
> union<array<T>, null> with a default value of an empty array.
> FastReader is enabled by default in 1.12.1; older versions are affected if 
> FastReader was enabled manually.
> h3. Cause
> In 
> [FastReaderBuilder.getDefaultingStep()|https://github.com/apache/avro/blob/4e376735ebbd14cc17e53116183039c8c4ced8ab/lang/java/avro/src/main/java/org/apache/avro/io/FastReaderBuilder.java#L191],
>  the fast path for empty lists calls 
> {code:java}
> data.newArray(old, 0, field.schema()) {code}
> field.schema() returns the union schema of the field. 
> [GenericData.newArray()|https://github.com/apache/avro/blob/4e376735ebbd14cc17e53116183039c8c4ced8ab/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L1527]
>  internally calls schema.getElementType(), which is only valid for Array-type 
> schemas and therefore throws the error we see.
> h2. Steps to reproduce
> Run the following minimal reproducible example:
> {code:java}
> import org.apache.avro.Schema;
> import org.apache.avro.SchemaBuilder;
> import org.apache.avro.generic.GenericData;
> import org.apache.avro.generic.GenericDatumReader;
> import org.apache.avro.generic.GenericDatumWriter;
> import org.apache.avro.io.DatumReader;
> import org.apache.avro.io.DatumWriter;
> import org.apache.avro.io.Decoder;
> import org.apache.avro.io.DecoderFactory;
> import org.apache.avro.io.Encoder;
> import org.apache.avro.io.EncoderFactory;
> import org.apache.avro.specific.SpecificData;
> import java.io.ByteArrayInputStream;
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
> import static java.util.Collections.emptyList;
> public class AvroExample {
>     final static Schema EMPTY_RECORD = SchemaBuilder
>             .record("EmptyRecord")
>             .fields()
>             .endRecord();
>     // adds union<array<EmptyRecord>, null> someField = []
>     final static Schema READ_SCHEMA = SchemaBuilder.record("EvolvedRecord")
>             .fields()
>             .name("someField")
>             .type()
>             .unionOf()
>             .array()
>             .items(EMPTY_RECORD)
>             .and()
>             .nullType()
>             .endUnion()
>             .arrayDefault(emptyList())
>             .endRecord();
>     public static void main(String... args) throws IOException {
>         // Disable fast reader -> works as specified, enable -> throws exception
>         GenericData model = SpecificData.get().setFastReaderEnabled(true);
>         // Serialize the empty record with the empty writer Schema
>         final Schema writeSchema = EMPTY_RECORD;
>         final byte[] serialized;
>         try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
>             Encoder encoder = EncoderFactory.get().binaryEncoder(baos, null);
>             DatumWriter<GenericData.Record> w = new GenericDatumWriter<>(EMPTY_RECORD, model);
>             GenericData.Record emptyRecord = new GenericData.Record(writeSchema);
>             w.write(emptyRecord, encoder);
>             encoder.flush();
>             serialized = baos.toByteArray();
>         }
>         // Deserialize with readSchema, Avro should create the new field with its default value
>         try (ByteArrayInputStream bais = new ByteArrayInputStream(serialized)) {
>             Decoder decoder = DecoderFactory.get().directBinaryDecoder(bais, null);
>             DatumReader<GenericData.Record> r = new GenericDatumReader<>(writeSchema, READ_SCHEMA, model);
>             final Object deserialized = r.read(null, decoder);
>             System.out.println(deserialized);
>         }
>     }
> }{code}
> h3. Expected Behaviour
> Deserialization succeeds, the new field someField is populated with its 
> default value, and the program prints \{"someField": []}. This is the 
> behaviour observed with .setFastReaderEnabled(false).
> h3. Actual Behaviour
> {noformat}
> Exception in thread "main" org.apache.avro.AvroRuntimeException: Not an array: [{"type":"array","items":{"type":"record","name":"EmptyRecord","fields":[]}},"null"]
>     at org.apache.avro.Schema.getElementType(Schema.java:374)
>     at org.apache.avro.generic.GenericData.newArray(GenericData.java:1528)
>     at org.apache.avro.io.FastReaderBuilder.lambda$getDefaultingStep$5(FastReaderBuilder.java:199)
>     at org.apache.avro.io.FastReaderBuilder.lambda$createFieldSetter$1(FastReaderBuilder.java:181)
>     at org.apache.avro.io.FastReaderBuilder$RecordReader.read(FastReaderBuilder.java:575)
>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:150)
>     at AvroExample.main(AvroExample.java:61){noformat}
> h2. Suggested Change
> In 
> [FastReaderBuilder.getDefaultingStep()|https://github.com/apache/avro/blob/4e376735ebbd14cc17e53116183039c8c4ced8ab/lang/java/avro/src/main/java/org/apache/avro/io/FastReaderBuilder.java#L198],
>  when the default value is an empty list, check whether the field type is a 
> union and, if so, unwrap its first branch of type array:
> {code:java}
> // Current (broken):
> (old, d) -> data.newArray(old, 0, field.schema())
> // Fix: unwrap the union to find the array branch.
> Schema resolved = field.schema();
> if (resolved.getType() == Schema.Type.UNION) {
>     resolved = resolved.getTypes().stream()
>         .filter(s -> s.getType() == Schema.Type.ARRAY)
>         .findFirst()
>         .orElse(resolved);
> }
> // Lambdas may only capture effectively final locals, so copy the result.
> final Schema arraySchema = resolved;
> (old, d) -> data.newArray(old, 0, arraySchema){code}
> h2. Workaround
> Disable FastReader by setting 
> {code:java}
> -Dorg.apache.avro.fastread=false{code}
> or change your type to be 
> {code:java}
> union<null, array<T>> = null{code}
> which works correctly but changes the default value to null instead of `[]`.
>  
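
The cause and the suggested unwrapping described in the issue can be tried without Avro on the classpath. The following is only a minimal sketch: ToySchema, its Type enum and the arrayBranch helper are hypothetical stand-ins invented for illustration, not org.apache.avro.Schema and not the code from the PR. It assumes, as the issue describes, that asking a union for its element type fails and that the first array branch of the union is the one to use.

```java
import java.util.List;

public class UnionUnwrapSketch {
    enum Type { ARRAY, UNION, NULL, RECORD }

    // Toy stand-in for a schema node (hypothetical, NOT org.apache.avro.Schema).
    static final class ToySchema {
        final Type type;
        final List<ToySchema> branches; // populated only for UNION
        final ToySchema elementType;    // populated only for ARRAY

        private ToySchema(Type type, List<ToySchema> branches, ToySchema elementType) {
            this.type = type;
            this.branches = branches;
            this.elementType = elementType;
        }

        static ToySchema of(Type type) {
            return new ToySchema(type, List.of(), null);
        }

        static ToySchema arrayOf(ToySchema items) {
            return new ToySchema(Type.ARRAY, List.of(), items);
        }

        static ToySchema unionOf(ToySchema... branches) {
            return new ToySchema(Type.UNION, List.of(branches), null);
        }

        // Mirrors the failure mode from the issue: only array schemas have an element type.
        ToySchema getElementType() {
            if (type != Type.ARRAY) {
                throw new IllegalStateException("Not an array: " + type);
            }
            return elementType;
        }
    }

    // Returns the first ARRAY branch of a union; any other schema is returned unchanged.
    static ToySchema arrayBranch(ToySchema schema) {
        if (schema.type != Type.UNION) {
            return schema;
        }
        return schema.branches.stream()
            .filter(s -> s.type == Type.ARRAY)
            .findFirst()
            .orElse(schema);
    }

    public static void main(String[] args) {
        ToySchema record = ToySchema.of(Type.RECORD);
        ToySchema union = ToySchema.unionOf(ToySchema.arrayOf(record), ToySchema.of(Type.NULL));

        try {
            union.getElementType(); // same shape of call as the failing fast path
        } catch (IllegalStateException e) {
            System.out.println("union: " + e.getMessage());
        }
        System.out.println("unwrapped element type: " + arrayBranch(union).getElementType().type);
    }
}
```

In this toy model the direct call on the union throws, while the call on the unwrapped branch returns the record element type. Falling back to the original schema when no array branch exists keeps the behaviour for non-union fields unchanged, which matches the orElse fallback in the suggested change.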



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
