[
https://issues.apache.org/jira/browse/AVRO-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18073408#comment-18073408
]
Cedric Holzer commented on AVRO-4238:
-------------------------------------
Created a PR addressing the issue: [https://github.com/apache/avro/pull/3730]
> FastReader fails to unbox nested type when defaulting a union<array<>> field
> ----------------------------------------------------------------------------
>
> Key: AVRO-4238
> URL: https://issues.apache.org/jira/browse/AVRO-4238
> Project: Apache Avro
> Issue Type: Bug
> Components: java
> Affects Versions: 1.12.1
> Environment: Java 21, Gradle 8.11, org.apache.avro:avro:1.12.1
> Reporter: Cedric Holzer
> Priority: Minor
>
> h2. Description
> With FastReader enabled, schema evolution fails with an
> AvroRuntimeException when the reader schema adds a new field of type
> union<array<T>, null> whose default value is an empty array.
> FastReader is enabled by default in 1.12.1; older versions are affected if
> FastReader was enabled manually.
> h3. Cause
> In
> [FastReaderBuilder.getDefaultingStep()|https://github.com/apache/avro/blob/4e376735ebbd14cc17e53116183039c8c4ced8ab/lang/java/avro/src/main/java/org/apache/avro/io/FastReaderBuilder.java#L191],
> the fast path for empty-list defaults calls
> {code:java}
> data.newArray(old, 0, field.schema()) {code}
> field.schema() returns the union schema of the field.
> [GenericData.newArray()|https://github.com/apache/avro/blob/4e376735ebbd14cc17e53116183039c8c4ced8ab/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java#L1527]
> internally calls schema.getElementType(), which is only valid for array
> schemas and therefore throws the AvroRuntimeException ("Not an array")
> shown under Actual Behaviour.
> h2. Steps to reproduce
> Run the following minimal reproducible example:
> {code:java}
> import org.apache.avro.Schema;
> import org.apache.avro.SchemaBuilder;
> import org.apache.avro.generic.GenericData;
> import org.apache.avro.generic.GenericDatumReader;
> import org.apache.avro.generic.GenericDatumWriter;
> import org.apache.avro.io.DatumReader;
> import org.apache.avro.io.DatumWriter;
> import org.apache.avro.io.Decoder;
> import org.apache.avro.io.DecoderFactory;
> import org.apache.avro.io.Encoder;
> import org.apache.avro.io.EncoderFactory;
> import org.apache.avro.specific.SpecificData;
>
> import java.io.ByteArrayInputStream;
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
>
> import static java.util.Collections.emptyList;
>
> public class AvroExample {
>     final static Schema EMPTY_RECORD = SchemaBuilder
>             .record("EmptyRecord")
>             .fields()
>             .endRecord();
>
>     // adds union<array<EmptyRecord>, null> someField = []
>     final static Schema READ_SCHEMA = SchemaBuilder.record("EvolvedRecord")
>             .fields()
>             .name("someField")
>             .type()
>             .unionOf()
>             .array()
>             .items(EMPTY_RECORD)
>             .and()
>             .nullType()
>             .endUnion()
>             .arrayDefault(emptyList())
>             .endRecord();
>
>     public static void main(String... args) throws IOException {
>         // Disable fast reader -> works as specified, enable -> throws exception
>         GenericData model = SpecificData.get().setFastReaderEnabled(true);
>
>         // Serialize the empty record with the empty writer Schema
>         final Schema writeSchema = EMPTY_RECORD;
>         final byte[] serialized;
>         try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
>             Encoder encoder = EncoderFactory.get().binaryEncoder(baos, null);
>             DatumWriter<GenericData.Record> w = new GenericDatumWriter<>(EMPTY_RECORD, model);
>             GenericData.Record emptyRecord = new GenericData.Record(writeSchema);
>             w.write(emptyRecord, encoder);
>             encoder.flush();
>             serialized = baos.toByteArray();
>         }
>
>         // Deserialize with readSchema, Avro should create the new field with its default value
>         try (ByteArrayInputStream bais = new ByteArrayInputStream(serialized)) {
>             Decoder decoder = DecoderFactory.get().directBinaryDecoder(bais, null);
>             DatumReader<GenericData.Record> r = new GenericDatumReader<>(writeSchema, READ_SCHEMA, model);
>             final Object deserialized = r.read(null, decoder);
>             System.out.println(deserialized);
>         }
>     }
> }{code}
> h3. Expected Behaviour
> Deserialization succeeds and the new field someField is populated with its
> default value, so the program prints \{"someField": []}. This is the observed
> behaviour with .setFastReaderEnabled(false).
> h3. Actual Behaviour
> {noformat}
> Exception in thread "main" org.apache.avro.AvroRuntimeException: Not an array: [{"type":"array","items":{"type":"record","name":"EmptyRecord","fields":[]}},"null"]
>     at org.apache.avro.Schema.getElementType(Schema.java:374)
>     at org.apache.avro.generic.GenericData.newArray(GenericData.java:1528)
>     at org.apache.avro.io.FastReaderBuilder.lambda$getDefaultingStep$5(FastReaderBuilder.java:199)
>     at org.apache.avro.io.FastReaderBuilder.lambda$createFieldSetter$1(FastReaderBuilder.java:181)
>     at org.apache.avro.io.FastReaderBuilder$RecordReader.read(FastReaderBuilder.java:575)
>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:150)
>     at AvroExample.main(AvroExample.java:61){noformat}
> h2. Suggested Change
> In
> [FastReaderBuilder.getDefaultingStep()|https://github.com/apache/avro/blob/4e376735ebbd14cc17e53116183039c8c4ced8ab/lang/java/avro/src/main/java/org/apache/avro/io/FastReaderBuilder.java#L198],
> when the default value is an empty list, the code could check whether the
> field type is a union and, if so, unwrap the first branch of type array:
> {code:java}
> // Current (broken): passes the union schema to newArray()
> (old, d) -> data.newArray(old, 0, field.schema())
> // Fix: unwrap the union to find the array branch.
> // arraySchema is assigned once so that the lambda can capture it.
> final Schema fieldSchema = field.schema();
> final Schema arraySchema = fieldSchema.getType() == Schema.Type.UNION
>     ? fieldSchema.getTypes().stream()
>         .filter(s -> s.getType() == Schema.Type.ARRAY)
>         .findFirst()
>         .orElse(fieldSchema)
>     : fieldSchema;
> (old, d) -> data.newArray(old, 0, arraySchema){code}
> h2. Workaround
> Disable FastReader by setting
> {code:java}
> -Dorg.apache.avro.fastread=false{code}
> or change the field type to
> {code:java}
> union<null, array<T>> = null{code}
> which works correctly but changes the default value to null instead of `[]`.
>
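The unwrapping proposed under Suggested Change can be tried without Avro on the classpath. The following is only a minimal sketch under stated assumptions: Type and Node are hypothetical stand-ins for org.apache.avro.Schema.Type and org.apache.avro.Schema, and arrayBranchOf is an illustrative helper name, not an existing Avro method. It mirrors the failure (asking a union for its element type) and the proposed lookup of the first array branch.

```java
import java.util.List;

final class UnionUnwrapSketch {
    // Hypothetical stand-in for org.apache.avro.Schema.Type, reduced to four kinds.
    enum Type { ARRAY, NULL, RECORD, UNION }

    // Hypothetical stand-in for org.apache.avro.Schema; not the Avro class.
    static final class Node {
        final Type type;
        final Node elementType;    // non-null only for ARRAY
        final List<Node> branches; // non-empty only for UNION

        private Node(Type type, Node elementType, List<Node> branches) {
            this.type = type;
            this.elementType = elementType;
            this.branches = branches;
        }

        static Node of(Type type) {
            return new Node(type, null, List.of());
        }

        static Node arrayOf(Node elementType) {
            return new Node(Type.ARRAY, elementType, List.of());
        }

        static Node unionOf(Node... branches) {
            return new Node(Type.UNION, null, List.of(branches));
        }

        // Like the call described in the report, this is only valid on arrays.
        Node getElementType() {
            if (type != Type.ARRAY) {
                throw new IllegalStateException("Not an array: " + type);
            }
            return elementType;
        }
    }

    // Returns the first ARRAY branch of a union; any other schema is returned unchanged.
    static Node arrayBranchOf(Node schema) {
        if (schema.type != Type.UNION) {
            return schema;
        }
        return schema.branches.stream()
                .filter(branch -> branch.type == Type.ARRAY)
                .findFirst()
                .orElse(schema);
    }

    public static void main(String[] args) {
        Node record = Node.of(Type.RECORD);
        Node union = Node.unionOf(Node.arrayOf(record), Node.of(Type.NULL));

        try {
            union.getElementType(); // the failing call: the union itself is not an array
        } catch (IllegalStateException e) {
            System.out.println("without unwrapping: " + e.getMessage());
        }
        // After unwrapping, the element type is reachable.
        System.out.println("with unwrapping: " + arrayBranchOf(union).getElementType().type);
    }
}
```

Falling back to the original schema when no array branch exists keeps the existing behaviour for plain array fields; whether the actual PR resolves the branch this way is not confirmed here.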