Hello!  It's the same issue in your example code as with allegro, even
with the SpecificDatumReader.

This line: datumReader = new SpecificDatumReader<>(schema)
should be: datumReader = new SpecificDatumReader<>(originalSchema, schema)

In Avro, the original schema is commonly known as the writer schema
(the schema instance that was used to write the binary data).  Schema
evolution only applies when you use the SpecificDatumReader
constructor that takes *both* the writer and reader schemas.
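
In other words (a minimal sketch; MyRecord is just a placeholder for any
generated class, and writerSchema stands for wherever you obtain the schema
that actually produced the bytes):

// Assumes the bytes were written with MyRecord's current schema -- no resolution.
DatumReader<MyRecord> reader = new SpecificDatumReader<>(MyRecord.getClassSchema());

// Resolves from the writer schema to MyRecord's current (reader) schema,
// applying defaults for fields added since the data was written.
DatumReader<MyRecord> evolvingReader =
    new SpecificDatumReader<>(writerSchema, MyRecord.getClassSchema());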

As a concrete example, if your original schema was:

{
  "type": "record",
  "name": "Simple",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name","type": "string"}
  ]
}

And you added a field:

{
  "type": "record",
  "name": "SimpleV2",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "description","type": ["null", "string"]}
  ]
}

You could do the following safely, assuming that the Simple and SimpleV2
classes are generated by the avro-maven-plugin:

@Test
public void testSerializeDeserializeEvolution() throws IOException {
  // Write a Simple v1 to bytes using your exact method.
  byte[] v1AsBytes = serialize(new Simple(1, "name1"), true, false);

  // Read as Simple v2, same as your method but with the writer and reader schema.
  DatumReader<SimpleV2> datumReader =
      new SpecificDatumReader<>(Simple.getClassSchema(), SimpleV2.getClassSchema());
  Decoder decoder = DecoderFactory.get().binaryDecoder(v1AsBytes, null);
  SimpleV2 v2 = datumReader.read(null, decoder);

  assertThat(v2.getId(), is(1));
  assertThat(v2.getName(), is(new Utf8("name1")));
  assertThat(v2.getDescription(), nullValue());
}

This example uses two different schemas and SpecificRecord classes in
the same test, but the same principle applies when it's a single record
type that has evolved -- you need to know the original schema that wrote
the data in order to resolve it against the schema that you're now using
for reading.
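
Applied to your deserialize method (quoted below), a rough sketch of the
change could look like this -- the extra writerSchema parameter is
hypothetical, and you would have to obtain the schema that originally wrote
the bytes from wherever it is kept (e.g. stored alongside the data or in a
registry):

public static <T extends SpecificRecordBase> T deserialize(Class<T> targetType,
                                                           byte[] data,
                                                           Schema writerSchema,
                                                           boolean useBinaryDecoder) {
    try {
        if (data == null) {
            return null;
        }
        // Reader schema: the schema of the (possibly newer) generated class.
        Schema readerSchema = targetType.newInstance().getSchema();
        // Passing both schemas lets the ResolvingDecoder fill in defaults for
        // fields present in the reader schema but absent from the writer schema.
        DatumReader<T> datumReader = new SpecificDatumReader<>(writerSchema, readerSchema);
        Decoder decoder = useBinaryDecoder
                ? DecoderFactory.get().binaryDecoder(data, null)
                : DecoderFactory.get().jsonDecoder(writerSchema, new String(data));
        return datumReader.read(null, decoder);
    } catch (Exception ex) {
        throw new SerializationException("Error deserializing data", ex);
    }
}

The serializer can stay as it is; only the reading side needs to know both schemas.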

I hope this clarifies what you are looking for!

All my best, Ryan



On Tue, Jul 30, 2019 at 3:30 PM Martin Mucha <alfon...@gmail.com> wrote:
>
> Thanks for the answer.
>
> Actually I have exactly the same behavior with avro 1.9.0 and the following 
> deserializer in our other app, which uses strictly the avro codebase, and it fails 
> with the same exceptions. So let's leave the "allegro" library and lots of other tools 
> out of our discussion.
> I can use whichever approach. All I need is a single way to 
> deserialize byte[] into a class generated by the avro-maven-plugin, and which 
> respects the documentation regarding schema evolution. Currently we're using the 
> following deserializer and serializer, and these do not work when it comes 
> to schema evolution. What is the correct way to serialize and deserialize 
> avro data?
>
> I probably don't understand your point about GenericRecord or 
> GenericDatumReader. I tried to use GenericDatumReader in the deserializer below, 
> but then it seems I got back just a GenericData$Record instance, which I can 
> then use to access an array of instances, which is not what I'm looking 
> for (IIUC), since in that case I could have just used plain old JSON and 
> deserialized it using jackson, having no schema evolution problems at all. If 
> that's correct, I'd rather stick to SpecificDatumReader and somehow fix it 
> if possible.
>
> What can be done? Or how is schema evolution intended to be used? I found a 
> lot of questions while searching for this answer.
>
> thanks!
> Martin.
>
> deserializer:
>
> public static <T extends SpecificRecordBase> T deserialize(Class<T> targetType,
>                                                            byte[] data,
>                                                            boolean useBinaryDecoder) {
>         try {
>             if (data == null) {
>                 return null;
>             }
>
>             log.trace("data='{}'", DatatypeConverter.printHexBinary(data));
>
>             Schema schema = targetType.newInstance().getSchema();
>             DatumReader<GenericRecord> datumReader = new SpecificDatumReader<>(schema);
>             Decoder decoder = useBinaryDecoder
>                     ? DecoderFactory.get().binaryDecoder(data, null)
>                     : DecoderFactory.get().jsonDecoder(schema, new String(data));
>
>             T result = targetType.cast(datumReader.read(null, decoder));
>             log.trace("deserialized data='{}'", result);
>             return result;
>         } catch (Exception ex) {
>             throw new SerializationException("Error deserializing data", ex);
>         }
>     }
>
> serializer:
>
> public static <T extends SpecificRecordBase> byte[] serialize(T data,
>                                                               boolean useBinaryDecoder,
>                                                               boolean pretty) {
>         try {
>             if (data == null) {
>                 return new byte[0];
>             }
>
>             log.debug("data='{}'", data);
>             Schema schema = data.getSchema();
>             ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
>             Encoder binaryEncoder = useBinaryDecoder
>                     ? EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null)
>                     : EncoderFactory.get().jsonEncoder(schema, byteArrayOutputStream, pretty);
>
>             DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
>             datumWriter.write(data, binaryEncoder);
>
>             binaryEncoder.flush();
>             byteArrayOutputStream.close();
>
>             byte[] result = byteArrayOutputStream.toByteArray();
>             log.debug("serialized data='{}'", DatatypeConverter.printHexBinary(result));
>             return result;
>         } catch (IOException ex) {
>             throw new SerializationException(
>                     "Can't serialize data='" + data, ex);
>         }
>     }
>
> On Tue, Jul 30, 2019 at 1:48 PM Ryan Skraba <r...@skraba.com> wrote:
>>
>> Hello!  Schema evolution relies on both the writer and reader schemas
>> being available.
>>
>> It looks like the allegro tool you are using creates a
>> GenericDatumReader that assumes the reader and writer schemas are the
>> same:
>>
>> https://github.com/allegro/json-avro-converter/blob/json-avro-converter-0.2.8/converter/src/main/java/tech/allegro/schema/json2avro/converter/JsonAvroConverter.java#L83
>>
>> I do not believe that the "default" value is taken into account for
>> data that is simply missing from the binary input; it only applies when a
>> field is known to be in the reader schema but missing from the original
>> writer schema.
>>
>> You may have more luck reading the data into a GenericRecord with a
>> GenericDatumReader constructed with both schemas, and then using
>> `convertToJson(record)`.
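>>
>> A rough sketch of that alternative (the schema variables are placeholders
>> for however you obtain the writer and reader schemas):
>>
>> GenericDatumReader<GenericRecord> reader =
>>     new GenericDatumReader<>(writerSchema, readerSchema);
>> Decoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);
>> GenericRecord record = reader.read(null, decoder);
>> // then hand `record` to the converter's convertToJson(record)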
>>
>> I hope this is useful -- Ryan
>>
>>
>>
>> On Tue, Jul 30, 2019 at 10:20 AM Martin Mucha <alfon...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I've got some issues/misunderstanding of AVRO schema evolution.
>> >
>> > When reading through the avro documentation, for example [1], I understood 
>> > that schema evolution is supported, and that if I add a column with a specified 
>> > default, it should be backwards compatible (and even forward compatible when I 
>> > remove it again). Sounds great, so I added a column defined as:
>> >
>> >         {
>> >           "name": "newColumn",
>> >           "type": ["null","string"],
>> >           "default": null,
>> >           "doc": "something wrong"
>> >         }
>> >
>> > and tried to consume some topic having this schema from the beginning, but it fails 
>> > with the message:
>> >
>> > Caused by: java.lang.ArrayIndexOutOfBoundsException: 5
>> >     at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:424)
>> >     at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
>> >     at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>> >     at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
>> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
>> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >     at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
>> >     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
>> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
>> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >     at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
>> >     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
>> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
>> >     at tech.allegro.schema.json2avro.converter.JsonAvroConverter.convertToJson(JsonAvroConverter.java:83)
>> >
>> > To give a little bit more information: the Avro schema defines one top-level 
>> > type having 2 fields: a string describing the type of message, and a union of N 
>> > types. All N-1 non-modified types can be read, but the one updated with the 
>> > optional, default-having column cannot be read. I'm not sure if this 
>> > design is strictly speaking correct, but that's not the point (feel free 
>> > to criticise and recommend a better approach!). I'm after schema evolution, 
>> > which seems not to be working.
>> >
>> >
>> > And if we alter the type definition to:
>> >
>> > "type": "string",
>> > "default": ""
>> >
>> > it still does not work and the generated error is:
>> >
>> > Caused by: org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -1
>> >     at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
>> >     at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
>> >     at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
>> >     at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
>> >     at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:414)
>> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:181)
>> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >     at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
>> >     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
>> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
>> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >     at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
>> >     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
>> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
>> >     at tech.allegro.schema.json2avro.converter.JsonAvroConverter.convertToJson(JsonAvroConverter.java:83)
>> >
>> > Am I doing something wrong?
>> >
>> > thanks,
>> > Martin.
>> >
>> > [1] 
>> > https://docs.oracle.com/database/nosql-12.1.3.4/GettingStartedGuide/schemaevolution.html#changeschema-rules
