Hi,

just one more question, not strictly related to the subject.

Initially I thought I'd be OK with using some initial version of the schema
in place of the writer schema. That works, but all columns from schemas
older than this initial one would just be ignored. So I need to know
EXACTLY the schema which the writer used. I know that Avro messages contain
either the full schema or at least its ID. Can you point me to the
documentation where this is discussed? In my deserializer I have a byte[]
as input, from which I need to get the schema information first, in order
to be able to deserialize the record. I really do not know how to do that;
I'm pretty sure I have never seen it anywhere, and I cannot find it
anywhere. But in principle it must be possible, since the reader does not
necessarily have any control over which schema the writer used.
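
For illustration, the kind of thing I have in mind (just a sketch; it
assumes the byte[] is an Avro container file, which embeds the writer
schema in its header -- raw binary-encoded records, like the ones produced
by the serializer quoted below, carry no schema at all):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Sketch: read the writer schema from the container-file header,
// then use it together with the reader schema for resolution.
public static Schema readWriterSchema(byte[] data) throws IOException {
    try (DataFileStream<GenericRecord> stream = new DataFileStream<>(
            new ByteArrayInputStream(data), new GenericDatumReader<>())) {
        return stream.getSchema();  // the schema the writer used
    }
}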

thanks a lot.
M.

On Tue, Jul 30, 2019 at 6:16 PM Martin Mucha <alfon...@gmail.com> wrote:

> Thank you very much for the in-depth answer. I understand how it works
> better now; I will test it shortly.
> Thank you for your time.
>
> Martin.
>
> On Tue, Jul 30, 2019 at 5:09 PM Ryan Skraba <r...@skraba.com> wrote:
>
>> Hello!  It's the same issue in your example code as with allegro, even
>> with the SpecificDatumReader.
>>
>> This line: datumReader = new SpecificDatumReader<>(schema)
>> should be: datumReader = new SpecificDatumReader<>(originalSchema, schema)
>>
>> In Avro, the original schema is commonly known as the writer schema
>> (the instance that originally wrote the binary data).  Schema
>> evolution applies when you are using the constructor of the
>> SpecificDatumReader that takes *both* reader and writer schemas.
>>
>> As a concrete example, if your original schema was:
>>
>> {
>>   "type": "record",
>>   "name": "Simple",
>>   "fields": [
>>     {"name": "id", "type": "int"},
>>     {"name": "name","type": "string"}
>>   ]
>> }
>>
>> And you added a field:
>>
>> {
>>   "type": "record",
>>   "name": "SimpleV2",
>>   "fields": [
>>     {"name": "id", "type": "int"},
>>     {"name": "name", "type": "string"},
>>     {"name": "description","type": ["null", "string"]}
>>   ]
>> }
>>
>> You could do the following safely, assuming that Simple and SimpleV2
>> classes are generated from the avro-maven-plugin:
>>
>> @Test
>> public void testSerializeDeserializeEvolution() throws IOException {
>>   // Write a Simple v1 to bytes using your exact method.
>>   byte[] v1AsBytes = serialize(new Simple(1, "name1"), true, false);
>>
>>   // Read as Simple v2, same as your method but with the writer and
>>   // reader schema.
>>   DatumReader<SimpleV2> datumReader =
>>       new SpecificDatumReader<>(Simple.getClassSchema(), SimpleV2.getClassSchema());
>>   Decoder decoder = DecoderFactory.get().binaryDecoder(v1AsBytes, null);
>>   SimpleV2 v2 = datumReader.read(null, decoder);
>>
>>   assertThat(v2.getId(), is(1));
>>   assertThat(v2.getName(), is(new Utf8("name1")));
>>   assertThat(v2.getDescription(), nullValue());
>> }
>>
>> This demonstrates it with two different schemas and SpecificRecords in
>> the same test, but the same principle applies if it's the same record
>> that has evolved -- you need to know the original schema that wrote
>> the data in order to apply the schema that you're now using for
>> reading.
>>
>> I hope this clarifies what you are looking for!
>>
>> All my best, Ryan
>>
>>
>>
>> On Tue, Jul 30, 2019 at 3:30 PM Martin Mucha <alfon...@gmail.com> wrote:
>> >
>> > Thanks for the answer.
>> >
>> > Actually I get exactly the same behavior with Avro 1.9.0 and the
>> > following deserializer in our other app, which uses the Avro codebase
>> > directly and fails with the same exceptions. So let's leave the
>> > "allegro" library and lots of other tools out of our discussion.
>> > I can use whichever approach. All I need is a single way to deserialize
>> > a byte[] into a class generated by the avro-maven-plugin, one which
>> > respects the documentation regarding schema evolution. Currently we're
>> > using the following deserializer and serializer, and they do not work
>> > when it comes to schema evolution. What is the correct way to serialize
>> > and deserialize Avro data?
>> >
>> > I probably don't understand your point about GenericRecord or
>> > GenericDatumReader. I tried to use GenericDatumReader in the
>> > deserializer below, but then it seems I got back just a
>> > GenericData$Record instance, which I can then use to access an array of
>> > instances. That is not what I'm looking for (IIUC), since in that case
>> > I could have just used plain old JSON and deserialized it with Jackson,
>> > having no schema evolution problems at all. If that's correct, I'd
>> > rather stick to SpecificDatumReader and somehow fix it if possible.
>> >
>> > What can be done? Or how is schema evolution intended to be used? I
>> > found a lot of questions while searching for this answer.
>> >
>> > thanks!
>> > Martin.
>> >
>> > deserializer:
>> >
>> > public static <T extends SpecificRecordBase> T deserialize(Class<T> targetType,
>> >                                                             byte[] data,
>> >                                                             boolean useBinaryDecoder) {
>> >     try {
>> >         if (data == null) {
>> >             return null;
>> >         }
>> >
>> >         log.trace("data='{}'", DatatypeConverter.printHexBinary(data));
>> >
>> >         Schema schema = targetType.newInstance().getSchema();
>> >         DatumReader<GenericRecord> datumReader = new SpecificDatumReader<>(schema);
>> >         Decoder decoder = useBinaryDecoder
>> >                 ? DecoderFactory.get().binaryDecoder(data, null)
>> >                 : DecoderFactory.get().jsonDecoder(schema, new String(data));
>> >
>> >         T result = targetType.cast(datumReader.read(null, decoder));
>> >         log.trace("deserialized data='{}'", result);
>> >         return result;
>> >     } catch (Exception ex) {
>> >         throw new SerializationException("Error deserializing data", ex);
>> >     }
>> > }
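>> >
>> > For comparison, a sketch of a variant that also takes the writer schema
>> > (the extra writerSchema parameter is hypothetical, just to show where a
>> > two-schema SpecificDatumReader would fit):
>> >
>> > public static <T extends SpecificRecordBase> T deserialize(Class<T> targetType,
>> >                                                             Schema writerSchema,
>> >                                                             byte[] data) {
>> >     try {
>> >         if (data == null) {
>> >             return null;
>> >         }
>> >         Schema readerSchema = targetType.newInstance().getSchema();
>> >         // resolves writerSchema -> readerSchema, i.e. schema evolution
>> >         DatumReader<T> datumReader = new SpecificDatumReader<>(writerSchema, readerSchema);
>> >         Decoder decoder = DecoderFactory.get().binaryDecoder(data, null);
>> >         return datumReader.read(null, decoder);
>> >     } catch (Exception ex) {
>> >         throw new SerializationException("Error deserializing data", ex);
>> >     }
>> > }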
>> >
>> > serializer:
>> > public static <T extends SpecificRecordBase> byte[] serialize(T data,
>> >                                                                boolean useBinaryDecoder,
>> >                                                                boolean pretty) {
>> >     try {
>> >         if (data == null) {
>> >             return new byte[0];
>> >         }
>> >
>> >         log.debug("data='{}'", data);
>> >         Schema schema = data.getSchema();
>> >         ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
>> >         Encoder binaryEncoder = useBinaryDecoder
>> >                 ? EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null)
>> >                 : EncoderFactory.get().jsonEncoder(schema, byteArrayOutputStream, pretty);
>> >
>> >         DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
>> >         datumWriter.write(data, binaryEncoder);
>> >
>> >         binaryEncoder.flush();
>> >         byteArrayOutputStream.close();
>> >
>> >         byte[] result = byteArrayOutputStream.toByteArray();
>> >         log.debug("serialized data='{}'", DatatypeConverter.printHexBinary(result));
>> >         return result;
>> >     } catch (IOException ex) {
>> >         throw new SerializationException("Can't serialize data='" + data, ex);
>> >     }
>> > }
>> >
>> > On Tue, Jul 30, 2019 at 1:48 PM Ryan Skraba <r...@skraba.com> wrote:
>> >>
>> >> Hello!  Schema evolution relies on both the writer and reader schemas
>> >> being available.
>> >>
>> >> It looks like the allegro tool you are using is using the
>> >> GenericDatumReader that assumes the reader and writer schema are the
>> >> same:
>> >>
>> >>
>> >> https://github.com/allegro/json-avro-converter/blob/json-avro-converter-0.2.8/converter/src/main/java/tech/allegro/schema/json2avro/converter/JsonAvroConverter.java#L83
>> >>
>> >> I do not believe that the "default" value is taken into account for
>> >> data that is simply missing from the binary input; it only applies
>> >> when a field is known to be in the reader schema but missing from the
>> >> original writer schema.
>> >>
>> >> You may have more luck reading the GenericRecord with a
>> >> GenericDatumReader with both schemas, and using the
>> >> `convertToJson(record)`.
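>> >>
>> >> Roughly along these lines (a sketch only; the schema variables are
>> >> placeholders for the actual writer and reader schemas):
>> >>
>> >> GenericDatumReader<GenericRecord> reader =
>> >>     new GenericDatumReader<>(writerSchema, readerSchema);
>> >> Decoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);
>> >> GenericRecord record = reader.read(null, decoder);
>> >> byte[] json = new JsonAvroConverter().convertToJson(record);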
>> >>
>> >> I hope this is useful -- Ryan
>> >>
>> >>
>> >>
>> >> On Tue, Jul 30, 2019 at 10:20 AM Martin Mucha <alfon...@gmail.com> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I've got some issues with, or a misunderstanding of, Avro schema evolution.
>> >> >
>> >> > When reading through the Avro documentation, for example [1], I
>> >> > understood that schema evolution is supported, and that if I add a
>> >> > column with a specified default, it should be backwards compatible
>> >> > (and even forward compatible when I remove it again). Sounds great,
>> >> > so I added a column defined as:
>> >> >
>> >> >         {
>> >> >           "name": "newColumn",
>> >> >           "type": ["null","string"],
>> >> >           "default": null,
>> >> >           "doc": "something wrong"
>> >> >         }
>> >> >
>> >> > and tried to consume a topic having this schema from the beginning;
>> >> > it fails with the message:
>> >> >
>> >> > Caused by: java.lang.ArrayIndexOutOfBoundsException: 5
>> >> >     at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:424)
>> >> >     at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
>> >> >     at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>> >> >     at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
>> >> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>> >> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
>> >> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>> >> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
>> >> >     at tech.allegro.schema.json2avro.converter.JsonAvroConverter.convertToJson(JsonAvroConverter.java:83)
>> >> > To give a little bit more information: the Avro schema defines one
>> >> > top-level type having 2 fields, a string describing the type of the
>> >> > message, and a union of N types (see the sketch below). All N-1
>> >> > non-modified types can be read, but the one updated with the
>> >> > optional, default-having column cannot be read. I'm not sure if this
>> >> > design is strictly speaking correct, but that's not the point (feel
>> >> > free to criticise and recommend a better approach!). I'm after schema
>> >> > evolution, which seems not to be working.
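>> >> >
>> >> > To illustrate the shape of the schema (names here are made up):
>> >> >
>> >> > {
>> >> >   "type": "record",
>> >> >   "name": "Envelope",
>> >> >   "fields": [
>> >> >     {"name": "messageType", "type": "string"},
>> >> >     {"name": "payload", "type": [
>> >> >       {"type": "record", "name": "TypeA",
>> >> >        "fields": [{"name": "id", "type": "int"}]},
>> >> >       {"type": "record", "name": "TypeB",
>> >> >        "fields": [{"name": "newColumn", "type": ["null", "string"], "default": null}]}
>> >> >     ]}
>> >> >   ]
>> >> > }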
>> >> >
>> >> >
>> >> > And if we alter the type definition to:
>> >> >
>> >> > "type": "string",
>> >> > "default": ""
>> >> > it still does not work, and the generated error is:
>> >> >
>> >> > Caused by: org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -1
>> >> >     at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
>> >> >     at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
>> >> >     at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:414)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:181)
>> >> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>> >> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
>> >> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
>> >> >     at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>> >> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>> >> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
>> >> >     at tech.allegro.schema.json2avro.converter.JsonAvroConverter.convertToJson(JsonAvroConverter.java:83)
>> >> >
>> >> > Am I doing something wrong?
>> >> >
>> >> > thanks,
>> >> > Martin.
>> >> >
>> >> > [1] https://docs.oracle.com/database/nosql-12.1.3.4/GettingStartedGuide/schemaevolution.html#changeschema-rules
>>
>
