Re: AVRO schema evolution: adding optional column with default fails deserialization

2019-07-30 Thread Martin Mucha
Thank you very much for the in-depth answer. I understand better now how it
works, and will test it shortly.
Thank you for your time.

Martin.

Re: AVRO schema evolution: adding optional column with default fails deserialization

2019-07-30 Thread Ryan Skraba
Hello!  It's the same issue in your example code as with allegro, even with
the SpecificDatumReader.

This line: datumReader = new SpecificDatumReader<>(schema)
should be: datumReader = new SpecificDatumReader<>(originalSchema, schema)

In Avro, the original schema is commonly known as the writer schema
(the instance that originally wrote the binary data).  Schema
evolution applies when you are using the constructor of the
SpecificDatumReader that takes *both* reader and writer schemas.

As a concrete example, if your original schema was:

{
  "type": "record",
  "name": "Simple",
  "fields": [
{"name": "id", "type": "int"},
{"name": "name","type": "string"}
  ]
}

And you added a field:

{
  "type": "record",
  "name": "SimpleV2",
  "fields": [
{"name": "id", "type": "int"},
{"name": "name", "type": "string"},
{"name": "description","type": ["null", "string"]}
  ]
}

You could do the following safely, assuming that Simple and SimpleV2
classes are generated from the avro-maven-plugin:

@Test
public void testSerializeDeserializeEvolution() throws IOException {
  // Write a Simple v1 to bytes using your exact method.
  byte[] v1AsBytes = serialize(new Simple(1, "name1"), true, false);

  // Read as a SimpleV2, same as your method but with both the writer and reader schema.
  DatumReader<SimpleV2> datumReader =
      new SpecificDatumReader<>(Simple.getClassSchema(), SimpleV2.getClassSchema());
  Decoder decoder = DecoderFactory.get().binaryDecoder(v1AsBytes, null);
  SimpleV2 v2 = datumReader.read(null, decoder);

  assertThat(v2.getId(), is(1));
  assertThat(v2.getName(), is(new Utf8("name1")));
  assertThat(v2.getDescription(), nullValue());
}

This example demonstrates the idea with two different schemas and SpecificRecords in
the same test, but the same principle applies if it's the same record
that has evolved -- you need to know the original schema that wrote
the data in order to apply the schema that you're now using for
reading.
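
For what it's worth, here is a minimal sketch of how the deserialize(...)
helper from your earlier mail could be adapted. The writerSchema parameter
is my addition (however the original schema reaches the caller is up to
you), and the generic bound and SerializationException are taken from your
helper, not from Avro itself:

public static <T extends SpecificRecordBase> T deserialize(Class<T> targetType,
                                                            Schema writerSchema,
                                                            byte[] data,
                                                            boolean useBinaryDecoder) {
    try {
        if (data == null) {
            return null;
        }
        // Reader schema comes from the (possibly newer) generated class.
        Schema readerSchema = targetType.newInstance().getSchema();
        // Both schemas: writer first, reader second, so resolution can apply defaults.
        DatumReader<T> datumReader = new SpecificDatumReader<>(writerSchema, readerSchema);
        Decoder decoder = useBinaryDecoder
                ? DecoderFactory.get().binaryDecoder(data, null)
                : DecoderFactory.get().jsonDecoder(writerSchema, new String(data));
        return targetType.cast(datumReader.read(null, decoder));
    } catch (Exception ex) {
        throw new SerializationException("Error deserializing data", ex);
    }
}

The writer schema typically comes from a schema registry or is stored
alongside the payload; raw binary Avro alone does not carry it.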

I hope this clarifies what you are looking for!

All my best, Ryan




Re: Reg: Avrojob schema validation option.

2019-07-30 Thread SB M
Hi,

What I mean is that when parsing a schema with Schema.Parser, we can set
name validation to false using the setValidate method.

But using AvroJob there is no option to set this validation to false.

I want this validation option so that I can disable the parser's name
validation in my code.

What I am trying to achieve: I have an Avro schema with a sub-structure
whose name contains dot-separated segments (.), which is not a valid name
when the schema gets parsed; it throws an error.

But there is an option to set this name validation to false while parsing,
by using new Schema.Parser().setValidate(false), as in the sketch below.
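
For example (just a sketch; schemaJson stands in for the actual schema text):

// Disable name validation on the parser so dotted names are accepted.
Schema schema = new Schema.Parser().setValidate(false).parse(schemaJson);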

But with AvroJob there is no option to set validation. I need this feature.

Thanks,
Sree




Re: AVRO schema evolution: adding optional column with default fails deserialization

2019-07-30 Thread Martin Mucha
Thanks for the answer.

Actually I have exactly the same behavior with Avro 1.9.0 and the following
deserializer in our other app, which uses strictly the Avro codebase, and it
fails with the same exceptions. So let's leave the "allegro" library and lots
of other tools out of our discussion.
I can use whichever approach. All I need is a single way to deserialize
byte[] into a class generated by avro-maven-plugin, and which respects the
documentation regarding schema evolution. Currently we're using the
following deserializer and serializer, and these do not work when it comes
to schema evolution. What is the correct way to serialize and deserialize
Avro data?

I probably don't understand your mention of GenericRecord or
GenericDatumReader. I tried to use GenericDatumReader in the deserializer
below, but then it seems I got back just a GenericData$Record instance,
which I can then use to access an array of instances, which is not what I'm
looking for (IIUC), since in that case I could have just used plain old JSON
and deserialized it using Jackson, having no schema evolution problems at
all. If that's correct, I'd rather stick to SpecificDatumReader and somehow
fix it if possible.

What can be done? Or how is schema evolution intended to be used? I found a
lot of questions searching for this answer.

thanks!
Martin.

deserializer:

public static <T extends SpecificRecordBase> T deserialize(Class<T> targetType,
                                                            byte[] data,
                                                            boolean useBinaryDecoder) {
    try {
        if (data == null) {
            return null;
        }

        log.trace("data='{}'", DatatypeConverter.printHexBinary(data));

        Schema schema = targetType.newInstance().getSchema();
        DatumReader<T> datumReader = new SpecificDatumReader<>(schema);
        Decoder decoder = useBinaryDecoder
                ? DecoderFactory.get().binaryDecoder(data, null)
                : DecoderFactory.get().jsonDecoder(schema, new String(data));

        T result = targetType.cast(datumReader.read(null, decoder));
        log.trace("deserialized data='{}'", result);
        return result;
    } catch (Exception ex) {
        throw new SerializationException("Error deserializing data", ex);
    }
}

serializer:

public static <T extends SpecificRecordBase> byte[] serialize(T data,
                                                               boolean useBinaryDecoder,
                                                               boolean pretty) {
    try {
        if (data == null) {
            return new byte[0];
        }

        log.debug("data='{}'", data);
        Schema schema = data.getSchema();
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        Encoder binaryEncoder = useBinaryDecoder
                ? EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null)
                : EncoderFactory.get().jsonEncoder(schema, byteArrayOutputStream, pretty);

        DatumWriter<T> datumWriter = new GenericDatumWriter<>(schema);
        datumWriter.write(data, binaryEncoder);

        binaryEncoder.flush();
        byteArrayOutputStream.close();

        byte[] result = byteArrayOutputStream.toByteArray();
        log.debug("serialized data='{}'", DatatypeConverter.printHexBinary(result));
        return result;
    } catch (IOException ex) {
        throw new SerializationException("Can't serialize data='" + data, ex);
    }
}


Re: AVRO schema evolution: adding optional column with default fails deserialization

2019-07-30 Thread Ryan Skraba
Hello!  Schema evolution relies on both the writer and reader schemas
being available.

It looks like the allegro tool you are using uses a GenericDatumReader
constructed with a single schema, which assumes the reader and writer
schemas are the same:

https://github.com/allegro/json-avro-converter/blob/json-avro-converter-0.2.8/converter/src/main/java/tech/allegro/schema/json2avro/converter/JsonAvroConverter.java#L83

I do not believe that the "default" value is taken into account for data
that is simply missing from the binary input; it is only applied when a
field is known to be in the reader schema but missing from the writer
schema.

You may have more luck reading the GenericRecord with a
GenericDatumReader with both schemas, and using the
`convertToJson(record)`.
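
Roughly along these lines (a sketch only -- writerSchema, readerSchema, the
avroBytes payload and the converter instance are placeholders, not code from
the allegro project):

// Resolve bytes written with the old (writer) schema against the new (reader) schema.
GenericDatumReader<GenericRecord> reader =
        new GenericDatumReader<>(writerSchema, readerSchema);
Decoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);
GenericRecord record = reader.read(null, decoder);
// Then hand the resolved record to the converter.
byte[] json = converter.convertToJson(record);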

I hope this is useful -- Ryan



On Tue, Jul 30, 2019 at 10:20 AM Martin Mucha  wrote:
>
> Hi,
>
> I've got some issues with / some misunderstanding of Avro schema evolution.
>
> When reading through the Avro documentation, for example [1], I understood
> that schema evolution is supported, and that if I add a column with a
> specified default, it should be backwards compatible (and even forward
> compatible when I remove it again). Sounds great, so I added a column
> defined as:
>
> {
>   "name": "newColumn",
>   "type": ["null","string"],
>   "default": null,
>   "doc": "something wrong"
> }
>
> and tried to consume some topic having this schema from the beginning; it
> fails with this message:
>
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 5
> at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:424)
> at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
> at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
> at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
> at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
> at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
> at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
> at tech.allegro.schema.json2avro.converter.JsonAvroConverter.convertToJson(JsonAvroConverter.java:83)
>
> To give a little bit more information: the Avro schema defines one
> top-level type having 2 fields: a string describing the type of the
> message, and a union of N types. All N-1 non-modified types can be read,
> but the one updated with the optional, default-having column cannot be
> read. I'm not sure if this design is strictly speaking correct, but that's
> not the point (feel free to criticise and recommend a better approach!).
> I'm after schema evolution, which seems not to be working.
>
>
> And if we alter the type definition to:
>
> "type": "string",
> "default": ""
>
> it still does not work, and the generated error is:
>
> Caused by: org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -1
> at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
> at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
> at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
> at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
> at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:414)
> at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:181)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
> at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
> at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
> at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)

Re: Reg: Avrojob schema validation option.

2019-07-30 Thread Ryan Skraba
Hello!  I'm not sure I understand your question.  Some names are
*required* with a specific format in the Avro specification
(http://avro.apache.org/docs/1.8.2/spec.html#names)

What are you looking to accomplish?  I can think of two scenarios that
we've seen in the past: (1) anonymous records where the name has no
interest, and (2) mapping a structure that supports arbitrary UTF-8
names (like a database table) to a record with the same field names.
Neither of those is supported in the Avro specification.

For the first case (where we don't care about the record name), we
just autogenerated a "safe" but unused record name.

For the second case, we used a custom annotation on the field
(something like "display.name") to contain the original value and
generated a "safe" field name.

In both cases, being safe means that it meets the Avro spec
([A-Za-z_][A-Za-z0-9_]*) and avoids collisions with other generated
names.
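
For the second case, the idea looks roughly like this (a sketch only; the
sanitizing rule and the "display.name" property are our own convention, not
part of the Avro API or spec):

// Keep the original, non-conforming name as a custom property and use a
// spec-compliant name for the actual Avro field.
String original = "order.total.amount";                    // hypothetical source column
String safe = original.replaceAll("[^A-Za-z0-9_]", "_");   // -> order_total_amount
if (!safe.matches("[A-Za-z_].*")) {
    safe = "_" + safe;                                      // names must not start with a digit
}
Schema.Field field =
        new Schema.Field(safe, Schema.create(Schema.Type.STRING), null, (Object) null);
field.addProp("display.name", original);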

I hope this helps!  Ryan

On Fri, Jul 26, 2019 at 1:52 PM SB M  wrote:
>
> Hi All,
>
>  Problem: I need an option to set name validation for schema parsing when
> setting up jobs with AvroJob and AvroMultipleInputs.
>
> Is there currently any way to set schema name validation to false? When I
> go through the source code I am not able to find any option like that.
>
> Please suggest a solution.
>
> Regards,
> Sree.
>