Re: how to configure not using Utf8 in avro-maven-plugin generate-sources
Thanks, looks nice. I need to dig (much) more into aspect-oriented programming myself so that I can use this stuff. To update on the issue: the configuration [1] of avro-maven-plugin actually works; what I was facing was an Avro version incompatibility (between minor versions). After resolving that, basic types are correctly generated as String and correctly deserialized. However, what is not correctly handled (and I'm sure about it this time) is arrays. A field definition like:

    {
      "name": "codes",
      "type": ["null", {
        "type": "array",
        "name": "codeArray",
        "items": { "type": "string" }
      }],
      "default": null
    }

will be cast in the put method to List, but the individual items of that array will be instances of Utf8. I'm writing you this because your aspect does not handle this situation.

Thanks for your help!
M.

[1] <stringType>String</stringType> <fieldVisibility>PRIVATE</fieldVisibility>

po 21. 6. 2021 v 18:53 odesílatel Chad Preisler napsal:

> I created a point cut to work around this issue.
>
> @Aspect
> public class SpecificRecordBasePutPointCut {
>
>     public static final Logger LOGGER =
>         LoggerFactory.getLogger(SpecificRecordBasePutPointCut.class);
>
>     @Pointcut("execution(* your.package.for.generated.code.*.put(int, java.lang.Object)) && args(i, value)")
>     void put(int i, java.lang.Object value) {}
>
>     @Around("put(i, value)")
>     public Object anyPutCall(ProceedingJoinPoint thisJoinPoint, int i, java.lang.Object value) throws Throwable {
>         if (value != null) {
>             LOGGER.debug("Value type is " + value.getClass().getName());
>         }
>         if (value instanceof Utf8 || value instanceof CharSequence) {
>             LOGGER.debug("In toString for i " + i);
>             value = value.toString();
>         }
>         LOGGER.debug("returning the i " + i);
>         return thisJoinPoint.proceed(new Object[]{i, value});
>     }
> }
>
> On Mon, Jun 21, 2021 at 4:56 AM Martin Mucha wrote:
>
>> update after some research.
>>
>> It seems that the configuration excerpt from my first mail actually works.
>> The `@AvroGenerated` class will declare String fields and try to cast
>> Utf8 to String.
>>
>> The generated code looks like this:
>>
>> public void put(int field$, Object value$) {
>>     switch(field$) {
>>     case 0:
>>         this.someField = (String)value$;
>>
>> which is indeed incorrect.
>>
>> According to:
>>
>> https://issues.apache.org/jira/browse/AVRO-2702
>>
>> this should be solved in 1.10 (which it is not, the incorrect code is still
>> generated). And if someone (like myself) is bound to 1.9.2 because of
>> Confluent, there is no fix for this minor version branch at all. There are
>> some workarounds, but they do not cover all use cases, and for me the
>> situation has the only solution of just accepting the Avro team's decision that
>> Utf8 is my favorite type now, and I have to fix a gazillion places in a rather
>> huge project, which is just awesome.
>>
>> po 21. 6. 2021 v 11:17 odesílatel Martin Mucha
>> napsal:
>>
>>> It seems that the transition 1.8.2->1.9.2 brings a backwards incompatibility,
>>> and
>>>
>>> <stringType>String</stringType>
>>>
>>> which did work to change generation from CharSequence to String, does
>>> not work any more. Within a 15-minute search I'm unable to find literally
>>> any documentation of this plugin, so I don't know if there is some new way
>>> to configure it for avro 1.9.2 and newer.
>>>
>>> Can someone advise?
>>> Thanks.
>>
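To illustrate the array case raised above (list items staying as Utf8): the conversion the aspect would also need is a recursive pass over list values. A stdlib-only sketch of that idea, using StringBuilder as a stand-in for Avro's Utf8 (both are CharSequences); the class and method names here are mine, not from the thread:

```java
import java.util.ArrayList;
import java.util.List;

public class CharSeqToString {
    // Replace any non-String CharSequence (e.g. Avro's Utf8) with String,
    // descending into Lists so array items get converted too.
    static Object deepToString(Object value) {
        if (value instanceof CharSequence) {
            return value.toString();
        }
        if (value instanceof List) {
            List<Object> converted = new ArrayList<>();
            for (Object item : (List<?>) value) {
                converted.add(deepToString(item));
            }
            return converted;
        }
        return value;
    }

    public static void main(String[] args) {
        List<Object> items = new ArrayList<>();
        items.add(new StringBuilder("a")); // stand-in for new Utf8("a")
        items.add("b");
        System.out.println(deepToString(items)); // prints [a, b]
    }
}
```

In the aspect, something like this would run on `value` before `thisJoinPoint.proceed(...)`, instead of the plain `toString()` call that only handles scalar fields.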
Re: how to configure not using Utf8 in avro-maven-plugin generate-sources
update after some research.

It seems that the configuration excerpt from my first mail actually works. The `@AvroGenerated` class will declare String fields and try to cast Utf8 to String. The generated code looks like this:

    public void put(int field$, Object value$) {
        switch(field$) {
        case 0:
            this.someField = (String)value$;

which is indeed incorrect.

According to:

https://issues.apache.org/jira/browse/AVRO-2702

this should be solved in 1.10 (which it is not, the incorrect code is still generated). And if someone (like myself) is bound to 1.9.2 because of Confluent, there is no fix for this minor version branch at all. There are some workarounds, but they do not cover all use cases, and for me the situation has the only solution of just accepting the Avro team's decision that Utf8 is my favorite type now, and I have to fix a gazillion places in a rather huge project, which is just awesome.

po 21. 6. 2021 v 11:17 odesílatel Martin Mucha napsal:

> It seems that the transition 1.8.2->1.9.2 brings a backwards incompatibility, and
>
> <stringType>String</stringType>
>
> which did work to change generation from CharSequence to String, does not
> work any more. Within a 15-minute search I'm unable to find literally any
> documentation of this plugin, so I don't know if there is some new way how
> to configure it for avro 1.9.2 and newer.
>
> Can someone advise?
> Thanks.
>
how to configure not using Utf8 in avro-maven-plugin generate-sources
It seems that the transition 1.8.2->1.9.2 brings a backwards incompatibility, and

    <stringType>String</stringType>

which did work to change generation from CharSequence to String, does not work any more. Within a 15-minute search I'm unable to find literally any documentation of this plugin, so I don't know if there is some new way to configure it for avro 1.9.2 and newer. Can someone advise? Thanks.
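For anyone landing here later: the configuration being referred to is the avro-maven-plugin `stringType` parameter in pom.xml. A sketch of the relevant plugin section (the version number and phase here are illustrative, not from the thread):

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.9.2</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <!-- Generate java.lang.String instead of CharSequence/Utf8 -->
        <stringType>String</stringType>
      </configuration>
    </execution>
  </executions>
</plugin>
```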
Re: Recommended naming of types to support schema evolution
ronments, but tbh it hasn't caused any _real_
> problems yet, but it's something I would consider approaching with a global
> registry (fed by my CI system?) in the future.
>
>
>> ~ I really don't know how this works/should work, as there are close to no
>> complete actual examples and the documentation does not help much. For example,
>> if an avro schema evolves from v1 to v2, and the type names and namespaces
>> aren't the same, how will the pairing between fields be made? Completely
>> puzzling. I need no less than schema evolution with backward and forward
>> compatibility with schema reuse (i.e. no hacks with a top-level union, but
>> schema reuse using schema imports). I think I can hack my way through, by
>> using one parser per set of 1 schema of a given version and all needed
>> imports, which will make everything work (well, I don't yet know about
>> anything which will fail), but it completely does not feel right. And I
>> would like to know what the correct Avro way is. And I suppose it should be
>> possible without the Confluent schema registry, just with single object
>> encoding, as I cannot see any difference between them, but please correct me
>> if I'm wrong.
>>
>
> You lost me here, I think you're maybe crossing some vocabulary from your
> language stack, not from Avro per se, but I'm coming at Avro from Ruby and
> Node (yikes.) and have never used any JVM language integration, so assume
> this is ignorance on my part.
>
> Maybe it'd help to know what "evolution" you plan, and what type names and
> name schemas you plan to be changing? The "schema evolution" is mostly
> meant to make it easier to add and remove fields from the schemas without
> having to coordinate deploys and juggle iron-clad contract interchange
> formats. It's not meant for wild rewrites of the contract IDLs on active
> running services!
>
> All the best for 2020, anyone else who happens to be reading mailing list
> emails this NYE!
>
>
>> thanks,
>> Mar.
>>
>> po 30. 12.
2019 v 20:32 odesílatel Lee Hambley
>> napsal:
>>
>>> Hi Martin,
>>>
>>> I believe the answer is "just use the schema registry". When you then
>>> encode for the network, your library should give you a binary package with a
>>> 5 byte header that includes the schema version and name from the registry.
>>> The reader will then go to the registry, find that schema at that
>>> version, and use it for decoding.
>>>
>>> In my experience the naming/etc doesn't matter, only things like
>>> defaults in enums and things need to be given a thought, but you'll see
>>> that for yourself with experience.
>>>
>>> HTH, Regards,
>>>
>>> Lee Hambley
>>> http://lee.hambley.name/
>>> +49 (0) 170 298 5667
>>>
>>>
>>> On Mon, 30 Dec 2019 at 17:26, Martin Mucha wrote:
>>>
>>>> Hi,
>>>> I'm relatively new to avro, and I'm still struggling with getting
>>>> schema evolution and related issues. But today it should be a simple
>>>> question.
>>>>
>>>> What is the recommended naming of types if we want to use schema evolution?
>>>> Should the namespace contain some information about the version of the schema?
>>>> Or should it be in the type itself? Or neither? What is the best practice? Is
>>>> evolution even possible if the namespace/type name is different?
>>>>
>>>> I thought that "neither" is the case, built the app so that the version
>>>> ID is nowhere except for the directory structure, only the latest version is
>>>> compiled to java classes using the maven plugin, and parsed all other avsc
>>>> files in code (to be able to build some sort of schema registry, identify
>>>> the used writer schema using single object encoding, and use schema evolution).
>>>> However, I used a separate Parser instance to parse each schema. But if one
>>>> would like to use schema imports, he cannot have a separate parser for every
>>>> schema, and having a global one in this setup is also not possible, as each
>>>> type can be registered just once in org.apache.avro.Schema.Names. Btw. I
>>>> favored this variant (i.e.
no ID in name/namespace) because in this setup,
>>>> after I introduce a new schema version, I do not have to change imports in
>>>> the whole project, but just one line in pom.xml saying which directory should
>>>> be compiled into java files.
>>>>
>>>> So what would be a suggested naming/versioning scheme?
>>>> thanks,
>>>> M.
>>>>
>>>
Re: Recommended naming of types to support schema evolution
Hi,

thanks for the answer. I don't understand avro sufficiently and don't know the schema registry at all, actually. So maybe the following questions will be dumb.

a) how is the schema registry with its 5-byte header different from single object encoding with its 10-byte header?

b) will the schema registry somehow relieve me from having to parse individual schemas? What if I want to/have to send 2 different versions of a certain schema?

c) actually what I have here is (seemingly) a pretty similar setup (and btw, one which was recommended here as an alternative to the Confluent schema registry): it's a registry without an extra service. A trivial map keyed by the single-object-encoding schema fingerprint (a long), pairing fingerprint to schema. So when the bytes "arrive" I can easily read the header, find out the fingerprint, get hold of the schema and decode it. Trivial. But the snag is that a single Schema.Names instance can contain just one Name of a given "identity", and equality is based on the fully qualified type, i.e. namespace and name. Thus if you have a schema in 2 versions which have the same namespace and name, they cannot be parsed using the same Parser. Does the schema registry (from the Confluent platform, right?) work differently from this? Does this "use it for decoding" process bypass avro's new Schema.Parser().parse and everything beneath it?

~ I really don't know how this works/should work, as there are close to no complete actual examples and the documentation does not help much. For example, if an avro schema evolves from v1 to v2, and the type names and namespaces aren't the same, how will the pairing between fields be made? Completely puzzling. I need no less than schema evolution with backward and forward compatibility with schema reuse (i.e. no hacks with a top-level union, but schema reuse using schema imports).
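On question a): per the Avro spec, the 10-byte single-object header is just a 2-byte marker (0xC3 0x01) followed by the 8-byte little-endian CRC-64-AVRO fingerprint of the writer schema. A stdlib-only sketch of extracting the fingerprint that would key the map described above (class name is mine):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SingleObjectHeader {
    // Avro single-object encoding: 0xC3 0x01 marker, 8-byte little-endian
    // schema fingerprint, then the binary-encoded datum.
    static long readFingerprint(byte[] message) {
        if (message.length < 10
                || message[0] != (byte) 0xC3 || message[1] != (byte) 0x01) {
            throw new IllegalArgumentException("not Avro single-object encoding");
        }
        return ByteBuffer.wrap(message, 2, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        // Marker, fingerprint 42 (little-endian), one payload byte.
        byte[] msg = {(byte) 0xC3, 0x01, 0x2A, 0, 0, 0, 0, 0, 0, 0, 0x7F};
        System.out.println(readFingerprint(msg)); // 42
    }
}
```

The returned long is then the lookup key into the fingerprint-to-schema map; the datum itself starts at offset 10.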
I think I can hack my way through by using one parser per set of 1 schema of a given version and all needed imports, which will make everything work (well, I don't yet know about anything which will fail), but it completely does not feel right. And I would like to know what the correct Avro way is. And I suppose it should be possible without the Confluent schema registry, just with single object encoding, as I cannot see any difference between them, but please correct me if I'm wrong.

thanks,
Mar.

po 30. 12. 2019 v 20:32 odesílatel Lee Hambley napsal:

> Hi Martin,
>
> I believe the answer is "just use the schema registry". When you then
> encode for the network, your library should give you a binary package with a
> 5 byte header that includes the schema version and name from the registry.
> The reader will then go to the registry, find that schema at that version,
> and use it for decoding.
>
> In my experience the naming/etc doesn't matter, only things like defaults
> in enums and things need to be given a thought, but you'll see that for
> yourself with experience.
>
> HTH, Regards,
>
> Lee Hambley
> http://lee.hambley.name/
> +49 (0) 170 298 5667
>
>
> On Mon, 30 Dec 2019 at 17:26, Martin Mucha wrote:
>
>> Hi,
>> I'm relatively new to avro, and I'm still struggling with getting schema
>> evolution and related issues. But today it should be a simple question.
>>
>> What is the recommended naming of types if we want to use schema evolution?
>> Should the namespace contain some information about the version of the schema?
>> Or should it be in the type itself? Or neither? What is the best practice? Is
>> evolution even possible if the namespace/type name is different?
>>
>> I thought that "neither" is the case, built the app so that the version ID
>> is nowhere except for the directory structure, only the latest version is
>> compiled to java classes using the maven plugin, and parsed all other avsc
>> files in code (to be able to build some sort of schema registry, identify
>> the used writer schema using single object encoding, and use schema evolution).
>> However, I used a separate Parser instance to parse each schema. But if one
>> would like to use schema imports, he cannot have a separate parser for every
>> schema, and having a global one in this setup is also not possible, as each
>> type can be registered just once in org.apache.avro.Schema.Names. Btw. I
>> favored this variant (i.e. no ID in name/namespace) because in this setup,
>> after I introduce a new schema version, I do not have to change imports in
>> the whole project, but just one line in pom.xml saying which directory should
>> be compiled into java files.
>>
>> So what would be a suggested naming/versioning scheme?
>> thanks,
>> M.
>>
>
Recommended naming of types to support schema evolution
Hi,
I'm relatively new to avro, and I'm still struggling with getting schema evolution and related issues. But today it should be a simple question.

What is the recommended naming of types if we want to use schema evolution? Should the namespace contain some information about the version of the schema? Or should it be in the type itself? Or neither? What is the best practice? Is evolution even possible if the namespace/type name is different?

I thought that "neither" is the case, built the app so that the version ID is nowhere except for the directory structure, only the latest version is compiled to java classes using the maven plugin, and parsed all other avsc files in code (to be able to build some sort of schema registry, identify the used writer schema using single object encoding, and use schema evolution). However, I used a separate Parser instance to parse each schema. But if one would like to use schema imports, he cannot have a separate parser for every schema, and having a global one in this setup is also not possible, as each type can be registered just once in org.apache.avro.Schema.Names. Btw. I favored this variant (i.e. no ID in name/namespace) because in this setup, after I introduce a new schema version, I do not have to change imports in the whole project, but just one line in pom.xml saying which directory should be compiled into java files.

So what would be a suggested naming/versioning scheme?
thanks,
M.
schema evolution with top-level union.
Hi,

I've encountered weird behavior and have no idea how to fix it. Any suggestions welcomed.

The issue revolves around a union type at the top level, which I personally dislike and consider to be a hack, but I understand the motivation behind it: someone wanted to declare N types within a single avsc file (probably). The drawback is that this thing does not support avro schema evolution (read on). If there is a possibility to reshape that avsc so that multiple types are somehow available at the top level and evolution works, I'm listening. Now the code:

old version of schema:

    {
      "namespace": "test",
      "name": "TestAvro",
      "type": "record",
      "fields": [
        { "name": "a", "type": "string" }
      ]
    }

updated version of schema, to which the former should evolve:

    {
      "namespace": "test",
      "name": "TestAvro",
      "type": "record",
      "fields": [
        { "name": "a", "type": "string" },
        { "name": "b", "type": ["null", "string"], "default": null }
      ]
    }

serialization:

    private <T extends SpecificRecordBase> byte[] serialize(final T data) {
        try (ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
            Encoder binaryEncoder = EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null);
            DatumWriter<T> datumWriter = new SpecificDatumWriter<>(data.getSchema());
            datumWriter.write(data, binaryEncoder);
            binaryEncoder.flush();
            return byteArrayOutputStream.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

deserialization:

    private static <T> T deserializeUsingSchemaEvolution(Class<T> targetType,
                                                         Schema readerSchema,
                                                         Schema writerSchema,
                                                         byte[] data) {
        try {
            if (data == null) {
                return null;
            }
            DatumReader<T> datumReader = new SpecificDatumReader<>(writerSchema, readerSchema);
            Decoder decoder = DecoderFactory.get().binaryDecoder(data, null);
            return targetType.cast(datumReader.read(null, decoder));
        } catch (Exception ex) {
            throw new SerializationException("Error deserializing data", ex);
        }
    }

--> this WORKS. Data will be serialized and deserialized; evolution works as intended.

Now put square brackets into both avsc files, i.e.
add a first and last character to those files, so that the first char in those schemata is [ and ] is the last char. After that, deserialization won't work at all. The errors produced vary wildly, depending on the avro version and schemata. One can encounter simple "cannot be deserialized" errors, Utf8-cannot-be-cast-to-String errors, or even "X cannot be cast to Y" errors, where X and Y are random types from the top-level union, and where such casting makes no sense.

Any suggestions would be greatly appreciated, as I inherited these schemata with top-level unions and really don't have any idea how to make them work.

Thanks,
M.
Re: AVRO schema evolution: adding optional column with default fails deserialization
Thanks for the answer!

Ad "which byte[] are we talking about?" — actually I don't know. Please let's break it down together. I'm pretty sure that we're not using the Confluent platform (iiuc the paid bundle, right?). I shared a serializer before [1], so you're saying that this won't include either the schema ID or the schema, OK? Ok, let's assume that. Next. We're using the Spring Kafka project to get this serialized data and send it over kafka. So we don't have any schema registry, but in principle it could be possible to include the schema within each message. But I cannot see how that could be done. Spring Kafka requires us to provide org.apache.kafka.clients.producer.ProducerConfig#VALUE_SERIALIZER_CLASS_CONFIG, which we did, but it's just a class calling the serializer [1], and from that point on I have no idea how it could figure out the used schema.

The question I'm asking here is whether, when sending avro bytes (obtained by the provided serializer [1]), they are or can be somehow paired with the schema used to serialize the data. Is this what kafka senders do, or can do? Include the ID/whole schema somewhere in headers or ...? And when I read kafka messages, will the schema be (or could it be) stored somewhere in the ConsumerRecord or somewhere like that?

sorry for the confused questions, but I'm really missing the knowledge to even ask properly.
thanks,
Martin.

[1]

    public static <T extends GenericContainer> byte[] serialize(T data, boolean useBinaryDecoder, boolean pretty) {
        try {
            if (data == null) {
                return new byte[0];
            }
            log.debug("data='{}'", data);
            Schema schema = data.getSchema();
            ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
            Encoder binaryEncoder = useBinaryDecoder
                    ? EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null)
                    : EncoderFactory.get().jsonEncoder(schema, byteArrayOutputStream, pretty);
            DatumWriter<T> datumWriter = new GenericDatumWriter<>(schema);
            datumWriter.write(data, binaryEncoder);
            binaryEncoder.flush();
            byteArrayOutputStream.close();
            byte[] result = byteArrayOutputStream.toByteArray();
            log.debug("serialized data='{}'", DatatypeConverter.printHexBinary(result));
            return result;
        } catch (IOException ex) {
            throw new SerializationException("Can't serialize data='" + data, ex);
        }
    }

čt 1. 8. 2019 v 17:06 odesílatel Svante Karlsson napsal:

> For clarity: What byte[] are we talking about?
>
> You are slightly missing my point if we are speaking about kafka.
>
> Confluent encoding:
>
> [0][schema_id][avro_binary_payload]
>
> The avro_binary_payload does not in any case contain the schema or schema id.
> The schema id is a confluent thing. (in an avro file the schema is prepended
> by value in the file)
>
> While it's trivial to build a schema registry that for example instead
> gives you an md5 hash of the schema, you have to use it throughout your
> infrastructure OR use known reader and writer schemas (i.e. hardcoded).
>
> In the confluent world, id=N is the N+1'th registered schema in the
> database (a kafka topic), if I remember right. Lose that database and you
> cannot read your kafka topics.
>
> So you have to use some other encoder, homegrown or not, that embeds either
> the full schema in every message (expensive) or some id. Does this make
> sense?
>
> /svante
>
>
> Den tors 1 aug. 2019 kl 16:38 skrev Martin Mucha :
>
>> Thanks for the answer.
>>
>> What I already knew is that in each message there is _somehow_ present
>> either _some_ schema ID or the full schema. I saw some byte array manipulations
>> to get a _somehow_ defined schema ID from byte[], which worked, but that's
>> definitely not how it should be done. What I'm looking for is some
>> documentation of _how_ to do these things right.
>> I really cannot find a
>> single thing, yet there must be some util functions, or anything. Is there
>> some devel-first-steps page where I can find answers for:
>>
>> * How to test whether byte[] contains the full schema or just an id?
>> * How to control whether a message is serialized with the ID or with the
>> full schema?
>> * how to get the ID from byte[]?
>> * how to get the full schema from byte[]?
>>
>> I don't have the confluent platform, and cannot have it, but implementing
>> "get schema by ID" should be an easy task, provided that I have that ID. In
>> my scenario I know that the message will be written using one schema, just
>> different versions of it. So I just need to know which version it is, so
>> that I can configure the deserializer to enable schema evolution.
>>
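On the first bullet above ("how to test whether byte[] contains the full schema or just an id"): there is no general answer for raw binary Avro, but the two common framings are recognizable by their leading bytes — single-object encoding starts with 0xC3 0x01, and the Confluent wire format starts with a 0x00 magic byte followed by a 4-byte big-endian schema id. A heuristic sketch (names are mine; this assumes only these two framings are in play):

```java
public class FramingSniffer {
    enum Framing { SINGLE_OBJECT, CONFLUENT, UNKNOWN }

    // Heuristic only: a raw binary-encoded record could legally start with
    // any of these bytes, so this is a convention check, not proof.
    static Framing sniff(byte[] msg) {
        if (msg.length >= 10 && msg[0] == (byte) 0xC3 && msg[1] == (byte) 0x01) {
            return Framing.SINGLE_OBJECT; // marker + 8-byte fingerprint
        }
        if (msg.length >= 5 && msg[0] == 0x00) {
            return Framing.CONFLUENT;     // magic 0x00 + 4-byte schema id
        }
        return Framing.UNKNOWN;
    }

    static int confluentSchemaId(byte[] msg) {
        // 4-byte big-endian id right after the magic byte
        return java.nio.ByteBuffer.wrap(msg, 1, 4).getInt();
    }

    public static void main(String[] args) {
        byte[] confluent = {0x00, 0, 0, 0, 7, 1, 2, 3};
        System.out.println(sniff(confluent));             // CONFLUENT
        System.out.println(confluentSchemaId(confluent)); // 7
    }
}
```

Neither framing ever carries the full schema in the message; that only happens in Avro container files, or in a homegrown framing of one's own.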
Re: AVRO schema evolution: adding optional column with default fails deserialization
Thanks for the answer.

What I already knew is that in each message there is _somehow_ present either _some_ schema ID or the full schema. I saw some byte array manipulations to get a _somehow_ defined schema ID from byte[], which worked, but that's definitely not how it should be done. What I'm looking for is some documentation of _how_ to do these things right. I really cannot find a single thing, yet there must be some util functions, or anything. Is there some devel-first-steps page where I can find answers for:

* How to test whether byte[] contains the full schema or just an id?
* How to control whether a message is serialized with the ID or with the full schema?
* how to get the ID from byte[]?
* how to get the full schema from byte[]?

I don't have the confluent platform, and cannot have it, but implementing "get schema by ID" should be an easy task, provided that I have that ID. In my scenario I know that the message will be written using one schema, just different versions of it. So I just need to know which version it is, so that I can configure the deserializer to enable schema evolution.

thanks in advance,
Martin

čt 1. 8. 2019 v 15:55 odesílatel Svante Karlsson napsal:

> In an avro file the schema is at the beginning, but if you refer to single
> record serialization like Kafka, then you have to add something that you can
> use to get hold of the schema. Confluent's avro encoder for Kafka uses
> Confluent's schema registry, which uses an int32 as the schema id. This is
> prepended (+ a magic byte) to the binary avro. Thus, using the schema registry
> again, you can get the writer schema.
>
> /Svante
>
> On Thu, Aug 1, 2019, 15:30 Martin Mucha wrote:
>
>> Hi,
>>
>> just one more question, not strictly related to the subject.
>>
>> Initially I thought I'd be OK with using some initial version of the schema
>> in place of the writer schema. That works, but all columns from a schema older
>> than this initial one would be just ignored. So I need to know EXACTLY the
>> schema which the writer used.
I know that avro messages contain either the full
>> schema or at least its ID. Can you point me to the documentation where
>> this is discussed? So in my deserializer I have byte[] as an input, from
>> which I need to get the schema information first, in order to be able to
>> deserialize the record. I really do not know how to do that; I'm pretty
>> sure I never saw this anywhere, and I cannot find it anywhere. But in
>> principle it must be possible, since the reader does not necessarily have any
>> control over which schema the writer used.
>>
>> thanks a lot.
>> M.
>>
>> út 30. 7. 2019 v 18:16 odesílatel Martin Mucha
>> napsal:
>>
>>> Thank you very much for the in-depth answer. I understand how it works
>>> better now, and will test it shortly.
>>> Thank you for your time.
>>>
>>> Martin.
>>>
>>> út 30. 7. 2019 v 17:09 odesílatel Ryan Skraba napsal:
>>>
>>>> Hello! It's the same issue in your example code as allegro, even with
>>>> the SpecificDatumReader.
>>>>
>>>> This line: datumReader = new SpecificDatumReader<>(schema)
>>>> should be: datumReader = new SpecificDatumReader<>(originalSchema,
>>>> schema)
>>>>
>>>> In Avro, the original schema is commonly known as the writer schema
>>>> (the instance that originally wrote the binary data). Schema
>>>> evolution applies when you are using the constructor of the
>>>> SpecificDatumReader that takes *both* reader and writer schemas.
>>>>
>>>> As a concrete example, if your original schema was:
>>>>
>>>> {
>>>>   "type": "record",
>>>>   "name": "Simple",
>>>>   "fields": [
>>>>     {"name": "id", "type": "int"},
>>>>     {"name": "name", "type": "string"}
>>>>   ]
>>>> }
>>>>
>>>> And you added a field:
>>>>
>>>> {
>>>>   "type": "record",
>>>>   "name": "SimpleV2",
>>>>   "fields": [
>>>>     {"name": "id", "type": "int"},
>>>>     {"name": "name", "type": "string"},
>>>>     {"name": "description", "type": ["null", "string"]}
>>>>   ]
>>>> }
>>>>
>>>> You could do the following safely, assuming that Simple and SimpleV2
>>>> class
Re: AVRO schema evolution: adding optional column with default fails deserialization
Hi,

just one more question, not strictly related to the subject.

Initially I thought I'd be OK with using some initial version of the schema in place of the writer schema. That works, but all columns from a schema older than this initial one would be just ignored. So I need to know EXACTLY the schema which the writer used. I know that avro messages contain either the full schema or at least its ID. Can you point me to the documentation where this is discussed? So in my deserializer I have byte[] as an input, from which I need to get the schema information first, in order to be able to deserialize the record. I really do not know how to do that; I'm pretty sure I never saw this anywhere, and I cannot find it anywhere. But in principle it must be possible, since the reader does not necessarily have any control over which schema the writer used.

thanks a lot.
M.

út 30. 7. 2019 v 18:16 odesílatel Martin Mucha napsal:

> Thank you very much for the in-depth answer. I understand how it works
> better now, and will test it shortly.
> Thank you for your time.
>
> Martin.
>
> út 30. 7. 2019 v 17:09 odesílatel Ryan Skraba napsal:
>
>> Hello! It's the same issue in your example code as allegro, even with
>> the SpecificDatumReader.
>>
>> This line: datumReader = new SpecificDatumReader<>(schema)
>> should be: datumReader = new SpecificDatumReader<>(originalSchema, schema)
>>
>> In Avro, the original schema is commonly known as the writer schema
>> (the instance that originally wrote the binary data). Schema
>> evolution applies when you are using the constructor of the
>> SpecificDatumReader that takes *both* reader and writer schemas.
>>
>> As a concrete example, if your original schema was:
>>
>> {
>>   "type": "record",
>>   "name": "Simple",
>>   "fields": [
>>     {"name": "id", "type": "int"},
>>     {"name": "name", "type": "string"}
>>   ]
>> }
>>
>> And you added a field:
>>
>> {
>>   "type": "record",
>>   "name": "SimpleV2",
>>   "fields": [
>>     {"name": "id", "type": "int"},
>>     {"name": "name", "type": "string"},
>>     {"name": "description", "type": ["null", "string"]}
>>   ]
>> }
>>
>> You could do the following safely, assuming that Simple and SimpleV2
>> classes are generated from the avro-maven-plugin:
>>
>> @Test
>> public void testSerializeDeserializeEvolution() throws IOException {
>>   // Write a Simple v1 to bytes using your exact method.
>>   byte[] v1AsBytes = serialize(new Simple(1, "name1"), true, false);
>>
>>   // Read as Simple v2, same as your method but with the writer and
>>   // reader schema.
>>   DatumReader<SimpleV2> datumReader =
>>       new SpecificDatumReader<>(Simple.getClassSchema(), SimpleV2.getClassSchema());
>>   Decoder decoder = DecoderFactory.get().binaryDecoder(v1AsBytes, null);
>>   SimpleV2 v2 = datumReader.read(null, decoder);
>>
>>   assertThat(v2.getId(), is(1));
>>   assertThat(v2.getName(), is(new Utf8("name1")));
>>   assertThat(v2.getDescription(), nullValue());
>> }
>>
>> This demonstrates with two different schemas and SpecificRecords in
>> the same test, but the same principle applies if it's the same record
>> that has evolved -- you need to know the original schema that wrote
>> the data in order to apply the schema that you're now using for
>> reading.
>>
>> I hope this clarifies what you are looking for!
>>
>> All my best, Ryan
>>
>> On Tue, Jul 30, 2019 at 3:30 PM Martin Mucha wrote:
>> >
>> > Thanks for the answer.
>> >
>> > Actually I have exactly the same behavior with avro 1.9.0 and the following
>> deserializer in our other app, which uses strictly the avro codebase, and
>> fails with the same exceptions. So let's leave the "allegro" library and lots
>> of other tools out of our discussion.
>> > I can use whichever approach. All I need is a single way in which I can
>> deserialize byte[] into a class generated by avro-maven-plugin, and which
>> will respect the documentation regarding schema evolution. Currently we're
>> using the following deserializer and serializer, and these do not work when
>> it comes to schema evolution. What is the correct way to serialize and
>> deserialize avro data?
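Since this thread keeps coming back to identifying the writer schema: the id used by Avro's single-object encoding is the CRC-64-AVRO fingerprint defined in the Avro spec, computed over the schema's parsing canonical form (in the Java library, SchemaNormalization covers both steps). A dependency-free sketch of the fingerprint function itself, transcribed from the spec's reference algorithm:

```java
public class Crc64Avro {
    private static final long EMPTY = 0xc15d213aa4d7a795L;
    private static final long[] TABLE = new long[256];

    static {
        // Precompute the byte-at-a-time lookup table for the AVRO polynomial.
        for (int i = 0; i < 256; i++) {
            long fp = i;
            for (int j = 0; j < 8; j++) {
                fp = (fp >>> 1) ^ (EMPTY & -(fp & 1L));
            }
            TABLE[i] = fp;
        }
    }

    // 64-bit fingerprint; input should be the schema's parsing canonical
    // form, encoded as UTF-8.
    static long fingerprint64(byte[] data) {
        long fp = EMPTY;
        for (byte b : data) {
            fp = (fp >>> 8) ^ TABLE[(int) (fp ^ b) & 0xff];
        }
        return fp;
    }

    public static void main(String[] args) {
        byte[] schema = "\"string\"".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.printf("%016x%n", fingerprint64(schema));
    }
}
```

The resulting long is what would be compared against the value read from a single-object header (mind the little-endian byte order there), which is how a reader can pick the exact writer schema version out of a local schema map.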
Re: AVRO schema evolution: adding optional column with default fails deserialization
Thank you very much for the in-depth answer. I understand how it works better now, and will test it shortly. Thank you for your time.

Martin.

út 30. 7. 2019 v 17:09 odesílatel Ryan Skraba napsal:

> Hello! It's the same issue in your example code as allegro, even with
> the SpecificDatumReader.
>
> This line: datumReader = new SpecificDatumReader<>(schema)
> should be: datumReader = new SpecificDatumReader<>(originalSchema, schema)
>
> In Avro, the original schema is commonly known as the writer schema
> (the instance that originally wrote the binary data). Schema
> evolution applies when you are using the constructor of the
> SpecificDatumReader that takes *both* reader and writer schemas.
>
> As a concrete example, if your original schema was:
>
> {
>   "type": "record",
>   "name": "Simple",
>   "fields": [
>     {"name": "id", "type": "int"},
>     {"name": "name", "type": "string"}
>   ]
> }
>
> And you added a field:
>
> {
>   "type": "record",
>   "name": "SimpleV2",
>   "fields": [
>     {"name": "id", "type": "int"},
>     {"name": "name", "type": "string"},
>     {"name": "description", "type": ["null", "string"]}
>   ]
> }
>
> You could do the following safely, assuming that Simple and SimpleV2
> classes are generated from the avro-maven-plugin:
>
> @Test
> public void testSerializeDeserializeEvolution() throws IOException {
>   // Write a Simple v1 to bytes using your exact method.
>   byte[] v1AsBytes = serialize(new Simple(1, "name1"), true, false);
>
>   // Read as Simple v2, same as your method but with the writer and
>   // reader schema.
> DatumReader datumReader = > new SpecificDatumReader<>(Simple.getClassSchema(), > SimpleV2.getClassSchema()); > Decoder decoder = DecoderFactory.get().binaryDecoder(v1AsBytes, null); > SimpleV2 v2 = datumReader.read(null, decoder); > > assertThat(v2.getId(), is(1)); > assertThat(v2.getName(), is(new Utf8("name1"))); > assertThat(v2.getDescription(), nullValue()); > } > > This demonstrates with two different schemas and SpecificRecords in > the same test, but the same principle applies if it's the same record > that has evolved -- you need to know the original schema that wrote > the data in order to apply the schema that you're now using for > reading. > > I hope this clarifies what you are looking for! > > All my best, Ryan > > > > On Tue, Jul 30, 2019 at 3:30 PM Martin Mucha wrote: > > > > Thanks for answer. > > > > Actually I have exactly the same behavior with avro 1.9.0 and following > deserializer in our other app, which uses strictly avro codebase, and > failing with same exceptions. So lets leave "allegro" library and lots of > other tools out of it in our discussion. > > I can use whichever aproach. All I need is single way, where I can > deserialize byte[] into class generated by avro-maven-plugin, and which > will respect documentation regarding schema evolution. Currently we're > using following deserializer and serializer, and these does not work when > it comes to schema evolution. What is the correct way to serialize and > deserializer avro data? > > > > I probably don't understand your mention about GenericRecord or > GenericDatumReader. I tried to use GenericDatumReader in deserializer > below, but then it seems I got back just GenericData$Record instance, which > I can use then to access array of instances, which is not what I'm looking > for(IIUC), since in that case I could have just use plain old JSON and > deserialize it using jackson having no schema evolution problems at all. 
If > that's correct, I'd rather stick to SpecificDatumReader, and somehow fix it > if possible. > > > > What can be done? Or how schema evolution is intended to be used? I > found a lots of question searching for this answer. > > > > thanks! > > Martin. > > > > deserializer: > > > > public static T deserialize(Class > targetType, > >byte[] > data, > >boolean > useBinaryDecoder) { > > try { > > if (data == null) { > > return null; > > } > > > > log.trace("data='{}'", > DatatypeConverter.pri
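The fix Ryan describes, constructing the reader with *both* schemas, is what enables Avro's schema resolution. As a language-neutral illustration of the idea only (a toy model, not the Avro API), resolution projects data written with the writer schema onto the reader schema, filling in reader-side defaults for fields the writer never wrote:

```python
# Toy model of Avro schema resolution (illustration only, not the Avro API).
# A "schema" here is just a list of (field_name, default) pairs; real Avro
# resolution also handles type promotion, aliases, unions, field reordering,
# and errors out when a required reader field has no writer value or default.

def resolve(writer_fields, reader_schema, record):
    """Project a record written with writer_fields onto reader_schema.

    Fields unknown to the reader are dropped; fields missing from the
    writer take the reader's default -- schema evolution's "add a field
    with a default" case.
    """
    resolved = {}
    for name, default in reader_schema:
        if name in writer_fields:
            resolved[name] = record[name]
        else:
            resolved[name] = default  # reader-only field: use its default
    return resolved

# v1 writer has (id, name); the v2 reader adds optional 'description'.
v1_fields = {"id", "name"}
v2_schema = [("id", None), ("name", None), ("description", None)]

v2_view = resolve(v1_fields, v2_schema, {"id": 1, "name": "name1"})
# v2_view == {"id": 1, "name": "name1", "description": None}
```

This is why a reader constructed with only one schema cannot evolve: without knowing the writer's field set, it has no way to tell which reader fields should fall back to their defaults.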
Re: AVRO schema evolution: adding optional column with default fails deserialization
Thanks for the answer.

Actually I have exactly the same behavior with avro 1.9.0 and the following deserializer in our other app, which uses strictly the avro codebase and is failing with the same exceptions. So let's leave the "allegro" library and lots of other tools out of our discussion.

I can use whichever approach. All I need is a single way where I can deserialize byte[] into a class generated by avro-maven-plugin, and which will respect the documentation regarding schema evolution. Currently we're using the following deserializer and serializer, and these do not work when it comes to schema evolution. What is the correct way to serialize and deserialize avro data?

I probably don't understand your mention of GenericRecord or GenericDatumReader. I tried to use GenericDatumReader in the deserializer below, but then it seems I got back just a GenericData$Record instance, which I can then use to access an array of instances, which is not what I'm looking for (IIUC), since in that case I could have just used plain old JSON and deserialized it using jackson, having no schema evolution problems at all. If that's correct, I'd rather stick to SpecificDatumReader and somehow fix it if possible.

What can be done? Or how is schema evolution intended to be used? I found a lot of questions searching for this answer.

thanks!
Martin.

deserializer:

public static <T extends SpecificRecordBase> T deserialize(Class<T> targetType,
                                                           byte[] data,
                                                           boolean useBinaryDecoder) {
    try {
        if (data == null) {
            return null;
        }

        log.trace("data='{}'", DatatypeConverter.printHexBinary(data));

        Schema schema = targetType.newInstance().getSchema();
        DatumReader<T> datumReader = new SpecificDatumReader<>(schema);
        Decoder decoder = useBinaryDecoder
                ? DecoderFactory.get().binaryDecoder(data, null)
                : DecoderFactory.get().jsonDecoder(schema, new String(data));

        T result = targetType.cast(datumReader.read(null, decoder));
        log.trace("deserialized data='{}'", result);
        return result;
    } catch (Exception ex) {
        throw new SerializationException("Error deserializing data", ex);
    }
}

serializer:

public static <T extends SpecificRecordBase> byte[] serialize(T data,
                                                              boolean useBinaryDecoder,
                                                              boolean pretty) {
    try {
        if (data == null) {
            return new byte[0];
        }

        log.debug("data='{}'", data);
        Schema schema = data.getSchema();

        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        Encoder binaryEncoder = useBinaryDecoder
                ? EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null)
                : EncoderFactory.get().jsonEncoder(schema, byteArrayOutputStream, pretty);

        DatumWriter<T> datumWriter = new GenericDatumWriter<>(schema);
        datumWriter.write(data, binaryEncoder);
        binaryEncoder.flush();
        byteArrayOutputStream.close();

        byte[] result = byteArrayOutputStream.toByteArray();
        log.debug("serialized data='{}'", DatatypeConverter.printHexBinary(result));
        return result;
    } catch (IOException ex) {
        throw new SerializationException("Can't serialize data='" + data, ex);
    }
}

út 30. 7. 2019 v 13:48 odesílatel Ryan Skraba napsal:

> Hello! Schema evolution relies on both the writer and reader schemas
> being available.
>
> It looks like the allegro tool you are using is using the
> GenericDatumReader that assumes the reader and writer schema are the
> same:
>
> https://github.com/allegro/json-avro-converter/blob/json-avro-converter-0.2.8/converter/src/main/java/tech/allegro/schema/json2avro/converter/JsonAvroConverter.java#L83
>
> I do not believe that the "default" value is taken into account for
> data that is strictly missing from the binary input, just when a field
> is known to be in the reader schema but missing from the original
> writer.
>
> You may have more luck reading the GenericRecord with a
> GenericDatumReader with both schemas, and using
> `convertToJson(record)`.
>
> I hope this is useful -- Ryan
>
> On Tue, Jul 30, 2019 at 10:20 AM Martin Mucha wrote:
> >
> > Hi,
> >
> > I've got some issues/misunderstanding of AVRO schema evolution.
> >
> > When reading through the avro documentation, for example [1], I understood
> > that schema evolution is supported, and if I added a column with a specified
> > default, it should be backwards compatible (and even forward when I remove
> > it again). Sounds great, so I added a column defined as:
Re: is it possible to deserialize JSON with optional field?
Hi, thanks for responding.

I know that you promote your fork, however considering I might not be able to move away from the "official release", is there an easy way to consume this? Since I cannot see it ...

Maybe a side question: official avro seems to be dead. There are some commits made, but the last release happened 2 years ago, fatal flaws are not being addressed, almost 10-year-old valid bug reports are just ignored, ... Does anyone know about any sign/confirmation that the avro community will be moving toward something more viable?

M.

po 15. 4. 2019 v 15:17 odesílatel Zoltan Farkas napsal:

> It is possible to do it with a custom JsonDecoder.
>
> I wrote one that does this at:
> https://github.com/zolyfarkas/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/io/ExtendedJsonDecoder.java
>
> hope it helps.
>
> —Z
>
> On Apr 13, 2019, at 9:24 AM, Martin Mucha wrote:
>
> Hi,
>
> is it possible by design to deserialize JSON with a schema having an
> optional value?
>
> Schema:
>
> {
>   "type" : "record",
>   "name" : "UserSessionEvent",
>   "namespace" : "events",
>   "fields" : [ {
>     "name" : "username",
>     "type" : "string"
>   }, {
>     "name" : "errorData",
>     "type" : [ "null", "string" ],
>     "default" : null
>   } ]
> }
>
> Value to deserialize:
>
> {"username" : "2271AE67-34DE-4B43-8839-07216C5D10E1"}
>
> I also tried to change the order of the types in the union, but that
> changed nothing. I know I can produce ill-formatted JSON which could be
> deserialized, but that's not acceptable. AFAIK the given JSON with
> required `username` and optional `errorData` cannot be deserialized by
> design. Am I right?
>
> thanks.
is it possible to deserialize JSON with optional field?
Hi,

is it possible by design to deserialize JSON with a schema having an optional value?

Schema:

{
  "type" : "record",
  "name" : "UserSessionEvent",
  "namespace" : "events",
  "fields" : [ {
    "name" : "username",
    "type" : "string"
  }, {
    "name" : "errorData",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}

Value to deserialize:

{"username" : "2271AE67-34DE-4B43-8839-07216C5D10E1"}

I also tried to change the order of the types in the union, but that changed nothing. I know I can produce ill-formatted JSON which could be deserialized, but that's not acceptable. AFAIK the given JSON with required `username` and optional `errorData` cannot be deserialized by design. Am I right?

thanks.
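For the record, this behavior is by design: Avro's JSON *encoding* is not plain JSON. No field may be omitted, and non-null union values are tagged with their branch type, so the event above would have to be written as {"username": "...", "errorData": null}, or {"errorData": {"string": "..."}} when set. A conceptual sketch of that pre-processing step (my own toy code, not part of Avro; this is roughly the gap Zoltan's ExtendedJsonDecoder, linked earlier in the thread, bridges):

```python
# Sketch: convert "plain" JSON into Avro's JSON encoding for a record whose
# optional fields are ["null", "string"] unions. Conceptual only -- a real
# converter needs full schema-driven union/branch handling.

def to_avro_json(plain, optional_string_fields):
    """Add explicit nulls for missing optional fields and tag set values
    with their union branch name, as Avro's JSON encoding requires."""
    encoded = dict(plain)
    for field in optional_string_fields:
        value = plain.get(field)
        # Avro's JSON encoding tags non-null union values by type name.
        encoded[field] = None if value is None else {"string": value}
    return encoded

event = {"username": "2271AE67-34DE-4B43-8839-07216C5D10E1"}
avro_json = to_avro_json(event, ["errorData"])
# avro_json == {"username": "2271AE67-...", "errorData": None}
```

After this wrapping, the stock jsonDecoder accepts the payload; without it, the decoder fails exactly as described above.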
Re: how to do non-optional field in nested object?
The following schema:

{
  "name" : "ARecord",
  "type" : "record",
  "namespace" : "AAA",
  "fields" : [
    {"name": "A", "type": "string" }
  ]
}

does not validate the json {} as a valid one. It's invalid. A is required, as I would expect.

---

I'm getting lost here. What is the resolution? I kind of wanted to use an avro schema to validate the described JSON, and now, based on what you said (IIUC), validation in "xschema style" ("1 of this, followed by 2 of that") is not possible with an avro schema. Correct?

Btw. what do you use to validate JSON using avro? I used avro-utils.jar executed from the command line, which proved incapable of deserializing & validating optional fields if they are set (if an optional field is set, I have to pass the value in json like: {"string":"value"}). So now I'm using, for testing purposes, the actual flow through NIFI, which is extremely cumbersome.

2017-11-27 16:19 GMT+01:00 Dan Schmitt <dan.schm...@gmail.com>:

> The top level object in all the examples is a record (of which you can
> have 0 or more.)
>
> So, right now, even the top level is failing the spec:
>
> IV) valid (0 ARecords):
> { }
>
> V) valid (2 ARecords):
> {
>   "id": "...",
>   "B": {
>     "C": "..."
>   }
> } ,
> {
>   "id": "...",
>   "B": {
>     "C": "..."
>   }
> }
>
> On Mon, Nov 27, 2017 at 9:47 AM, Martin Mucha <alfon...@gmail.com> wrote:
> > I don't understand where "or more" is coming from.
>
> Because of the use of nested records. Anywhere you put a record you can
> have 0, or more than 1. I don't know of a pure schema way to enforce
> only 1/optional.
>
> If you are doing something with any of the APIs you could add a
> validation step that says "must have only 1 ARecord, and must have only
> 1 BRecord" after you read the data and throw an error for the 0 or 1+
> situations, but you'd need to write some code somewhere with one of the
> APIs and build your own validator.
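Absent a convenient CLI, the "build your own validator" suggestion can be surprisingly small. A minimal sketch (my own toy checker, not an Avro API) that enforces "required unless the field's type is a union containing null", descending into nested records the way the question below asks for:

```python
# Toy required-field checker for a simplified Avro-like schema expressed as
# Python dicts/lists. Not the Avro library -- unions are plain lists here,
# and only record nesting is handled.

def is_optional(field_type):
    """A field is optional iff its type is a union containing "null"."""
    return isinstance(field_type, list) and "null" in field_type

def validate(schema, value):
    """Return True iff every non-optional field of the record schema is
    present in value, recursively checking nested records that are supplied."""
    for field in schema["fields"]:
        name, ftype = field["name"], field["type"]
        if name not in value:
            if not is_optional(ftype):
                return False  # required field missing
            continue
        # If the field (or a branch of its union) is a record and the value
        # actually supplies an object, validate that object recursively.
        branches = ftype if isinstance(ftype, list) else [ftype]
        for branch in branches:
            if (isinstance(branch, dict) and branch.get("type") == "record"
                    and isinstance(value[name], dict)):
                if not validate(branch, value[name]):
                    return False
    return True

# The ARecord/BRecord schema from this thread.
a_record = {
    "name": "ARecord", "type": "record",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "B", "type": ["null", {
            "type": "record", "name": "BRecord",
            "fields": [{"name": "C", "type": "string"}]}]},
    ],
}
```

Under this rule, {"id": "x"} passes, {"id": "x", "B": {}} fails (C is required once B is supplied), and {"id": "x", "B": {"C": "y"}} passes, which is exactly the I/II/III behavior requested below.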
Re: how to do non-optional field in nested object?
I don't understand where "or more" is coming from. IIUC (and I may not), there's just one top-level json object. If so, there's 1 ARecord. ARecord has a required ID field, thus has 1 ID field. Then it has a second field, BRecord, one of them. This BRecord has 2 required fields, B and C, each of which should come exactly once. Right?

What I'm after is:

I) valid:
{ "id": "..." }

II) invalid:
{ "id": "...", "B": { } }

III) valid:
{ "id": "...", "B": { "C": "..." } }

IV) everything else is invalid.

2017-11-27 15:39 GMT+01:00 Dan Schmitt <dan.schm...@gmail.com>:

> The problem is BRecord can be 0 or more and you still end up with a
> valid B.
>
> How about
>
> {
>   "name" : "ARecord",
>   "type" : "record",
>   "namespace" : "A",
>   "fields" : [
>     {"name": "id", "type": "string" },
>     {
>       "name": "BRecord",
>       "type": "record",
>       "fields": [
>         { "name": "B", "type": "string" },
>         { "name": "C", "type": "string" }
>       ]
>     }
>   ]
> }
>
> This gives me 0 or more ARecords, each with an id, and 0 or more
> BRecords associated with each ARecord, each with a B and C. If you
> wanted one or more C's I don't see a trivial clean way to do that (you
> could add a Cextras array to the BRecord to get 0 or more additional C
> things, but that feels unclean.)
>
> On Mon, Nov 27, 2017 at 9:10 AM, Martin Mucha <alfon...@gmail.com> wrote:
> > Thanks for the reply.
> >
> > Sadly it does not work that way (here). Even:
> >
> > {
> >   "name" : "ARecord",
> >   "type" : "record",
> >   "namespace" : "A",
> >   "fields" : [
> >     {"name": "id", "type": "string" },
> >     {
> >       "name": "B",
> >       "type": {
> >         "type": "record",
> >         "name": "BRecord",
> >         "fields": [
> >           { "name": "C", "type": "string" }
> >         ]
> >       }
> >     }
> >   ]
> > }
> >
> > does not require C. And that's not what I want ... I'd like an
> > optional B, and once the user provides B, then B.C is required.
> >
> > Martin.
> >
> > 2017-11-27 15:06 GMT+01:00 Dan Schmitt <dan.schm...@gmail.com>:
> >>
> >> "name": "B",
> >> "type": ["null", {
> >>
> >> The [] union lets you do null or a BRecord; your JSON does null.
> >> Pull the null from the union and it will require the C.
> >>
> >> On Mon, Nov 27, 2017 at 9:00 AM, Martin Mucha <alfon...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I have this avro schema:
> >> >
> >> > {
> >> >   "name" : "ARecord",
> >> >   "type" : "record",
> >> >   "namespace" : "A",
> >> >   "fields" : [
> >> >     {"name": "id", "type": "string" },
> >> >     {
> >> >       "name": "B",
> >> >       "type": ["null", {
> >> >         "type": "record",
> >> >         "name": "BRecord",
> >> >         "fields": [
> >> >           { "name": "C", "type": "string" }
> >> >         ]
> >> >       }]
> >> >     }
> >> >   ]
> >> > }
> >> >
> >> > and the following JSON, which validates against it:
> >> >
> >> > {
> >> >   "id": "...",
> >> >   "B": { }
> >> > }
> >> >
> >> > I would expect that C is required. Why isn't it? What shall I do to
> >> > make it required?
> >> >
> >> > Thanks!
> >> > Martin.
Re: how to do non-optional field in nested object?
Thanks for the reply.

Sadly it does not work that way (here). Even:

{
  "name" : "ARecord",
  "type" : "record",
  "namespace" : "A",
  "fields" : [
    {"name": "id", "type": "string" },
    {
      "name": "B",
      "type": {
        "type": "record",
        "name": "BRecord",
        "fields": [
          { "name": "C", "type": "string" }
        ]
      }
    }
  ]
}

does not require C. And that's not what I want ... I'd like an optional B, and once the user provides B, then B.C is required.

Martin.

2017-11-27 15:06 GMT+01:00 Dan Schmitt <dan.schm...@gmail.com>:

> "name": "B",
> "type": ["null", {
>
> The [] union lets you do null or a BRecord; your JSON does null.
> Pull the null from the union and it will require the C.
>
> On Mon, Nov 27, 2017 at 9:00 AM, Martin Mucha <alfon...@gmail.com> wrote:
> > Hi,
> >
> > I have this avro schema:
> >
> > {
> >   "name" : "ARecord",
> >   "type" : "record",
> >   "namespace" : "A",
> >   "fields" : [
> >     {"name": "id", "type": "string" },
> >     {
> >       "name": "B",
> >       "type": ["null", {
> >         "type": "record",
> >         "name": "BRecord",
> >         "fields": [
> >           { "name": "C", "type": "string" }
> >         ]
> >       }]
> >     }
> >   ]
> > }
> >
> > and the following JSON, which validates against it:
> >
> > {
> >   "id": "...",
> >   "B": { }
> > }
> >
> > I would expect that C is required. Why isn't it? What shall I do to
> > make it required?
> >
> > Thanks!
> > Martin.
how to do non-optional field in nested object?
Hi,

I have this avro schema:

{
  "name" : "ARecord",
  "type" : "record",
  "namespace" : "A",
  "fields" : [
    {"name": "id", "type": "string" },
    {
      "name": "B",
      "type": ["null", {
        "type": "record",
        "name": "BRecord",
        "fields": [
          { "name": "C", "type": "string" }
        ]
      }]
    }
  ]
}

and the following JSON, which validates against it:

{
  "id": "...",
  "B": { }
}

I would expect that C is required. Why isn't it? What shall I do to make it required?

Thanks!
Martin.
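It may be worth noting what Avro's own JSON encoding of this schema looks like, since the loose validation above happens with tools that accept plain JSON. In the Avro JSON encoding, union values are tagged by branch, and (if I read the spec correctly, the tag for a named type is its fullname, here namespace "A" plus name "BRecord") a payload with B present would have to be:

```json
{
  "id": "...",
  "B": { "A.BRecord": { "C": "..." } }
}
```

and an absent B would be an explicit "B": null. The untagged {"id": "...", "B": {}} from the question is plain JSON; a strict Avro JSON decoder would reject it rather than treat the empty object as a valid BRecord.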