Re: how to configure not using Utf8 in avro-maven-plugin generate-sources

2021-06-23 Thread Martin Mucha
Thanks, looks nice. I need to dig (much) more into aspect-oriented
programming myself so that I can use this stuff.

To update on the issue: the configuration [1] of avro-maven-plugin actually
works; what I was facing was an Avro version incompatibility (between minor
versions). After resolving that, basic types are correctly generated as
String and deserialize correctly. However, what is not correctly handled
(and I'm sure about it this time) is arrays. A field definition like:

{
  "name": "codes",
  "type": ["null", {
    "type": "array",
    "name": "codeArray",
    "items": {
      "type": "string"
    }
  }],
  "default": null
}

will be cast in the put method to a List, but the individual items of that
array will be instances of Utf8. I'm writing this to you because your aspect
does not handle this situation.
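
In case it helps anyone else, here is a rough sketch of what could be added
inside anyPutCall() to also cover that case (untested, and it assumes the
same generated package as your point cut):

if (value instanceof java.util.List) {
    // copy the list, replacing Utf8 elements with java.lang.String
    java.util.List<Object> converted = new java.util.ArrayList<>();
    for (Object item : (java.util.List<?>) value) {
        converted.add(item instanceof Utf8 ? item.toString() : item);
    }
    value = converted;
}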

Thanks for your help!
M.


[1]
<configuration>
  <stringType>String</stringType>
  <fieldVisibility>PRIVATE</fieldVisibility>
</configuration>

On Mon, 21 Jun 2021 at 18:53, Chad Preisler wrote:

> I created a point cut to work around this issue.
>
> @Aspect
> public class SpecificRecordBasePutPointCut {
>
>     public static final Logger LOGGER =
>             LoggerFactory.getLogger(SpecificRecordBasePutPointCut.class);
>
>     @Pointcut("execution(* your.package.for.generated.code.*.put(int, java.lang.Object)) && args(i, value)")
>     void put(int i, java.lang.Object value) {}
>
>     @Around("put(i, value)")
>     public Object anyPutCall(ProceedingJoinPoint thisJoinPoint, int i,
>                              java.lang.Object value) throws Throwable {
>         if (value != null) {
>             LOGGER.debug("Value type is " + value.getClass().getName());
>         }
>         if (value != null && (value instanceof Utf8 || value instanceof CharSequence)) {
>             LOGGER.debug("In toString for i " + i);
>             value = value.toString();
>         }
>         LOGGER.debug("returning the i " + i);
>         return thisJoinPoint.proceed(new Object[]{i, value});
>     }
> }
>
> On Mon, Jun 21, 2021 at 4:56 AM Martin Mucha  wrote:
>
>> update after some research.
>>
>> It seems that the configuration excerpt from my first mail actually
>> works: the @AvroGenerated class declares String fields, but then tries to
>> cast Utf8 to String.
>>
>> The generated code looks like this:
>>
>> public void put(int field$, Object value$) {
>> switch(field$) {
>> case 0:
>> this.someField = (String)value$;
>>
>> which is indeed incorrect.
>>
>> According to:
>>
>> https://issues.apache.org/jira/browse/AVRO-2702
>>
>> this should be solved in 1.10 (it is not; the incorrect code is still
>> generated). And if someone (like myself) is bound to 1.9.2 because of
>> Confluent, there is no fix for this minor version branch at all. There are
>> some workarounds, but they do not cover all use cases, so the only option
>> left for me is to accept the Avro team's decision that Utf8 is my favorite
>> type now and fix a gazillion places in a rather huge project, which is
>> just awesome.
>>
>> On Mon, 21 Jun 2021 at 11:17, Martin Mucha wrote:
>>
>>> It seems that the transition 1.8.2 -> 1.9.2 brings a backwards
>>> incompatibility, and
>>>
>>> <configuration>
>>>   <stringType>String</stringType>
>>> </configuration>
>>>
>>> which did work to change generation from CharSequence to String, does
>>> not work any more. Within a 15-minute search I was unable to find literally
>>> any documentation of this plugin, so I don't know if there is some new way
>>> to configure it for Avro 1.9.2 and newer.
>>>
>>> Can someone advise?
>>> Thanks.
>>>
>>


Re: how to configure not using Utf8 in avro-maven-plugin generate-sources

2021-06-21 Thread Martin Mucha
update after some research.

It seems that the configuration excerpt from my first mail actually
works: the @AvroGenerated class declares String fields, but then tries to
cast Utf8 to String.

The generated code looks like this:

public void put(int field$, Object value$) {
switch(field$) {
case 0:
this.someField = (String)value$;

which is indeed incorrect.

According to:

https://issues.apache.org/jira/browse/AVRO-2702

this should be solved in 1.10 (it is not; the incorrect code is still
generated). And if someone (like myself) is bound to 1.9.2 because of
Confluent, there is no fix for this minor version branch at all. There are
some workarounds, but they do not cover all use cases, so the only option
left for me is to accept the Avro team's decision that Utf8 is my favorite
type now and fix a gazillion places in a rather huge project, which is just
awesome.

On Mon, 21 Jun 2021 at 11:17, Martin Mucha wrote:

> It seems that the transition 1.8.2 -> 1.9.2 brings a backwards incompatibility, and
>
> <configuration>
>   <stringType>String</stringType>
> </configuration>
>
> which did work to change generation from CharSequence to String, does not
> work any more. Within a 15-minute search I was unable to find literally any
> documentation of this plugin, so I don't know if there is some new way
> to configure it for Avro 1.9.2 and newer.
>
> Can someone advise?
> Thanks.
>


how to configure not using Utf8 in avro-maven-plugin generate-sources

2021-06-21 Thread Martin Mucha
It seems that the transition 1.8.2 -> 1.9.2 brings a backwards incompatibility, and

<configuration>
  <stringType>String</stringType>
</configuration>

which did work to change generation from CharSequence to String, does not
work any more. Within a 15-minute search I was unable to find literally any
documentation of this plugin, so I don't know if there is some new way
to configure it for Avro 1.9.2 and newer.
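
For context, the full plugin declaration this excerpt comes from looks
roughly like this (typed from memory, so treat the version and directories
as placeholders rather than a verified snippet):

<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.9.2</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
        <stringType>String</stringType>
      </configuration>
    </execution>
  </executions>
</plugin>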

Can someone advise?
Thanks.


Re: Recommended naming of types to support schema evolution

2020-01-01 Thread Martin Mucha
ronments, but tbh it hasn't caused any _real_
> problems yet, but it's something I would consider approaching with a global
> registry (fed by my CI system?) in the future.
>
>
>> ~ I really don't know how this works/should work, as there are close to no
>> complete actual examples and the documentation does not help much. For
>> example, if an Avro schema evolves from v1 to v2 and the type names and
>> namespaces aren't the same, how will the pairing between fields be made?
>> Completely puzzling. I need nothing less than schema evolution with
>> backward and forward compatibility plus schema reuse (i.e. no hacks with a
>> top-level union, but schema reuse via schema imports). I think I can hack
>> my way through by using one parser per set of one schema of a given version
>> plus all needed imports, which will make everything work (well, I don't yet
>> know about anything which will fail), but it completely does not feel
>> right. And I would like to know what the correct Avro way is. And I suppose
>> it should be possible without the Confluent schema registry, just with
>> single object encoding, as I cannot see any difference between them, but
>> please correct me if I'm wrong.
>>
>
> You lost me here, I think you're maybe crossing some vocabulary from your
> language stack, not from Avro per se, but I'm coming at Avro from Ruby and
> Node (yikes.) and have never used any JVM language integration, so assume
> this is ignorance on my part.
>
> Maybe it'd help to know what "evolution" you plan, and what type names and
> name schemas you plan to be changing? The "schema evolution" is mostly
> meant to make it easier to add and remove fields from the schemas without
> having to coordinate deploys and juggle iron-clad contract interchange
> formats. It's not meant for wild rewrites of the contract IDLs on active
> running services!
>
> All the best for 2020, anyone else who happens to be reading mailing list
> emails this NYE!
>
>
>> thanks,
>> Mar.
>>
>> On Mon, 30 Dec 2019 at 20:32, Lee Hambley wrote:
>>
>>> Hi Martin,
>>>
>>> I believe the answer is "just use the schema registry". When you then
>>> encode for the network, your library should give you a binary package with
>>> a 5-byte header that includes the schema version and name from the
>>> registry. The reader will then go to the registry, find that schema at that
>>> version and use it for decoding.
>>>
>>> In my experience the naming/etc doesn't matter, only things like
>>> defaults in enums and things need to be given a thought, but you'll see
>>> that for yourself with experience.
>>>
>>> HTH, Regards,
>>>
>>> Lee Hambley
>>> http://lee.hambley.name/
>>> +49 (0) 170 298 5667
>>>
>>>
>>> On Mon, 30 Dec 2019 at 17:26, Martin Mucha  wrote:
>>>
>>>> Hi,
>>>> I'm relatively new to avro, and I'm still struggling with getting
>>>> schema evolution and related issues. But today it should be simple 
>>>> question.
>>>>
>>>> What is recommended naming of types if we want to use schema evolution?
>>>> Should namespace contain some information about version of schema? Or
>>>> should it be in type itself? Or neither? What is the best practice? Is
>>>> evolution even possible if namespace/type name is different?
>>>>
>>>> I thought that "neither" it's the case, built the app so that version
>>>> ID is nowhere except for the directory structure, only latest version is
>>>> compiled to java classes using maven plugin, and parsed all other avsc
>>>> files in code (to be able to build some sort of schema registry, identify
>>>> used writer schema using single object encoding and use schema evolution).
>>>> However I used separate Parser instance to parse each schema. But if one
>>>> would like to use schema imports, he cannot have separate parser for every
>>>> schema, and having global one in this setup is also not possible, as each
>>>> type can be registered just once in org.apache.avro.Schema.Names. Btw. I
>>>> favored this variant(ie. no ID in name/namespace) because in this setup,
>>>> after I introduce new schema version, I do not have to change imports in
>>>> whole project, but just one line in pom.xml saying which directory should
>>>> be compiled into java files.
>>>>
>>>> so what could be the suggestion to correct naming-versioning scheme?
>>>> thanks,
>>>> M.
>>>>
>>>


Re: Recommended naming of types to support schema evolution

2019-12-30 Thread Martin Mucha
Hi, thanks for answer.

I don't understand Avro sufficiently and don't know the schema registry at
all, actually. So maybe the following questions will be dumb.

a) how is a schema registry with a 5-byte header different from single object
encoding with a 10-byte header?
b) will a schema registry somehow relieve me from having to parse individual
schemas? What if I want to/have to send 2 different versions of a certain
schema?
c) actually what I have here is a (seemingly) pretty similar setup (and btw,
one which was recommended here as an alternative to the Confluent schema
registry): it's a registry without an extra service. A trivial map keyed by
the single-object-encoding schema fingerprint (a long), pairing each schema
fingerprint with its schema. So when the bytes "arrive" I can easily read the
header, find out the fingerprint, get hold of the schema and decode the data.
Trivial. But the snag is that a single Schema.Names instance can contain just
one Name of a given "identity", and equality is based on the fully qualified
type, i.e. namespace and name. Thus if you have a schema in 2 versions which
have the same namespace and name, they cannot be parsed using the same
Parser. Does the schema registry (from the Confluent platform, right?) work
differently than this? Does this "use it for decoding" process bypass Avro's
new Schema.Parser().parse and everything beneath it?

~ I really don't know how this works/should work, as there are close to no
complete actual examples and the documentation does not help much. For
example, if an Avro schema evolves from v1 to v2 and the type names and
namespaces aren't the same, how will the pairing between fields be made?
Completely puzzling. I need nothing less than schema evolution with backward
and forward compatibility plus schema reuse (i.e. no hacks with a top-level
union, but schema reuse via schema imports). I think I can hack my way
through by using one parser per set of one schema of a given version plus all
needed imports, which will make everything work (well, I don't yet know about
anything which will fail), but it completely does not feel right. And I would
like to know what the correct Avro way is. And I suppose it should be
possible without the Confluent schema registry, just with single object
encoding, as I cannot see any difference between them, but please correct me
if I'm wrong.
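
For what it's worth, the closest thing I have found in Avro itself to the
home-grown registry described above is the single-object-encoding support in
org.apache.avro.message (available since 1.8.2, if I read the docs right).
A minimal sketch, with MyRecord standing in for any class generated by
avro-maven-plugin and exception handling omitted:

// writer side: prepends the C3 01 marker + CRC-64-AVRO fingerprint of the writer schema
BinaryMessageEncoder<MyRecord> encoder =
        new BinaryMessageEncoder<>(SpecificData.get(), MyRecord.getClassSchema());
byte[] bytes = encoder.encode(record).array();

// reader side: older writer schemas are registered by fingerprint,
// and schema resolution (evolution) towards the current schema is applied
SchemaStore.Cache oldSchemas = new SchemaStore.Cache();
oldSchemas.addSchema(new Schema.Parser().parse(new File("v1/MyRecord.avsc")));
BinaryMessageDecoder<MyRecord> decoder =
        new BinaryMessageDecoder<>(SpecificData.get(), MyRecord.getClassSchema(), oldSchemas);
MyRecord decoded = decoder.decode(bytes);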

thanks,
Mar.

On Mon, 30 Dec 2019 at 20:32, Lee Hambley wrote:

> Hi Martin,
>
> I believe the answer is "just use the schema registry". When you then
> encode for the network, your library should give you a binary package with a
> 5-byte header that includes the schema version and name from the registry.
> The reader will then go to the registry, find that schema at that
> version and use it for decoding.
>
> In my experience the naming/etc doesn't matter, only things like defaults
> in enums and things need to be given a thought, but you'll see that for
> yourself with experience.
>
> HTH, Regards,
>
> Lee Hambley
> http://lee.hambley.name/
> +49 (0) 170 298 5667
>
>
> On Mon, 30 Dec 2019 at 17:26, Martin Mucha  wrote:
>
>> Hi,
>> I'm relatively new to avro, and I'm still struggling with getting schema
>> evolution and related issues. But today it should be simple question.
>>
>> What is recommended naming of types if we want to use schema evolution?
>> Should namespace contain some information about version of schema? Or
>> should it be in type itself? Or neither? What is the best practice? Is
>> evolution even possible if namespace/type name is different?
>>
>> I thought that "neither" is the case, and built the app so that the version
>> ID is nowhere except in the directory structure; only the latest version is
>> compiled to Java classes using the maven plugin, and all other avsc files
>> are parsed in code (to be able to build some sort of schema registry,
>> identify the writer schema used via single object encoding, and use schema
>> evolution). However, I used a separate Parser instance to parse each schema.
>> But if one would like to use schema imports, he cannot have a separate
>> parser for every schema, and having a global one in this setup is also not
>> possible, as each type can be registered just once in
>> org.apache.avro.Schema.Names. Btw. I favored this variant (i.e. no ID in
>> name/namespace) because in this setup, after I introduce a new schema
>> version, I do not have to change imports in the whole project, but just one
>> line in pom.xml saying which directory should be compiled into Java files.
>>
>> So what would be the suggested correct naming/versioning scheme?
>> thanks,
>> M.
>>
>


Recommended naming of types to support schema evolution

2019-12-30 Thread Martin Mucha
Hi,
I'm relatively new to Avro, and I'm still struggling to get schema
evolution and related issues right. But today's should be a simple question.

What is the recommended naming of types if we want to use schema evolution?
Should the namespace contain some information about the schema version? Or
should it be in the type name itself? Or neither? What is the best practice?
Is evolution even possible if the namespace/type name differs?

I thought that "neither" is the case, and built the app so that the version
ID is nowhere except in the directory structure; only the latest version is
compiled to Java classes using the maven plugin, and all other avsc files are
parsed in code (to be able to build some sort of schema registry, identify
the writer schema used via single object encoding, and use schema evolution).
However, I used a separate Parser instance to parse each schema. But if one
would like to use schema imports, he cannot have a separate parser for every
schema, and having a global one in this setup is also not possible, as each
type can be registered just once in org.apache.avro.Schema.Names. Btw. I
favored this variant (i.e. no ID in name/namespace) because in this setup,
after I introduce a new schema version, I do not have to change imports in
the whole project, but just one line in pom.xml saying which directory should
be compiled into Java files.
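
To make the "separate Parser instance per schema version" part concrete, this
is roughly what I mean (a sketch only; the versioned directory layout is made
up, and within a version the imported schemas have to be parsed before the
schemas that use them):

static Map<Long, Schema> loadWriterSchemas(File schemaRoot) throws IOException {
    Map<Long, Schema> byFingerprint = new HashMap<>();
    for (File versionDir : schemaRoot.listFiles(File::isDirectory)) {
        // fresh Parser per version directory, shared by all files inside it,
        // so imports resolve but identical full names across versions don't collide
        Schema.Parser parser = new Schema.Parser();
        for (File avsc : versionDir.listFiles((dir, name) -> name.endsWith(".avsc"))) {
            Schema schema = parser.parse(avsc);
            byFingerprint.put(SchemaNormalization.parsingFingerprint64(schema), schema);
        }
    }
    return byFingerprint;
}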

So what would be the suggested correct naming/versioning scheme?
thanks,
M.


schema evolution with top-level union.

2019-11-14 Thread Martin Mucha
Hi, I'm encountering weird behavior and have no idea how to fix it. Any
suggestions welcome.

The issue revolves around a union type at the top level, which I personally
dislike and consider to be a hack, but I understand the motivation behind it:
someone wanted to declare N types within a single avsc file (probably). The
drawback is that this construct does not support Avro schema evolution (read
on). If there is a possibility to reshape that avsc so that multiple types
are somehow available at the top level and evolution works, I'm listening.

Now the code:

old version of schema:

{
  "namespace": "test",
  "name": "TestAvro",
  "type": "record",
  "fields": [
{
  "name": "a",
  "type": "string"
}
  ]
}

updated version of schema, to which former should evolve:

{
  "namespace": "test",
  "name": "TestAvro",
  "type": "record",
  "fields": [
{
  "name": "a",
  "type": "string"
},
{
  "name": "b",
  "type": ["null", "string"],
  "default": null
}
  ]
}


serialization:


private <T extends SpecificRecordBase> byte[] serialize(final T data) {
    try (ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
        Encoder binaryEncoder =
                EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null);
        DatumWriter<T> datumWriter = new SpecificDatumWriter<>(data.getSchema());
        datumWriter.write(data, binaryEncoder);
        binaryEncoder.flush();

        return byteArrayOutputStream.toByteArray();
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}


deserialization:

private static <T extends SpecificRecordBase> T deserializeUsingSchemaEvolution(
        Class<T> targetType,
        Schema readerSchema,
        Schema writerSchema,
        byte[] data) {
    try {
        if (data == null) {
            return null;
        }

        DatumReader<T> datumReader = new SpecificDatumReader<>(writerSchema, readerSchema);
        Decoder decoder = DecoderFactory.get().binaryDecoder(data, null);

        return targetType.cast(datumReader.read(null, decoder));
    } catch (Exception ex) {
        throw new SerializationException("Error deserializing data", ex);
    }
}


--> this WORKS. Data will be serialized and deserialized, and evolution
works as intended.

Now put square brackets into both avsc files, i.e. add a first and last
character to those files, so that the first char in those schemata is [
and ] is the last char.

After that, deserialization won't work at all. The errors produced vary
wildly, depending on the Avro version and the schemata. One can encounter
simple "cannot be deserialized" errors, "Utf8 cannot be cast to String"
errors, or even "X cannot be cast to Y", where X and Y are random types
from the top-level union and where such a cast makes no sense.

Any suggestions would be greatly appreciated, as I inherited those
schemata with top-level unions and really don't have any idea how to
make them work.
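
One direction I have been poking at since writing this (an untested sketch,
so take it with a grain of salt): serialize() above writes with
data.getSchema(), i.e. the bare record schema and no union index, so the
bracketed schema from the avsc cannot be used directly as the writer schema
when decoding. Picking the matching branch out of the top-level union first
seems closer to correct (readerSchema here is the generated record's schema,
e.g. TestAvro.getClassSchema()):

Schema effectiveWriterSchema = writerSchema;
if (writerSchema.getType() == Schema.Type.UNION) {
    // the avsc is a top-level union; the bytes were written with the bare record
    // schema, so resolve against the branch whose full name matches the reader
    effectiveWriterSchema = writerSchema.getTypes().stream()
            .filter(branch -> branch.getFullName().equals(readerSchema.getFullName()))
            .findFirst()
            .orElseThrow(() -> new IllegalArgumentException(
                    "No union branch named " + readerSchema.getFullName()));
}
DatumReader<TestAvro> datumReader = new SpecificDatumReader<>(effectiveWriterSchema, readerSchema);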

Thanks,

M.


Re: AVRO schema evolution: adding optional column with default fails deserialization

2019-08-01 Thread Martin Mucha
Thanks for the answer!

Regarding "which byte[] are we talking about?": actually I don't know.
Please let's break it down together.

I'm pretty sure that we're not using the Confluent platform (iiuc the paid
bundle, right?). I shared a serializer before [1], so you're saying that this
won't include either the schema ID or the schema, OK? Ok, let's assume that.
Next: we're using the Spring Kafka project to take this serialized data and
send it over Kafka. So we don't have any schema registry, but in principle it
would be possible to include the schema within each message. But I cannot see
how that could be done. Spring Kafka requires us to provide
org.apache.kafka.clients.producer.ProducerConfig#VALUE_SERIALIZER_CLASS_CONFIG,
which we did, but it's just a class calling the serializer [1], and from that
point on I have no idea how it could figure out the schema used. The question
I'm asking here is: when sending Avro bytes (obtained from the provided
serializer [1]), are they, or can they be, somehow paired with the schema
used to serialize the data? Is this what Kafka senders do, or can do? Include
the ID/whole schema somewhere in headers or ...? And when I read Kafka
messages, will the schema be (or could it be) stored somewhere in the
ConsumerRecord or somewhere like that?
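
One thing that might answer the "can they be paired" part, sketched under the
assumption that we control both producer and consumer and are free to add our
own Kafka header (the topic, the header key and the fingerprint-to-schema map
are all made up for the example):

// producer side: attach the writer schema's CRC-64-AVRO fingerprint as a header
long fingerprint = SchemaNormalization.parsingFingerprint64(data.getSchema());
ProducerRecord<String, byte[]> producerRecord =
        new ProducerRecord<>("my-topic", key, serialize(data, true, false));
producerRecord.headers().add("avro.schema.fingerprint",
        ByteBuffer.allocate(Long.BYTES).putLong(fingerprint).array());

// consumer side: look the writer schema up in our own fingerprint -> Schema map
Header header = consumerRecord.headers().lastHeader("avro.schema.fingerprint");
long writerFingerprint = ByteBuffer.wrap(header.value()).getLong();
Schema writerSchema = writerSchemasByFingerprint.get(writerFingerprint);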

Sorry for the confused questions, but I'm really missing the knowledge to
even ask properly.

thanks,
Martin.

[1]
public static <T extends SpecificRecordBase> byte[] serialize(T data,
                                                              boolean useBinaryDecoder,
                                                              boolean pretty) {
    try {
        if (data == null) {
            return new byte[0];
        }

        log.debug("data='{}'", data);
        Schema schema = data.getSchema();
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        Encoder binaryEncoder = useBinaryDecoder
                ? EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null)
                : EncoderFactory.get().jsonEncoder(schema, byteArrayOutputStream, pretty);

        DatumWriter<T> datumWriter = new GenericDatumWriter<>(schema);
        datumWriter.write(data, binaryEncoder);

        binaryEncoder.flush();
        byteArrayOutputStream.close();

        byte[] result = byteArrayOutputStream.toByteArray();
        log.debug("serialized data='{}'", DatatypeConverter.printHexBinary(result));
        return result;
    } catch (IOException ex) {
        throw new SerializationException("Can't serialize data='" + data, ex);
    }
}

On Thu, 1 Aug 2019 at 17:06, Svante Karlsson wrote:

> For clarity: What byte[] are we talking about?
>
> You are slightly missing my point if we are speaking about kafka.
>
> Confluent encoding:
> <magic byte 0> <schema_id (int32)> <avro_binary_payload>
>
> avro_binary_payload does not in any case contain the schema or schema id.
> The schema id is a confluent thing. (in an avrofile the schema is prepended
> by value in the file)
>
> While it's trivial to build a schema registry that, for example, instead
> gives you an md5 hash of the schema, you have to use it throughout your
> infrastructure OR use known reader and writer schemas (i.e. hardcoded).
>
> In the Confluent world the id=N is the N+1'th registered schema in the
> database (a kafka topic), if I remember right. Lose that database and you
> cannot read your kafka topics.
>
> So you have to use some other encoder, homegrown or not, that embeds either
> the full schema in every message (expensive) or some id. Does this make
> sense?
>
> /svante
>
> On Thu, 1 Aug 2019 at 16:38, Martin Mucha wrote:
>
>> Thanks for the answer.
>>
>> What I knew already is that each message _somehow_ carries either _some_
>> schema ID or the full schema. I saw some byte array manipulations to get a
>> _somehow_ defined schema ID out of the byte[], which worked, but that's
>> definitely not how it should be done. What I'm looking for is some
>> documentation on _how_ to do these things right. I really cannot find a
>> single thing, yet there must be some util functions, or anything. Is there
>> some devel-first-steps page where I can find answers to:
>>
>> * How do I test whether a byte[] contains the full schema or just an id?
>> * How do I control whether a message is serialized with the ID or with the
>> full schema?
>> * How do I get the ID from a byte[]?
>> * How do I get the full schema from a byte[]?
>>
>> I don't have the Confluent platform, and cannot have it, but implementing
>> "get schema by ID" should be an easy task, provided that I have that ID. In
>> my scenario I know that the message will be written using one schema, just
>> different versions of it. So I just need to know which version it is, so
>> that I can configure the deserializer to enable schema evolution.
>>
>

Re: AVRO schema evolution: adding optional column with default fails deserialization

2019-08-01 Thread Martin Mucha
Thanks for the answer.

What I knew already is that each message _somehow_ carries either _some_
schema ID or the full schema. I saw some byte array manipulations to get a
_somehow_ defined schema ID out of the byte[], which worked, but that's
definitely not how it should be done. What I'm looking for is some
documentation on _how_ to do these things right. I really cannot find a
single thing, yet there must be some util functions, or anything. Is there
some devel-first-steps page where I can find answers to:

* How do I test whether a byte[] contains the full schema or just an id?
* How do I control whether a message is serialized with the ID or with the full schema?
* How do I get the ID from a byte[]?
* How do I get the full schema from a byte[]?

I don't have the Confluent platform, and cannot have it, but implementing
"get schema by ID" should be an easy task, provided that I have that ID. In
my scenario I know that the message will be written using one schema, just
different versions of it. So I just need to know which version it is, so
that I can configure the deserializer to enable schema evolution.
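
In the meantime, the closest I have got to the first and third bullet myself
is sniffing the first bytes (a sketch; it assumes the producer used either
the Confluent wire format or Avro single-object encoding, and note that the
plain Avro binary we produce today carries no marker at all, so there is
nothing to detect in that case):

ByteBuffer buf = ByteBuffer.wrap(data);
if (data.length > 5 && data[0] == 0x00) {
    // Confluent wire format: magic byte 0x00 + big-endian int32 schema id + Avro binary
    int confluentSchemaId = buf.getInt(1);
} else if (data.length > 10 && data[0] == (byte) 0xC3 && data[1] == (byte) 0x01) {
    // Avro single-object encoding: marker C3 01 + little-endian CRC-64-AVRO
    // fingerprint of the writer schema + Avro binary
    long writerSchemaFingerprint = buf.order(ByteOrder.LITTLE_ENDIAN).getLong(2);
} else {
    // plain Avro binary: the writer schema is not represented in the bytes at all
}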

thanks in advance,
Martin

On Thu, 1 Aug 2019 at 15:55, Svante Karlsson wrote:

> In an Avro file the schema is at the beginning, but if you are referring to
> single-record serialization like Kafka, then you have to add something that
> you can use to get hold of the schema. Confluent's Avro encoder for Kafka
> uses Confluent's schema registry, which uses an int32 as the schema id. This
> is prepended (+ a magic byte) to the binary Avro. Thus, using the schema
> registry again, you can get the writer schema.
>
> /Svante
>
> On Thu, Aug 1, 2019, 15:30 Martin Mucha  wrote:
>
>> Hi,
>>
>> just one more question, not strictly related to the subject.
>>
>> Initially I thought I'd be OK with using some initial version of the schema
>> in place of the writer schema. That works, but all columns from schemas
>> older than this initial one would just be ignored. So I need to know
>> EXACTLY the schema which the writer used. I know that Avro messages contain
>> either the full schema or at least its ID. Can you point me to the
>> documentation where this is discussed? So in my deserializer I have a
>> byte[] as input, from which I need to get the schema information first, in
>> order to be able to deserialize the record. I really do not know how to do
>> that; I'm pretty sure I never saw this anywhere, and I cannot find it
>> anywhere. But in principle it must be possible, since the reader need not
>> necessarily have any control over which schema the writer used.
>>
>> thanks a lot.
>> M.
>>
>> On Tue, 30 Jul 2019 at 18:16, Martin Mucha wrote:
>>
>>> Thank you very much for in depth answer. I understand how it works now
>>> better, will test it shortly.
>>> Thank you for your time.
>>>
>>> Martin.
>>>
>>> On Tue, 30 Jul 2019 at 17:09, Ryan Skraba wrote:
>>>
>>>> Hello!  It's the same issue in your example code as allegro, even with
>>>> the SpecificDatumReader.
>>>>
>>>> This line: datumReader = new SpecificDatumReader<>(schema)
>>>> should be: datumReader = new SpecificDatumReader<>(originalSchema,
>>>> schema)
>>>>
>>>> In Avro, the original schema is commonly known as the writer schema
>>>> (the instance that originally wrote the binary data).  Schema
>>>> evolution applies when you are using the constructor of the
>>>> SpecificDatumReader that takes *both* reader and writer schemas.
>>>>
>>>> As a concrete example, if your original schema was:
>>>>
>>>> {
>>>>   "type": "record",
>>>>   "name": "Simple",
>>>>   "fields": [
>>>> {"name": "id", "type": "int"},
>>>> {"name": "name","type": "string"}
>>>>   ]
>>>> }
>>>>
>>>> And you added a field:
>>>>
>>>> {
>>>>   "type": "record",
>>>>   "name": "SimpleV2",
>>>>   "fields": [
>>>> {"name": "id", "type": "int"},
>>>> {"name": "name", "type": "string"},
>>>> {"name": "description","type": ["null", "string"]}
>>>>   ]
>>>> }
>>>>
>>>> You could do the following safely, assuming that Simple and SimpleV2
>>>> class

Re: AVRO schema evolution: adding optional column with default fails deserialization

2019-08-01 Thread Martin Mucha
Hi,

just one more question, not strictly related to the subject.

Initially I thought I'd be OK with using some initial version of the schema
in place of the writer schema. That works, but all columns from schemas older
than this initial one would just be ignored. So I need to know EXACTLY the
schema which the writer used. I know that Avro messages contain either the
full schema or at least its ID. Can you point me to the documentation where
this is discussed? So in my deserializer I have a byte[] as input, from which
I need to get the schema information first, in order to be able to
deserialize the record. I really do not know how to do that; I'm pretty sure
I never saw this anywhere, and I cannot find it anywhere. But in principle it
must be possible, since the reader need not necessarily have any control over
which schema the writer used.

thanks a lot.
M.

On Tue, 30 Jul 2019 at 18:16, Martin Mucha wrote:

> Thank you very much for in depth answer. I understand how it works now
> better, will test it shortly.
> Thank you for your time.
>
> Martin.
>
> On Tue, 30 Jul 2019 at 17:09, Ryan Skraba wrote:
>
>> Hello!  It's the same issue in your example code as allegro, even with
>> the SpecificDatumReader.
>>
>> This line: datumReader = new SpecificDatumReader<>(schema)
>> should be: datumReader = new SpecificDatumReader<>(originalSchema, schema)
>>
>> In Avro, the original schema is commonly known as the writer schema
>> (the instance that originally wrote the binary data).  Schema
>> evolution applies when you are using the constructor of the
>> SpecificDatumReader that takes *both* reader and writer schemas.
>>
>> As a concrete example, if your original schema was:
>>
>> {
>>   "type": "record",
>>   "name": "Simple",
>>   "fields": [
>> {"name": "id", "type": "int"},
>> {"name": "name","type": "string"}
>>   ]
>> }
>>
>> And you added a field:
>>
>> {
>>   "type": "record",
>>   "name": "SimpleV2",
>>   "fields": [
>> {"name": "id", "type": "int"},
>> {"name": "name", "type": "string"},
>> {"name": "description","type": ["null", "string"]}
>>   ]
>> }
>>
>> You could do the following safely, assuming that Simple and SimpleV2
>> classes are generated from the avro-maven-plugin:
>>
>> @Test
>> public void testSerializeDeserializeEvolution() throws IOException {
>>   // Write a Simple v1 to bytes using your exact method.
>>   byte[] v1AsBytes = serialize(new Simple(1, "name1"), true, false);
>>
>>   // Read as Simple v2, same as your method but with the writer and
>> reader schema.
>>   DatumReader<SimpleV2> datumReader =
>>   new SpecificDatumReader<>(Simple.getClassSchema(),
>> SimpleV2.getClassSchema());
>>   Decoder decoder = DecoderFactory.get().binaryDecoder(v1AsBytes, null);
>>   SimpleV2 v2 = datumReader.read(null, decoder);
>>
>>   assertThat(v2.getId(), is(1));
>>   assertThat(v2.getName(), is(new Utf8("name1")));
>>   assertThat(v2.getDescription(), nullValue());
>> }
>>
>> This demonstrates with two different schemas and SpecificRecords in
>> the same test, but the same principle applies if it's the same record
>> that has evolved -- you need to know the original schema that wrote
>> the data in order to apply the schema that you're now using for
>> reading.
>>
>> I hope this clarifies what you are looking for!
>>
>> All my best, Ryan
>>
>>
>>
>> On Tue, Jul 30, 2019 at 3:30 PM Martin Mucha  wrote:
>> >
>> > Thanks for answer.
>> >
>> > Actually I have exactly the same behavior with avro 1.9.0 and following
>> deserializer in our other app, which uses strictly avro codebase, and
>> failing with same exceptions. So lets leave "allegro" library and lots of
>> other tools out of it in our discussion.
>> > I can use whichever aproach. All I need is single way, where I can
>> deserialize byte[] into class generated by avro-maven-plugin, and which
>> will respect documentation regarding schema evolution. Currently we're
>> using following deserializer and serializer, and these does not work when
>> it comes to schema evolution. What is the correct way to serialize and
>> deserializer avro data?

Re: AVRO schema evolution: adding optional column with default fails deserialization

2019-07-30 Thread Martin Mucha
Thank you very much for the in-depth answer. I understand how it works
better now; I will test it shortly.
Thank you for your time.

Martin.

On Tue, 30 Jul 2019 at 17:09, Ryan Skraba wrote:

> Hello!  It's the same issue in your example code as allegro, even with
> the SpecificDatumReader.
>
> This line: datumReader = new SpecificDatumReader<>(schema)
> should be: datumReader = new SpecificDatumReader<>(originalSchema, schema)
>
> In Avro, the original schema is commonly known as the writer schema
> (the instance that originally wrote the binary data).  Schema
> evolution applies when you are using the constructor of the
> SpecificDatumReader that takes *both* reader and writer schemas.
>
> As a concrete example, if your original schema was:
>
> {
>   "type": "record",
>   "name": "Simple",
>   "fields": [
> {"name": "id", "type": "int"},
> {"name": "name","type": "string"}
>   ]
> }
>
> And you added a field:
>
> {
>   "type": "record",
>   "name": "SimpleV2",
>   "fields": [
> {"name": "id", "type": "int"},
> {"name": "name", "type": "string"},
> {"name": "description","type": ["null", "string"]}
>   ]
> }
>
> You could do the following safely, assuming that Simple and SimpleV2
> classes are generated from the avro-maven-plugin:
>
> @Test
> public void testSerializeDeserializeEvolution() throws IOException {
>   // Write a Simple v1 to bytes using your exact method.
>   byte[] v1AsBytes = serialize(new Simple(1, "name1"), true, false);
>
>   // Read as Simple v2, same as your method but with the writer and
> reader schema.
>   DatumReader<SimpleV2> datumReader =
>   new SpecificDatumReader<>(Simple.getClassSchema(),
> SimpleV2.getClassSchema());
>   Decoder decoder = DecoderFactory.get().binaryDecoder(v1AsBytes, null);
>   SimpleV2 v2 = datumReader.read(null, decoder);
>
>   assertThat(v2.getId(), is(1));
>   assertThat(v2.getName(), is(new Utf8("name1")));
>   assertThat(v2.getDescription(), nullValue());
> }
>
> This demonstrates with two different schemas and SpecificRecords in
> the same test, but the same principle applies if it's the same record
> that has evolved -- you need to know the original schema that wrote
> the data in order to apply the schema that you're now using for
> reading.
>
> I hope this clarifies what you are looking for!
>
> All my best, Ryan
>
>
>
> On Tue, Jul 30, 2019 at 3:30 PM Martin Mucha  wrote:
> >
> > Thanks for answer.
> >
> > Actually I have exactly the same behavior with avro 1.9.0 and following
> deserializer in our other app, which uses strictly avro codebase, and
> failing with same exceptions. So lets leave "allegro" library and lots of
> other tools out of it in our discussion.
> > I can use whichever aproach. All I need is single way, where I can
> deserialize byte[] into class generated by avro-maven-plugin, and which
> will respect documentation regarding schema evolution. Currently we're
> using following deserializer and serializer, and these does not work when
> it comes to schema evolution. What is the correct way to serialize and
> deserializer avro data?
> >
> > I probably don't understand your mention about GenericRecord or
> GenericDatumReader. I tried to use GenericDatumReader in deserializer
> below, but then it seems I got back just GenericData$Record instance, which
> I can use then to access array of instances, which is not what I'm looking
> for(IIUC), since in that case I could have just use plain old JSON and
> deserialize it using jackson having no schema evolution problems at all. If
> that's correct, I'd rather stick to SpecificDatumReader, and somehow fix it
> if possible.
> >
> > What can be done? Or how schema evolution is intended to be used? I
> found a lots of question searching for this answer.
> >
> > thanks!
> > Martin.
> >
> > deserializer:
> >
> > public static  T deserialize(Class
> targetType,
> >byte[]
> data,
> >boolean
> useBinaryDecoder) {
> > try {
> > if (data == null) {
> > return null;
> > }
> >
> > log.trace("data='{}'",
> DatatypeConverter.pri

Re: AVRO schema evolution: adding optional column with default fails deserialization

2019-07-30 Thread Martin Mucha
Thanks for the answer.

Actually I have exactly the same behavior with Avro 1.9.0 and the following
deserializer in our other app, which uses strictly the Avro codebase, and it
fails with the same exceptions. So let's leave the "allegro" library and lots
of other tools out of our discussion.
I can use whichever approach. All I need is a single way in which I can
deserialize a byte[] into a class generated by avro-maven-plugin, and which
will respect the documentation regarding schema evolution. Currently we're
using the following deserializer and serializer, and these do not work when
it comes to schema evolution. What is the correct way to serialize and
deserialize Avro data?

I probably don't understand your mention of GenericRecord or
GenericDatumReader. I tried to use GenericDatumReader in the deserializer
below, but then it seems I got back just a GenericData$Record instance, which
I can then use to access an array of instances, which is not what I'm looking
for (IIUC), since in that case I could have just used plain old JSON and
deserialized it using Jackson, with no schema evolution problems at all. If
that's correct, I'd rather stick with SpecificDatumReader and somehow fix it
if possible.

What can be done? Or how is schema evolution intended to be used? I found a
lot of questions while searching for this answer.

thanks!
Martin.

deserializer:

public static <T extends SpecificRecordBase> T deserialize(Class<T> targetType,
                                                           byte[] data,
                                                           boolean useBinaryDecoder) {
    try {
        if (data == null) {
            return null;
        }

        log.trace("data='{}'", DatatypeConverter.printHexBinary(data));

        Schema schema = targetType.newInstance().getSchema();
        DatumReader<T> datumReader = new SpecificDatumReader<>(schema);
        Decoder decoder = useBinaryDecoder
                ? DecoderFactory.get().binaryDecoder(data, null)
                : DecoderFactory.get().jsonDecoder(schema, new String(data));

        T result = targetType.cast(datumReader.read(null, decoder));
        log.trace("deserialized data='{}'", result);
        return result;
    } catch (Exception ex) {
        throw new SerializationException("Error deserializing data", ex);
    }
}

serializer:
public static <T extends SpecificRecordBase> byte[] serialize(T data,
                                                              boolean useBinaryDecoder,
                                                              boolean pretty) {
    try {
        if (data == null) {
            return new byte[0];
        }

        log.debug("data='{}'", data);
        Schema schema = data.getSchema();
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        Encoder binaryEncoder = useBinaryDecoder
                ? EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null)
                : EncoderFactory.get().jsonEncoder(schema, byteArrayOutputStream, pretty);

        DatumWriter<T> datumWriter = new GenericDatumWriter<>(schema);
        datumWriter.write(data, binaryEncoder);

        binaryEncoder.flush();
        byteArrayOutputStream.close();

        byte[] result = byteArrayOutputStream.toByteArray();
        log.debug("serialized data='{}'", DatatypeConverter.printHexBinary(result));
        return result;
    } catch (IOException ex) {
        throw new SerializationException("Can't serialize data='" + data, ex);
    }
}
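
Based on Ryan's hint quoted below about needing both schemas, the direction
I'll try next looks roughly like this (a sketch only; writerSchema must be
the schema the bytes were actually written with, e.g. parsed from the old
.avsc, while the generated class supplies the reader schema):

Schema readerSchema = targetType.newInstance().getSchema();  // current, generated schema
DatumReader<T> datumReader = new SpecificDatumReader<>(writerSchema, readerSchema);
Decoder decoder = DecoderFactory.get().binaryDecoder(data, null);
T result = targetType.cast(datumReader.read(null, decoder));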

On Tue, 30 Jul 2019 at 13:48, Ryan Skraba wrote:

> Hello!  Schema evolution relies on both the writer and reader schemas
> being available.
>
> It looks like the allegro tool you are using is using the
> GenericDatumReader that assumes the reader and writer schema are the
> same:
>
>
> https://github.com/allegro/json-avro-converter/blob/json-avro-converter-0.2.8/converter/src/main/java/tech/allegro/schema/json2avro/converter/JsonAvroConverter.java#L83
>
> I do not believe that the "default" value is taken into account for
> data that is strictly missing from the binary input, just when a field
> is known to be in the reader schema but missing from the original
> writer.
>
> You may have more luck reading the GenericRecord with a
> GenericDatumReader with both schemas, and using the
> `convertToJson(record)`.
>
> I hope this is useful -- Ryan
>
>
>
> On Tue, Jul 30, 2019 at 10:20 AM Martin Mucha  wrote:
> >
> > Hi,
> >
> > I've got some issues/misunderstanding of AVRO schema evolution.
> >
> > When reading through avro documentation, for example [1], I understood,
> that schema evolution is supported, and if I added column with specified
> default, it should be backwards compatible (and even forward when I remove
> it again). Sounds great, so I added column defined as:
>

Re: is it possible to deserialize JSON with optional field?

2019-04-16 Thread Martin Mucha
Hi, thanks for responding.

I know that you promote your fork; however, considering I might not be able
to move away from the "official release", is there an easy way to consume
this? I cannot see one ...

Maybe a side question: official Avro seems to be dead. There are some commits
being made, but the last release happened 2 years ago, fatal flaws are not
being addressed, almost 10-year-old valid bug reports are just ignored, ...
Does anyone know of any sign/confirmation that the Avro community will be
moving toward something more viable?

M.

On Mon, 15 Apr 2019 at 15:17, Zoltan Farkas wrote:

> It is possible to do it with a custom JsonDecoder.
>
> I wrote one that does this at:
> https://github.com/zolyfarkas/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/io/ExtendedJsonDecoder.java
>
>
> hope it helps.
>
>
> —Z
>
> On Apr 13, 2019, at 9:24 AM, Martin Mucha  wrote:
>
> Hi,
>
> is it possible by design to deserialize JSON with schema having optional
> value?
> Schema:
>
> {
>  "type" : "record",
>  "name" : "UserSessionEvent",
>  "namespace" : "events",
>  "fields" : [ {
>"name" : "username",
>"type" : "string"
>  }, {
>"name" : "errorData",
>"type" : [ "null", "string" ],
>"default" : null
>  }]}
>
> Value to deserialize:
>
> {"username" : "2271AE67-34DE-4B43-8839-07216C5D10E1"}
>
> I also tried to change order of type, but that changed nothing. I know I
> can produce ill-formated JSON which could be deserialized, but that's not
> acceptable. AFAIK given JSON with required `username` and optional
> `errorData` cannot be deserialized by design. Am I right?
>
> thanks.
>
>
>


is it possible to deserialize JSON with optional field?

2019-04-13 Thread Martin Mucha
Hi,

Is it possible, by design, to deserialize JSON with a schema having an
optional value?
Schema:

{
  "type" : "record",
  "name" : "UserSessionEvent",
  "namespace" : "events",
  "fields" : [ {
    "name" : "username",
    "type" : "string"
  }, {
    "name" : "errorData",
    "type" : [ "null", "string" ],
    "default" : null
  }]}

Value to deserialize:

{"username" : "2271AE67-34DE-4B43-8839-07216C5D10E1"}

I also tried to change the order of the union types, but that changed
nothing. I know I can produce ill-formatted JSON which could be deserialized,
but that's not acceptable. AFAIK the given JSON, with required `username` and
optional `errorData`, cannot be deserialized, by design. Am I right?
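
For reference, what the standard jsonDecoder does accept for this schema is
the Avro JSON encoding, where every field is present and a non-null union
value is wrapped in an object keyed by its branch name; a sketch (the "boom"
value is just an example):

Schema schema = new Schema.Parser().parse(schemaJson);  // the UserSessionEvent schema above

// accepted: errorData present and null
String ok1 = "{\"username\": \"2271AE67-34DE-4B43-8839-07216C5D10E1\", \"errorData\": null}";
// accepted: non-null union value wrapped as {"string": ...}
String ok2 = "{\"username\": \"2271AE67-34DE-4B43-8839-07216C5D10E1\", \"errorData\": {\"string\": \"boom\"}}";

Decoder decoder = DecoderFactory.get().jsonDecoder(schema, ok1);
GenericRecord record = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);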

thanks.


Re: how to do non-optional field in nested object?

2017-11-27 Thread Martin Mucha
Following schema:
{
  "name" : "ARecord",
  "type" : "record",
  "namespace" : "AAA",
  "fields" : [
{"name": "A", "type": "string" }
  ]
}


does not validate the JSON

{}

as valid. It's invalid: A is required, as I would expect.

---
I'm getting lost here.
What is the resolution? I kinda wanted to use an Avro schema to validate the
described JSON, and now, based on what you said, IIUC validation in "XML
Schema style" ("1 of this, followed by 2 of that") is not possible with an
Avro schema. Correct?


Btw. what do you use to validate JSON against Avro? I used avro-utils.jar
executed from the command line, which proved incapable of deserializing &
validating optional fields if they are set (if an optional field is set, I
have to pass the value in JSON like: {"string":"value"}). So now I'm using
the actual flow through NiFi for testing purposes, which is extremely
cumbersome.
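
In plain Java, the simplest "validator" I know of is to just attempt a decode
and treat an exception as a validation failure; a sketch (the file name is
made up, and the JSON has to follow Avro's JSON encoding, i.e. with set
optional fields wrapped as {"string":"value"} as mentioned above):

Schema schema = new Schema.Parser().parse(new File("ARecord.avsc"));
try {
    Decoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
    new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    // decoded fine -> the JSON matches the schema
} catch (IOException | AvroTypeException e) {
    // does not match the schema
}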

2017-11-27 16:19 GMT+01:00 Dan Schmitt <dan.schm...@gmail.com>:

> The top level object in all the examples is a record (of which you can
> have 0 or more.)
>
> So, right now, even the top level is failing the spec:
>
> IV) valid (0 ARecords):
> { }
>
> V) valid (2 ARecords):
> {
>   "id": "...",
>   "B": {
> "C": "..."
>   }
> } ,
> "id": "...",
>   "B": {
> "C": "..."
>   }
> }
>
> On Mon, Nov 27, 2017 at 9:47 AM, Martin Mucha <alfon...@gmail.com> wrote:
> > I don't understand where "or more" is comming from.
>
> Because of the use of nested records.  Anywhere you put a record you can
> have 0,
> or more than 1.  I don't know of a pure schema way to enforce only
> 1/optional.
>
> If you are doing something with any of the APIs you could add a validation
> step
> that says "must have only 1 ARecord, and must have only 1 BRecord" after
> you
> read the data and throw an error for the 0 or 1+ situations, but you'd
> need to write
> some code somewhere with one of the APIs and build your own validator.
>


Re: how to do non-optional field in nested object?

2017-11-27 Thread Martin Mucha
I don't understand where "or more" is coming from.

IIUC (and I may not), there's just one top-level JSON object. If so, there's
1 ARecord. ARecord has a required ID field, thus has 1 ID field. Then it has
a second field, BRecord, one of them. This BRecord has 2 required fields, B
and C, each of which should come exactly once. Right?

What I'm after is:

I) valid:


{
  "id": "..."
}

II) invalid:

{
  "id": "...",
  "B": {

  }
}

III) valid:

{
  "id": "...",
  "B": {
"C": "..."
  }
}

IV) everything else is invalid.

2017-11-27 15:39 GMT+01:00 Dan Schmitt <dan.schm...@gmail.com>:

> The problem is BRecord can be 0 or more and you still end up with the a
> valid B.
>
> How about
>
> {
>   "name" : "ARecord",
>   "type" : "record",
>   "namespace" : "A",
>   "fields" : [
> {"name": "id", "type": "string" },
> {
>   "name": "BRecord",
>   "type": "record",
>   "fields": [
>   { "name": "B", "type": "string" },
>   { "name": "C", "type": "string" }
>     ]
>   }
> }
>   ]
> }
>
> This gives me 0 or more ARecords, each with an id, and 0 or more BRecords
> associated with each ARecord, each with a B and C.  If you wanted one or
> more
> C's I don't see a trivial clean way to do that (you could add a
> Cextras array to the
> BRecord to get 0 or more additional C things, but that feels unclean.)
>
>
> On Mon, Nov 27, 2017 at 9:10 AM, Martin Mucha <alfon...@gmail.com> wrote:
> > Thanks for reply.
> >
> > Sadly it does not work that way (here). Even:
> >
> > {
> >   "name" : "ARecord",
> >   "type" : "record",
> >   "namespace" : "A",
> >   "fields" : [
> > {"name": "id", "type": "string" },
> > {
> >   "name": "B",
> >   "type":  {
> >     "type": "record",
> > "name": "BRecord",
> > "fields": [
> >   {
> > "name": "C",
> > "type": "string"
> >   }
> > ]
> >   }
> > }
> >   ]
> > }
> >
> > does not require C. And that's not what I want ... I'd like optional B,
> and
> > once user provide B, then B.C is required.
> >
> > Martin.
> >
> >
> > 2017-11-27 15:06 GMT+01:00 Dan Schmitt <dan.schm...@gmail.com>:
> >>
> >>   "name": "B",
> >>   "type": ["null", {
> >>
> >> The [] union lets you do null or a BRecord, your JSON does null.
> >> Pull the null from the union and it will require the C.
> >>
> >> On Mon, Nov 27, 2017 at 9:00 AM, Martin Mucha <alfon...@gmail.com>
> wrote:
> >> > Hi,
> >> >
> >> > I have this avro schema:
> >> >
> >> > {
> >> >   "name" : "ARecord",
> >> >   "type" : "record",
> >> >   "namespace" : "A",
> >> >   "fields" : [
> >> > {"name": "id", "type": "string" },
> >> > {
> >> >   "name": "B",
> >> >   "type": ["null", {
> >> > "type": "record",
> >> > "name": "BRecord",
> >> > "fields": [
> >> >   {
> >> > "name": "C",
> >> > "type": "string"
> >> >   }
> >> > ]
> >> >   }]
> >> > }
> >> >   ]
> >> > }
> >> >
> >> >
> >> > and following JSON, which validates against it:
> >> >
> >> > {
> >> >   "id": "...",
> >> >   "B": {
> >> >
> >> >   }
> >> > }
> >> >
> >> >
> >> > I would expect, that C is required. Why it's not? What shall I do to
> >> > make it
> >> > required?
> >> >
> >> > Thanks!
> >> > Martin.
> >
> >
>


Re: how to do non-optional field in nested object?

2017-11-27 Thread Martin Mucha
Thanks for the reply.

Sadly it does not work that way (here). Even:

{
  "name" : "ARecord",
  "type" : "record",
  "namespace" : "A",
  "fields" : [
{"name": "id", "type": "string" },
{
  "name": "B",
  "type":  {
"type": "record",
"name": "BRecord",
"fields": [
  {
"name": "C",
"type": "string"
  }
]
  }
}
  ]
}

does not require C. And that's not what I want ... I'd like an optional B,
and once the user provides B, then B.C is required.

Martin.


2017-11-27 15:06 GMT+01:00 Dan Schmitt <dan.schm...@gmail.com>:

>   "name": "B",
>   "type": ["null", {
>
> The [] union lets you do null or a BRecord, your JSON does null.
> Pull the null from the union and it will require the C.
>
> On Mon, Nov 27, 2017 at 9:00 AM, Martin Mucha <alfon...@gmail.com> wrote:
> > Hi,
> >
> > I have this avro schema:
> >
> > {
> >   "name" : "ARecord",
> >   "type" : "record",
> >   "namespace" : "A",
> >   "fields" : [
> > {"name": "id", "type": "string" },
> > {
> >   "name": "B",
> >   "type": ["null", {
> > "type": "record",
> > "name": "BRecord",
> > "fields": [
> >   {
> > "name": "C",
> > "type": "string"
> >   }
> > ]
> >   }]
> > }
> >   ]
> > }
> >
> >
> > and following JSON, which validates against it:
> >
> > {
> >   "id": "...",
> >   "B": {
> >
> >   }
> > }
> >
> >
> > I would expect, that C is required. Why it's not? What shall I do to
> make it
> > required?
> >
> > Thanks!
> > Martin.
>


how to do non-optional field in nested object?

2017-11-27 Thread Martin Mucha
Hi,

I have this avro schema:

{
  "name" : "ARecord",
  "type" : "record",
  "namespace" : "A",
  "fields" : [
{"name": "id", "type": "string" },
{
  "name": "B",
  "type": ["null", {
"type": "record",
"name": "BRecord",
"fields": [
  {
"name": "C",
"type": "string"
  }
]
  }]
}
  ]
}


and the following JSON, which validates against it:

{
  "id": "...",
  "B": {

  }
}


I would expect that C is required. Why isn't it? What shall I do to make
it required?

Thanks!
Martin.