Re: Companies using Apache Avro

2021-01-26 Thread Zoltan Farkas
I think LinkedIn is a fairly well-known user…
They have several blog posts on how they use Avro, and some related work like: 
https://github.com/linkedin/avro-util 

—Z

> On Jan 25, 2021, at 3:27 PM, Juan Cruz Viotti  wrote:
> 
> Hey there!
> 
> Do you know where I can find a list of relatively well-known companies
> that make use of Apache Avro? I'm trying to collect a small list for
> research purposes and my search is not yielding many results apart from
> Facebook.
> 
> Thanks in advance,
> 
> -- 
> Juan Cruz Viotti
> Software Engineer
> https://www.jviotti.com



Re: Avro vs Openapi

2020-08-19 Thread Zoltan Farkas
Hi Rune,

I have some code on GitHub where I experiment with Avro, OpenAPI, and more;
it might help you.

Here is the OpenAPI model converter implementation (Avro schema to OpenAPI
model):
https://github.com/zolyfarkas/spf4j-jaxrs/blob/master/spf4j-jaxrs-open-api/src/main/java/org/spf4j/actuator/openApi/AvroModelConverter.java

Here is an example (demo) project that uses it:
https://github.com/zolyfarkas/jaxrs-spf4j-demo (see the wiki for contents)
You can also see it running at: https://demo.spf4j.org/apiBrowser/index.html


hope it helps…

cheers.

—Z


> On Aug 18, 2020, at 12:31 PM, Rune Gellein  wrote:
> 
> Hi,
> I am creating a restful webservice, yes.  OpenAPI is working well.
> 
> One thing that is a bit awkward in OpenAPI is extending a type. As far as I 
> can work out, it all has to be in the same file there. I was hoping this is 
> easier in Avro where you can use multiple files for the schema (at least I 
> think you can). 
> 
> But then there is also the problem that I haven't been able to do the code 
> generation of this swagger schema with Avro...
> 
> regards,
> Rune
> 
> On 2020/08/18 03:46:41, Patrick Farry  wrote: 
>> Hi Rune,
>> 
>> Are you doing OpenApi for Rest API’s? If so, code generation using OpenAPI 
>> code gen is straightforward and easy to customize. 
>> 
>> If you are looking to serialize objects for streaming, messaging or for 
>> file/object storage then Avro might be the right thing.
>> 
>> 
>> Sent from my iPhone
>> 
>>> On Aug 14, 2020, at 7:54 AM, Rune Gellein  wrote:
>>> 
>>> Hi,
>>> I am relatively new to the world of Json schemas.  I have been tasked with 
>>> doing the code generation for a swagger schema we are about to start to use 
>>> where I am working.
>>> The code generation is fine with OpenAPI. However, I think Avro might have 
>>> some advantages when it comes to extensions, so I wanted to try that too.
>>> Using the latest XMLSpy from Altova I managed to validate a test message 
>>> against the swagger schema if I loaded it as an Avro schema.  However when 
>>> I try with the avro-maven-plugin I get errors.
>>> 
>>> Any idea why?  Are Swagger schemas meant to work with Avro?  Are they 
>>> compatible?  It has been difficult to find any information on this on 
>>> Google.
>>> I think XMLSpy is using Avro 1.8 and my plugin is version 1.10.
>>> 
>>> regards,
>>> Rune
>> 



Re: Decimal type, limitation on scale

2020-03-02 Thread Zoltan Farkas
+dev (adding the dev mailing list; maybe somebody there can explain the reasoning).

When comparing SQL Server with Oracle and Postgres: 

https://docs.microsoft.com/en-us/sql/t-sql/data-types/decimal-and-numeric-transact-sql?view=sql-server-ver15

https://docs.oracle.com/cd/A84870_01/doc/server.816/a76965/c10datyp.htm#743 

https://www.postgresql.org/docs/9.1/datatype-numeric.html 



Oracle allows for a negative scale; SQL Server does not.
My biggest issue with the current decimal spec is that it does not encode the
scale (it uses the scale defined in the schema); as such, it cannot accommodate an
Oracle or Postgres NUMBER without scale coercion.
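
For illustration, a minimal sketch (standard Avro APIs; the class name is mine)
of why the coercion is needed: the encoding carries only the unscaled bytes, so
a value's scale must already match the schema's.

import java.math.BigDecimal;
import java.nio.ByteBuffer;
import org.apache.avro.Conversions;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class DecimalScaleSketch {
  public static void main(String[] args) {
    // decimal(9,2): the scale lives only in the schema; the encoding
    // carries just the unscaled two's-complement bytes.
    Schema schema = LogicalTypes.decimal(9, 2)
        .addToSchema(Schema.create(Schema.Type.BYTES));
    Conversions.DecimalConversion conv = new Conversions.DecimalConversion();

    ByteBuffer ok = conv.toBytes(new BigDecimal("123.45"), schema,
        schema.getLogicalType());

    // A value with a different scale (e.g. from an unconstrained Oracle or
    // Postgres NUMBER) must be coerced first; otherwise the conversion fails:
    ByteBuffer coerced = conv.toBytes(new BigDecimal("123.4").setScale(2),
        schema, schema.getLogicalType());
  }
}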

There are other differences as well (like NaN, …).

But there is no reason why a decimal2 logical type could not be created to
address the above…

Or, even better, promote decimal to a first-class type:
https://issues.apache.org/jira/browse/AVRO-2164


—Z

> On Mar 2, 2020, at 2:34 PM, Christopher Egerton  wrote:
> 
> Hi all,
> 
> I've been trying to do some research on the logical decimal type and why the 
> scale of a decimal type must be between zero and the precision of the type, 
> inclusive. The ticket https://issues.apache.org/jira/browse/AVRO-1402 
>  has a lot of discussion 
> around the design of the type, but I haven't been able to find any rationale 
> for the limitations on the scale of the type.
> 
> These don't appear to align with existing conventions for precision and scale 
> in the context of SQL numeric types, the JDBC API, and the Java standard 
> library's BigDecimal class. In these contexts, the precision must be a 
> positive number, but the scale can be any value--positive (representing the 
> number of digits of precision that are available after the decimal point), 
> negative (representing the number of trailing zeroes at the end of the number 
> before an implicit decimal point), or zero. It is not bounded by the 
> precision of the type.
> 
> The definitions for scale and precision appear to align across these 
> contexts, including the Avro spec, so I'm curious as to why the Avro 
> spec--seemingly an anomaly--is the only one to declare these limitations on 
> what the scale of a decimal type can be.
> 
> Does anyone know why these exist, and if not, would it be okay to file a 
> ticket to remove them from the spec and begin work on it?
> 
> Cheers,
> 
> Chris



Re: More idiomatic JSON encoding for unions

2020-01-16 Thread Zoltan Farkas
I have hacked logical types in my fork to add this capability; if you want to
take a look, see:
https://github.com/zolyfarkas/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/LogicalType.java#L78

My goal was to make decimal a number in JSON.
It is a hack; it works, but it won't win any beauty contests :-) and right now
I don't see how to make it clean enough to be accepted mainstream.

It would be a lot cleaner to elevate these logical types to first-class types
and standardize the encoding appropriately.
Decimal clearly needs to be a first-class type; I'm not sure about
timestamp-micros...

—Z


> On Jan 16, 2020, at 2:20 PM, roger peppe  wrote:
> 
> On Thu, 16 Jan 2020, 18:59 Zoltan Farkas wrote:
> answers inline
> 
>> On Jan 16, 2020, at 5:51 AM, roger peppe wrote:
>> 
>> On Wed, 15 Jan 2020 at 18:51, Zoltan Farkas wrote:
>> What I mean with timestamp-micros, is that it is currently restricted to 
>> being bound to long,
>> I see no reason why it should not be allowed to be bound to string as well. 
>> (the change should be simple to implement)
>> 
>> Wouldn't have the implication of changing the binary representation too, 
>> which is not necessarily desirable (it's bulkier, slower to decode and has 
>> more potential error cases) ?
> 
> Yes, it would, but this is how logical types work, and I see no good way to 
> change this. (This is what I meant by paying the readability cost in places 
> where it is irrelevant.)
> 
> So you think that the JSON representation should always match the underlying 
> type and ignore the logical type? I can understand the reasoning behind that, 
> but it doesn't feel very user friendly in some cases (thinking of decimal and 
> duration in particular).
> 
> Given their privileged place in the specification, I was thinking that some 
> logical types could gain privilege here.
> 
> Aside: I'm a bit concerned about the potential for data corruption from 
> interchange between timestamp-micros and timestamp-millis, which, as far as 
> understand the spec, look like they'll be treated as compatible with each 
> other.
> 
> 
>> 
>> 
>> regarding the media type, something like: application/avro.2+json would be 
>> fine.
>> 
>> Attaching the ".2" to "avro" rather than "json" seems to be implying a new 
>> Avro version, rather than a new JSON-encoding version? Or is the idea that 
>> the version number here is implying both the JSON-encoding version and the 
>> underlying Avro version?  The MIME standard seems to be silent on this 
>> AFAICS.
>> 
> 
> the reason why I would use +json at the end is that it would be a subtype 
> suffix (https://en.wikipedia.org/wiki/Media_type#Suffix), and most browsers 
> will recognize it as JSON and potentially format it...
> 
> Ah, nice, I wasn't aware of RFC 6838.
> 
>> 
>> Other than that the proposal looks good. Can you start a PR with the spec 
>> update?
>> 
>> I can do, but I don't hold out much hope of it getting merged. I started a 
>> PR with a much more minor change <https://github.com/apache/avro/pull/738> 
>> almost 2 months ago and haven't seen any response yet.
> 
> Send out an email on the dev mailing list; the committers seem more responsive 
> lately...
> 
> I'll give it a go :)
> 
>   cheers,
> rog.
> 
>> 
>>   cheers,
>> rog.
>> 
>> —Z
>> 
>>> On Jan 15, 2020, at 12:30 PM, roger peppe wrote:
>>> 
>>> On Wed, 15 Jan 2020 at 16:27, Zoltan Farkas wrote:
>>> See comments in-line below:
>>> 
>>>> On Jan 15, 2020, at 3:42 AM, roger peppe wrote:
>>>> 
>>>> Oops, I left arrays out! Two other thoughts: 
>>>> 
>>>> I wonder if it might be worth hedging bets about logical types. It would 
>>>> be nice if (for example) a `timestamp-micros` value could be encoded as an 
>>>> RFC3339 string, so perhaps that should be allowed for, but maybe that's a 
>>>> step too far.
>>> I think logical types should stay above the encoding/decoding…  
>>> With timestamp-micros we could extend it to make it applicable to string and 
>>> implement the converters, and then in json you would have something readable, 
>>> but you would then have the same in binary and pay the readability cost there 
>>> as well.

Re: More idiomatic JSON encoding for unions

2020-01-16 Thread Zoltan Farkas
answers inline

> On Jan 16, 2020, at 5:51 AM, roger peppe  wrote:
> 
> On Wed, 15 Jan 2020 at 18:51, Zoltan Farkas wrote:
> What I mean with timestamp-micros, is that it is currently restricted to 
> being bound to long,
> I see no reason why it should not be allowed to be bound to string as well. 
> (the change should be simple to implement)
> 
> Wouldn't have the implication of changing the binary representation too, 
> which is not necessarily desirable (it's bulkier, slower to decode and has 
> more potential error cases) ?

Yes, it would, but this is how logical types work, and I see no good way to
change this. (This is what I meant by paying the readability cost in places
where it is irrelevant.)

> 
> 
> regarding the media type, something like: application/avro.2+json would be 
> fine.
> 
> Attaching the ".2" to "avro" rather than "json" seems to be implying a new 
> Avro version, rather than a new JSON-encoding version? Or is the idea that 
> the version number here is implying both the JSON-encoding version and the 
> underlying Avro version?  The MIME standard seems to be silent on this AFAICS.
> 

The reason why I would use +json at the end is that it would be a subtype
suffix (https://en.wikipedia.org/wiki/Media_type#Suffix), and most browsers
will recognize it as JSON and potentially format it...

> 
> Other than that the proposal looks good. Can you start a PR with the spec 
> update?
> 
> I can do, but I don't hold out much hope of it getting merged. I started a PR 
> with a much more minor change <https://github.com/apache/avro/pull/738> 
> almost 2 months ago and haven't seen any response yet.

Send out an email on the dev mailing list; the committers seem more responsive 
lately...

> 
>   cheers,
>     rog.
> 
> —Z
> 
>> On Jan 15, 2020, at 12:30 PM, roger peppe wrote:
>> 
>> On Wed, 15 Jan 2020 at 16:27, Zoltan Farkas wrote:
>> See comments in-line below:
>> 
>>> On Jan 15, 2020, at 3:42 AM, roger peppe wrote:
>>> 
>>> Oops, I left arrays out! Two other thoughts: 
>>> 
>>> I wonder if it might be worth hedging bets about logical types. It would be 
>>> nice if (for example) a `timestamp-micros` value could be encoded as an 
>>> RFC3339 string, so perhaps that should be allowed for, but maybe that's a 
>>> step too far.
>> I think logical types should stay above the encoding/decoding…  
>> With timestamp-micros we could extend it to make it applicable to string and 
>> implement the converters, and then in json you would have something 
>> readable, but you would then have the same in binary and pay the readability 
>> cost there as well.
>> 
>> I'm not sure what you mean there. I wouldn't expect the Avro binary format 
>> to be readable at all.
>> 
>> I implemented special handling for decimal logical type in my 
>> encoder/decoder, but the best implementation I could do still feels like a 
>> hack...
>> 
>>> I wonder if there should be some indication of version so that you know 
>>> which JSON encoding version you're reading. Perhaps the Avro schema could 
>>> include a version field (maybe as part of a definition) so you know which 
>>> version of the spec to use when encoding/decoding. Then bet-hedging 
>>> wouldn't be quite as important.
>> I think Schema needs to stay decoupled from the encoding. The same schema 
>> can be encoded in various ways (I have a csv encoder/decoder for example, 
>> https://demo.spf4j.org/example/records?_Accept=text/csv 
>> ).
>> I think the right abstraction for what you are looking for is the Media 
>> Type (https://en.wikipedia.org/wiki/Media_type), 
>> It would be helpful to “standardize” the media types for the avro encodings:
>> 
>> Yes, on reflection, I agree, even though not every possible medium has a 
>> media type. For example, what if we're storing JSON data in a file? I guess 
>> it would be up to us to store the type along with the data, as the registry 
>> message wire format 
>> <https://docs.confluent.io/current/schema-registry/serializer-formatter.html#wire-format>
>>  does, for example by wrapping the entire value in another JSON object.
>>  
>> Here is what I mean (with some examples where the same schema is served with 
>> different encodings):

Re: More idiomatic JSON encoding for unions

2020-01-15 Thread Zoltan Farkas
What I mean with timestamp-micros is that it is currently restricted to being
bound to long; I see no reason why it should not be allowed to be bound to
string as well. (The change should be simple to implement.)
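
Concretely (the second binding is the one proposed here; it is not in the
current spec):

  current:           {"type": "long",   "logicalType": "timestamp-micros"}
  proposed addition: {"type": "string", "logicalType": "timestamp-micros"}   (e.g. holding RFC3339 text)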

Regarding the media type, something like application/avro.2+json would be fine.

Other than that the proposal looks good. Can you start a PR with the spec 
update?

—Z

> On Jan 15, 2020, at 12:30 PM, roger peppe  wrote:
> 
> On Wed, 15 Jan 2020 at 16:27, Zoltan Farkas wrote:
> See comments in-line below:
> 
>> On Jan 15, 2020, at 3:42 AM, roger peppe wrote:
>> 
>> Oops, I left arrays out! Two other thoughts: 
>> 
>> I wonder if it might be worth hedging bets about logical types. It would be 
>> nice if (for example) a `timestamp-micros` value could be encoded as an 
>> RFC3339 string, so perhaps that should be allowed for, but maybe that's a 
>> step too far.
> I think logical types should stay above the encoding/decoding…  
> With timestamp-micros we could extend it to make it applicable to string and 
> implement the converters, and then in json you would have something readable, 
> but you would then have the same in binary and pay the readability cost there 
> as well.
> 
> I'm not sure what you mean there. I wouldn't expect the Avro binary format to 
> be readable at all.
> 
> I implemented special handling for decimal logical type in my 
> encoder/decoder, but the best implementation I could do still feels like a 
> hack...
> 
>> I wonder if there should be some indication of version so that you know 
>> which JSON encoding version you're reading. Perhaps the Avro schema could 
>> include a version field (maybe as part of a definition) so you know which 
>> version of the spec to use when encoding/decoding. Then bet-hedging wouldn't 
>> be quite as important.
> I think Schema needs to stay decoupled from the encoding. The same schema can 
> be encoded in various ways (I have a csv encoder/decoder for example, 
> https://demo.spf4j.org/example/records?_Accept=text/csv 
> ).
> I think the right abstraction for what you are looking for is the Media 
> Type (https://en.wikipedia.org/wiki/Media_type), 
> It would be helpful to “standardize” the media types for the avro encodings:
> 
> Yes, on reflection, I agree, even though not every possible medium has a 
> media type. For example, what if we're storing JSON data in a file? I guess 
> it would be up to us to store the type along with the data, as the registry 
> message wire format 
> <https://docs.confluent.io/current/schema-registry/serializer-formatter.html#wire-format>
>  does, for example by wrapping the entire value in another JSON object.
>  
> Here is what I mean, (with some examples where the same schema is served with 
> different encodings):
> 
> 1) Binary: “application/avro” 
> https://demo.spf4j.org/example/records?_Accept=application/avro
> 2) Current Json: “application/avro+json" 
> https://demo.spf4j.org/example/records?_Accept=application/avro%2Bjson
> 3) New Json: “application/avro-x+json” ?  
> https://demo.spf4j.org/example/records?_Accept=application/avro-x%2Bjson
> 
> ISTM that "x" isn't a hugely descriptive qualifier there. How about 
> "application/avro+json.v2" ? Then it's clear what to do if we want to make 
> another version.
> 
>  
> The media type including the avro schema (like you can see in the response 
> ContentType in the headers above) can provide complete type information to 
> be able to read an Avro object from a byte stream.
> 
> application/avro-x+json;avsc="{\"type\":\"array\",\"items\":{\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.8:b\"}}”
> 
> In HTTP context this fits well with content negotiation, and a client can ask 
> for a previous version like:
> 
> https://demo.spf4j.org/example/records/1?_Accept=application/json;avsc=%22{\%22$ref\%22:\%22org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b\%22}%22
> 
> Note on $ref: it is an extension to avsc I use to reference schemas from 
> maven repos. (see 
> https://githu

Re: More idiomatic JSON encoding for unions

2020-01-15 Thread Zoltan Farkas
> A union is considered unambiguous if the JSON type sets for all the members 
> of the union form mutually disjoint sets. 
>  
> Note that float and double are considered ambiguous with respect to string 
> because in the future, Avro might support encoding NaN and infinity values as 
> strings.

LGTM, let's put this in a PR that covers the spec only.
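
To spell out the rule with the examples from this thread: ["null", "string",
"int"] has the token sets {null}, {string}, {number}, which are pairwise
disjoint, so its values would encode unwrapped (null, "foo", 999); ["int",
"double"] has {number} and {number, string}, which intersect, so the wrapped
form ({"int": 1}) remains required; and per the note above, ["string", "float"]
also counts as ambiguous, because float's set is {number, string}.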

> 
> On Tue, 14 Jan 2020 at 21:57, roger peppe wrote:
> On Tue, 14 Jan 2020 at 19:26, Zoltan Farkas wrote:
> Makes sense, 
> 
> We have to agree on the scope of this implementation.
> 
> Right now the implementation I have in Java handles only the:
> 
> union {null, [some type]} situation.
> 
> Are we ok with this for a start?
> 
> I'm not sure that it's worth publishing a half-way solution, as if people 
> start using it and a fuller solution is implemented, there will be three 
> incompatible standards, which isn't ideal.
> 
> What I see more, is to handle:
> 
> 1) union {string, double} (although we have to specify behavior for NaN and 
> positive and negative infinity); union {string, boolean}; …
> 
> My thought, as mentioned at the beginning of this thread, is to omit the 
> wrapping when all the members of the union encode to distinct JSON token 
> types (the JSON token types being: null, boolean, string, number, object and 
> array).
> 
> I think that we could probably leave out explicit mention of NaN and 
> infinity, as that's an issue with schemas too, and there's no obviously good 
> solution. That said, if we did want to solve the issue of NaN and infinity in 
> the future, things might get awkward with respect to this thread's proposal, 
> because it's likely that the only reasonable way to solve that issue is to 
> encode NaN and infinity as "NaN" and "±Infinity", which means that the union 
> ["string", "float"] becomes ambiguous if we leave out the type name for that 
> case.
> 
> It seems that it's not unheard-of to use a string representation for these float 
> values (see https://issues.apache.org/jira/browse/AVRO-1290 
> <https://issues.apache.org/jira/browse/AVRO-1290>).
> 
> So perhaps we could define the format something like this:
>  
> JSON Encoding 
>  
> Except for unions, the JSON encoding is the same as is used to encode field 
> default values.
> The value of a union is encoded in JSON as follows:
> if all values of the union can be distinguished unambiguously (see below), 
> the JSON encoding is the same as is used to encode field default values for 
> the type
> otherwise it is encoded as a JSON object with one name/value pair whose name 
> is the type's name and whose value is the recursively encoded value. For 
> Avro's named types (record, fixed or enum) the user-specified name is used, 
> for other types the type name is used.
> Unambiguity is defined as follows: 
>  
> An Avro value can be encoded as one of a set of JSON types:
> null encodes as {null}
> boolean encodes as {boolean}
> int encodes as {number}
> long encodes as {number}
> float encodes as {number, string}
> double encodes as {number, string}
> bytes encodes as {string}
> string encodes as {string}
> any enum encodes as {string}
> any map encodes as {object}
> any record encodes as {object}
> A union is considered unambiguous if the JSON type sets for all the members 
> of the union form mutually disjoint sets. 
>  
> Note that float and double are considered ambiguous with respect to string 
> because in the future, Avro might support encoding NaN and infinity values as 
> strings.
> 
> WDYT?
> 
> 2) Make decimal a first-class Avro type. The current logical type approach is 
> not natural in JSON (see https://issues.apache.org/jira/browse/AVRO-2164). 
> 
> For 1.9.x, 2) is probably a non-starter
> 
> Yes, this sounds a bit out of scope to me. It would be nice if decimal values 
> were represented as a human-readable decimal number (possibly a JSON string 
> to survive round-trips), but that should perhaps be part of a larger change 
> to improve decimal support in general. Interestingly, if we were to be able 
> to represent decimal values as JSON numbers (for example when they're 
> unambiguously representable as such), that would fit fine with the above 
> description, because bytes would be considered ambiguous with respect to 
> float.
> 
>   cheers,
> rog.



Re: More idiomatic JSON encoding for unions

2020-01-14 Thread Zoltan Farkas
Makes sense, 

We have to agree on the scope of this implementation.

Right now the implementation I have in Java handles only the:

union {null, [some type]} situation.

Are we ok with this for a start?
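
For that case, a minimal sketch of what the stock encoder produces today versus
the proposed output (standard Avro APIs; the class name is mine):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class UnionJsonSketch {
  public static void main(String[] args) throws Exception {
    // ["null","string"]: null and string have distinct JSON token types.
    Schema union = new Schema.Parser().parse("[\"null\", \"string\"]");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    JsonEncoder enc = EncoderFactory.get().jsonEncoder(union, out);
    new GenericDatumWriter<Object>(union).write("foo", enc);
    enc.flush();
    // The stock encoder prints the wrapped form: {"string":"foo"}
    // The encoder discussed here would print just: "foo"
    System.out.println(out);
  }
}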

What I see more, is to handle:

1) union {string, double} (although we have to specify behavior for NaN and
positive and negative infinity); union {string, boolean}; …

2) Make decimal a first-class Avro type. The current logical type approach is
not natural in JSON (see https://issues.apache.org/jira/browse/AVRO-2164).

For 1.9.x, 2) is probably a non-starter

let me know.

—Z


> On Jan 14, 2020, at 12:09 PM, roger peppe  wrote:
> 
> 
> On Tue, 14 Jan 2020 at 15:00, Zoltan Farkas wrote:
> I can go ahead create a PR to add the Encoder/Decoder implementations.
> let me know if anyone else plans to do that. (to avoid wasting time)
> 
> Hi,
> 
> Before you do that, would it be possible to write a specification for exactly 
> what the conventions are and publish it somewhere? There are a bunch of edge 
> cases that could be done in different ways, I think.
> 
> That way people like me that don't use Java can implement the same spec. (and 
> also it's useful to know exactly what one is implementing before diving in 
> and writing the code :])
> 
>   cheers,
> rog.
> 
> 
> thanks
> 
> —Z
> 
>> On Jan 9, 2020, at 3:51 AM, Driesprong, Fokko wrote:
>> 
>> Thanks for chipping in Zoltan and Sean. I did not plan to change the current 
>> JSON encoder. My initial suggestion would make this an option that the user 
>> can set. The default will be the current situation, so nothing should change 
>> when upgrading to a newer version of Avro.
>> 
>> Cheers, Fokko
>> 
>> On Wed, 8 Jan 2020 at 21:39, Sean Busbey wrote:
>> I agree with Zoltan here. We have a really long history of maintaining 
>> compatibility for encoders.
>> 
>> On Tue, Jan 7, 2020 at 10:06 AM Zoltan Farkas wrote:
>> Fokko, 
>> 
>> I am not sure we should be changing the existing json encoder,
>> I think we should just add another encoder, and devs can use either one of 
>> them based on their use case… and stay backward compatible.
>> 
>> we should maybe standardize the content types for them… I have seen 
>> application/avro being used for binary, we could have for json:
>> application/avro+json for the current format, application/avro.2+json for 
>> the new format…. 
>> 
>> At some point in the future we could deprecate the old one…
>> 
>> —Z
>> 
>> 
>>> On Jan 7, 2020, at 2:41 AM, Driesprong, Fokko wrote:
>>> 
>>> I would be a great fan of this as well. This also bothered me. The tricky 
>>> part here is to see when to release this because it will break the existing 
>>> JSON structure. We could make this configurable as well.
>>> 
>>> Cheers, Fokko
>>> 
>>> On Mon, 6 Jan 2020 at 22:36, roger peppe wrote:
>>> That's great, thanks! I thought this would probably have come up before.
>>> 
>>> Have you written down your changes in a somewhat more formal specification 
>>> document, by any chance?
>>> 
>>>   cheers,
>>> rog.
>>> 
>>> 
>>> On Mon, 6 Jan 2020, 18:50 zoly farkas wrote:
>>> I think there is consensus that this should be implemented, see [AVRO-1582] 
>>> Json serialization of nullable fileds and fields with default values 
>>> improvement. - ASF JIRA <https://issues.apache.org/jira/browse/AVRO-1582>
>>> 
>>> 
>>> 
>>> Here is a live example to get some sample data in avro json: 
>>> https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson 
>>> and the "Natural" 
>>> https://demo.spf4j.org/example/records/1?_Accept=application/json 
>>> using 
>>> the encoder suggested as implementation in the jira.
>>> 
>>> Somebody needs to find the time to do the work to integrate this...
>>> 
>>> --Z
>>> 

Re: More idiomatic JSON encoding for unions

2020-01-14 Thread Zoltan Farkas
I can go ahead create a PR to add the Encoder/Decoder implementations.
let me know if anyone else plans to do that. (to avoid wasting time)

thanks

—Z

> On Jan 9, 2020, at 3:51 AM, Driesprong, Fokko  wrote:
> 
> Thanks for chipping in Zoltan and Sean. I did not plan to change the current 
> JSON encoder. My initial suggestion would make this an option that the user 
> can set. The default will be the current situation, so nothing should change 
> when upgrading to a newer version of Avro.
> 
> Cheers, Fokko
> 
> On Wed, 8 Jan 2020 at 21:39, Sean Busbey wrote:
> I agree with Zoltan here. We have a really long history of maintaining 
> compatibility for encoders.
> 
> On Tue, Jan 7, 2020 at 10:06 AM Zoltan Farkas wrote:
> Fokko, 
> 
> I am not sure we should be changing the existing json encoder,
> I think we should just add another encoder, and devs can use either one of 
> them based on their use case… and stay backward compatible.
> 
> we should maybe standardize the content types for them… I have seen 
> application/avro being used for binary, we could have for json:
> application/avro+json for the current format, application/avro.2+json for the 
> new format…. 
> 
> At some point in the future we could deprecate the old one…
> 
> —Z
> 
> 
>> On Jan 7, 2020, at 2:41 AM, Driesprong, Fokko wrote:
>> 
>> I would be a great fan of this as well. This also bothered me. The tricky 
>> part here is to see when to release this because it will break the existing 
>> JSON structure. We could make this configurable as well.
>> 
>> Cheers, Fokko
>> 
>> On Mon, 6 Jan 2020 at 22:36, roger peppe wrote:
>> That's great, thanks! I thought this would probably have come up before.
>> 
>> Have you written down your changes in a somewhat more formal specification 
>> document, by any chance?
>> 
>>   cheers,
>> rog.
>> 
>> 
>> On Mon, 6 Jan 2020, 18:50 zoly farkas wrote:
>> I think there is consensus that this should be implemented, see [AVRO-1582] 
>> Json serialization of nullable fileds and fields with default values 
>> improvement. - ASF JIRA <https://issues.apache.org/jira/browse/AVRO-1582>
>> 
>> 
>> 
>> Here is a live example to get some sample data in avro json: 
>> https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson 
>> and the "Natural" 
>> https://demo.spf4j.org/example/records/1?_Accept=application/json 
>> using 
>> the encoder suggested as implementation in the jira.
>> 
>> Somebody needs to find the time to do the work to integrate this...
>> 
>> --Z
>> 
>> 
>> 
>> 
>> On Monday, January 6, 2020, 12:36:44 PM EST, roger peppe wrote:
>> 
>> 
>> Hi,
>> 
>> The JSON encoding in the specification 
>> <https://avro.apache.org/docs/current/spec.html#json_encoding> includes an 
>> explicit type name for all kinds of object other than null. This means that 
>> a JSON-encoded Avro value with a union is very rarely directly compatible 
>> with normal JSON formats.
>> 
>> For example, it's very common for a JSON-encoded value to allow a value 
>> that's either null or string. In Avro, that's trivially expressed as the 
>> union type ["null", "string"]. With conventional JSON, a string value "foo" 
>> would be encoded just as "foo", which is easily distinguished from null when 
>> decoding. However when using the Avro JSON format it must be encoded as 
>> {"string": "foo"}.
>> 
>> This means that Avro JSON-encoded values don't interchange easily with other 
>> JSON-encoded values.
>> 
>> AFAICS the main reason that the type name is always required in JSON-encoded 
>> unions is to avoid ambiguity. This particularly applies to record and map 
>> types, where it's not possible in general to tell which member of the union 
>> has been specified by looking at the data itself.
>> 
>> However, that reasoning doesn't apply if all the members of the union can be 
>> distinguished from their JSON token type.

Re: More idiomatic JSON encoding for unions

2020-01-07 Thread Zoltan Farkas
Fokko, 

I am not sure we should be changing the existing json encoder;
I think we should just add another encoder, and devs can use either one of them
based on their use case… and stay backward compatible.

We should maybe standardize the content types for them… I have seen
application/avro being used for binary; we could have, for json,
application/avro+json for the current format and application/avro.2+json for the
new format….

At some point in the future we could deprecate the old one…

—Z


> On Jan 7, 2020, at 2:41 AM, Driesprong, Fokko  wrote:
> 
> I would be a great fan of this as well. This also bothered me. The tricky 
> part here is to see when to release this because it will break the existing 
> JSON structure. We could make this configurable as well.
> 
> Cheers, Fokko
> 
> On Mon, 6 Jan 2020 at 22:36, roger peppe wrote:
> That's great, thanks! I thought this would probably have come up before.
> 
> Have you written down your changes in a somewhat more formal specification 
> document, by any chance?
> 
>   cheers,
> rog.
> 
> 
> On Mon, 6 Jan 2020, 18:50 zoly farkas wrote:
> I think there is consensus that this should be implemented, see [AVRO-1582] 
> Json serialization of nullable fileds and fields with default values 
> improvement. - ASF JIRA 
> 
> 
> 
> Here is a live example to get some sample data in avro json: 
> https://demo.spf4j.org/example/records/1?_Accept=application/avro%2Bjson 
> 
> and the "Natural" 
> https://demo.spf4j.org/example/records/1?_Accept=application/json 
>  using the 
> encoder suggested as implementation in the jira.
> 
> Somebody needs to find the time to do the work to integrate this...
> 
> --Z
> 
> 
> 
> 
> On Monday, January 6, 2020, 12:36:44 PM EST, roger peppe wrote:
> 
> 
> Hi,
> 
> The JSON encoding in the specification 
>  includes an 
> explicit type name for all kinds of object other than null. This means that a 
> JSON-encoded Avro value with a union is very rarely directly compatible with 
> normal JSON formats.
> 
> For example, it's very common for a JSON-encoded value to allow a value 
> that's either null or string. In Avro, that's trivially expressed as the 
> union type ["null", "string"]. With conventional JSON, a string value "foo" 
> would be encoded just as "foo", which is easily distinguished from null when 
> decoding. However when using the Avro JSON format it must be encoded as 
> {"string": "foo"}.
> 
> This means that Avro JSON-encoded values don't interchange easily with other 
> JSON-encoded values.
> 
> AFAICS the main reason that the type name is always required in JSON-encoded 
> unions is to avoid ambiguity. This particularly applies to record and map 
> types, where it's not possible in general to tell which member of the union 
> has been specified by looking at the data itself.
> 
> However, that reasoning doesn't apply if all the members of the union can be 
> distinguished from their JSON token type.
> 
> I am considering using a JSON encoding that omits the type name when all the 
> members of the union encode to distinct JSON token types (the JSON token 
> types being: null, boolean, string, number, object and array).
> 
> For example, JSON-encoded values using the Avro schema ["null", "string", 
> "int"] would encode as the literal values themselves (e.g. null, "foo", 999), 
> but JSON-encoded values using the Avro schema ["int", "double"] would require 
> the type name because the JSON lexeme doesn't distinguish between different 
> kinds of number.
> 
> This would mean that it would be possible to represent a significant subset 
> of "normal" JSON schemas with Avro. It seems to me that would potentially be 
> very useful.
> 
> Thoughts? Is this a really bad idea to be contemplating? :)
> 
>   cheers,
> rog.
> 
> 



Re: name-agnostic schema resolution (a.k.a. structural subtyping?)

2019-12-20 Thread Zoltan Farkas
Hi Roger,

Have you considered leveraging Avro logical types and keeping the payload and
event metadata “separate”?

Here is an example (I will use Avro IDL, since that is more readable to me :-) ):

record MetaData {
  @logicalType("instant") string timeStamp;
  // ... all the other metadata fields ...
}

record CloudEvent {
  MetaData metaData;
  Any payload;
}

@logicalType("any")
record Any {
  /** Here you have the schema of the data; for efficiency you can use a
      schema id + schema repo, or something like
      https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences */
  string schema;
  bytes data;
}

This way a system that is interested only in the metadata does not even have to
deserialize the payload…
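
A minimal sketch of such a metadata-only consumer, assuming CloudEvent,
MetaData and Any classes generated from the IDL above (getter names follow the
avro-compiler conventions; messageBytes is a serialized event):

import java.io.IOException;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

public class MetadataOnlyConsumer {
  static MetaData readMetadata(byte[] messageBytes) throws IOException {
    SpecificDatumReader<CloudEvent> reader =
        new SpecificDatumReader<>(CloudEvent.class);
    CloudEvent event = reader.read(null,
        DecoderFactory.get().binaryDecoder(messageBytes, null));
    // The payload's `data` field stays opaque bytes; only consumers that
    // care resolve its `schema` field and decode the bytes.
    return event.getMetaData();
  }
}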

hope it helps.

—Z


> On Dec 18, 2019, at 11:49 AM, roger peppe  wrote:
> 
> Hi,
> 
> Background: I've been contemplating the proposed Avro format in the 
> CloudEvent specification 
> , which 
> defines standard metadata for events. It defines a very generic format for an 
> event that allows storage of almost any data. It seems to me that by going in 
> that direction it's losing almost all the advantages of using Avro in the 
> first place. It feels like it's trying to shoehorn a dynamic message format 
> like JSON into the Avro format, where using Avro itself could do so much 
> better.
> 
> I'm hoping to propose something better. I had what I thought was a nice idea, 
> but it doesn't quite work, and I thought I'd bring up the subject here and 
> see if anyone had some better ideas.
> 
> The schema resolution 
>  part of 
> the spec allows a reader to read a schema that was written with extra fields. 
> So, theoretically, we could define a CloudEvent something like this:
> 
> {
> "name": "CloudEvent",
> "type": "record",
> "fields": [{
> "name": "Metadata",
> "type": {
> "type": "record",
> "name": "CloudEvent",
> "namespace": "avro.apache.org ",
> "fields": [{
> "name": "id",
> "type": "string"
> }, {
> "name": "source",
> "type": "string"
> }, {
> "name": "time",
> "type": "long",
> "logicalType": "timestamp-micros"
> }]
> }
> }]
> }
> 
> Theoretically, this could enable any event that's a record that has at least 
> a Metadata field with the above fields to be read generically. The CloudEvent 
> type above could be seen as a structural supertype of all possible 
> more-specific CloudEvent-compatible records that have such a compatible field.
> 
> This has a few nice advantages:
> - there's no need for any wrapping of payload data.
> - the CloudEvent type can evolve over time like any other Avro type.
> - all the data message fields are immediately available alongside the 
> metadata.
> - there's still exactly one schema for a topic, encapsulating both the 
> metadata and the payload.
> 
> However, this idea fails because of one problem - this schema resolution 
> rule: "both schemas are records with the same (unqualified) name". This means 
> that unless everyone names all their CloudEvent-compatible records 
> "CloudEvent", they can't be read like this.
> 
> I don't think people will be willing to name all their records "CloudEvent", 
> so we have a problem.
> 
> I can see a few possible workarounds:
> 1. when reading the record as a CloudEvent, read it with a schema that's the 
> same as CloudEvent, but with the top level record name changed to the top 
> level name of the schema that was used to write the record.
> 2. ignore record names when matching schema record types.
> 3. allow aliases to be specified when writing data as well as reading it. When 
> defining a CloudEvent-compatible event, you'd add a CloudEvent alias to your 
> record.
> None of the options are particularly nice. 1 is probably the easiest to do, 
> although means you'd still need some custom logic when decoding records, 
> meaning you couldn't use stock decoders.
> 
> I like the idea of 2, although it gets a bit tricky when dealing with union 
> types. You could define the matching such that it ignores names only when the 
> two matched types are unambiguous (i.e. only one record in both). This could 
> be implemented as an option ("use structural typing") when decoding.
> 
> 3 is probably cleanest but interacts significantly with the spec (for 
> example, the canonical schema transformation strips aliases out, but they'd 
> need to be retained).
> 
> Any thoughts? Is this a silly thing to be contemplating? Is there a better 
> way?
> 
>   cheers,
> rog.

Re: Deserialize list of JSON-encoded records with evolved Schema

2019-12-02 Thread Zoltan Farkas
The error suggests that you are attempting to parse a message encoded with
TestRecordV1 while using TestRecordV2 as the writer schema instead of
TestRecordV1.


Make sure that when you deserialize a TestRecordV1 array into a TestRecordV2
array, you initialize your JSON decoder with the writer schema, not the reader
one:

>  Decoder decoder = DECODER_FACTORY.jsonDecoder(writeArrSchema, jsonArrayString);
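
Put together (a minimal sketch reusing the names from the code quoted below),
the corrected decode path is:

Schema writeArrSchema = Schema.createArray(writeSchema);
Decoder decoder = DECODER_FACTORY.jsonDecoder(writeArrSchema, jsonArrayString);
// the reader still performs writer -> reader schema resolution:
DatumReader<T> datumReader = new SpecificDatumReader<>(writeSchema, readSchema);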

hope it helps.

—Z



> On Dec 1, 2019, at 8:16 PM, Austin Cawley-Edwards  
> wrote:
> 
> Hi,
> 
> We are trying to encode a list of records into JSON with one schema and then 
> decode the list into Avro objects with a compatible schema. The schema 
> resolution between the two schemas works for single records, but the 
> deserialization fails when the read schema differs from the write. 
> Deserialization works, however, when the same schema is used for both.
> 
> When decoding, an exception is thrown:
> 
> org.apache.avro.AvroTypeException: Attempt to process a item-end when a int 
> was expected.
>org.apache.avro.io.parsing.Parser.advance(Parser.java:93)
>org.apache.avro.io.JsonDecoder.advance(JsonDecoder.java:139)
>org.apache.avro.io.JsonDecoder.arrayNext(JsonDecoder.java:360)
> 
> It seems like the decoder is not moving the proper number of bytes down to 
> read the next element.
> 
> We encode like so:
> 
> public static <T> String toJSONArrayString(List<T> avroRecords, Schema schema) throws IOException {
> 
>   if (avroRecords == null || avroRecords.isEmpty()) {
> return "[]";
>   }
> 
>   ByteArrayOutputStream baos = new ByteArrayOutputStream();
>   Encoder encoder = ENCODER_FACTORY.jsonEncoder(Schema.createArray(schema), 
> baos);
>   DatumWriter<T> datumWriter = avroRecords.get(0) instanceof SpecificRecord
>   ? new SpecificDatumWriter<>(schema)
>   : new GenericDatumWriter<>(schema);
> 
>   encoder.writeArrayStart();
>   encoder.setItemCount(avroRecords.size());
>   for (T record : avroRecords) {
> encoder.startItem();
> datumWriter.write(record, encoder);
>   }
>   encoder.writeArrayEnd();
>   encoder.flush();
> 
>   return baos.toString();
> }
> 
> And decode similarly:
> public static <T> List<T> fromJSONArrayString(String jsonArrayString, Schema writeSchema, Schema readSchema) throws IOException {
>   Schema readArrSchema = Schema.createArray(readSchema);
>   Decoder decoder = DECODER_FACTORY.jsonDecoder(readArrSchema, 
> jsonArrayString);
>   DatumReader<T> datumReader;
>   if (writeSchema.equals(readSchema)) {
> datumReader = new SpecificDatumReader<>(readSchema);
>   } else {
> datumReader = new SpecificDatumReader<>(writeSchema, readSchema);
>   }
> 
>   List<T> avroRecords = new ArrayList<>();
>   for (long i = decoder.readArrayStart(); i != 0; i = decoder.arrayNext()) {
> for (long j = 0; j < i; j++) {
>   avroRecords.add(datumReader.read(null, decoder));
> }
>   }
> 
>   return avroRecords;
> }
> 
> 
> Our two schemas look like:
> {
>   "type": "record",
>   "name": "TestRecordV1",
>   "fields": [
> {
>   "name": "text",
>   "type": "string"
> }
>   ]
> }
> {
>   "type": "record",
>   "name": "TestRecordV2",
>   "fields": [
> {
>   "name": "text",
>   "type": "string"
> },
> {
>   "name": "number",
>   "type": "int",
>   "default": 0
> }
>   ]
> }
> 
> 
> Is there something simple we are missing or is it not possible to do schema 
> resolution dynamically on an entire array?
> 
> Thank you!
> Austin
> 
> 



Re: Should a Schema be serializable in Java?

2019-07-18 Thread Zoltan Farkas
LGTM

> On Jul 18, 2019, at 8:24 AM, Driesprong, Fokko  wrote:
> 
> Thank you Ryan, I have a few comments on Github. Looks good to me.
> 
> Cheers, Fokko
> 
> On Thu, 18 Jul 2019 at 11:58, Ryan Skraba wrote:
> Hello!  I'm motivated to see this happen :D
> 
> +Zoltan, the original author.  I created a PR against apache/avro master 
> here: https://github.com/apache/avro/pull/589 
> 
> 
> I cherry-picked the commit from your fork, and reapplied spotless/checkstyle. 
>  I hope this is the correct way to preserve authorship and that I'm not 
> stepping on any toes!
> 
> Can someone take a look at the above PR?  
> 
> Best regards, 
> 
> Ryan
> 
> On Tue, Jul 16, 2019 at 11:58 AM Ismaël Mejía wrote:
> Yes probably it is overkill to warn given the examples you mention.
> Also your argument towards reusing the mature (and battle tested)
> combination of Schema.Parser + String serialization makes sense.
> 
> Adding this to 1.9.1 will be an extra selling point for projects
> wanting to migrate to the latest version of Avro so it sounds good to
> me but you should add it to master and then we can cherry pick it from
> there.
> 
> 
> On Tue, Jul 16, 2019 at 11:16 AM Ryan Skraba wrote:
> >
> > Hello!  Thanks for the reference to AVRO-1852. It's exactly what I was 
> > looking for.
> >
> > I agree that Java serialization shouldn't be used for anything 
> > cross-platform, or (in my opinion) used for any data persistence at all.  
> > Especially not for an Avro container file or sending binary data through a 
> > messaging system...
> >
> > But Java serialization is definitely useful and used for sending instances 
> > of "distributed work" implemented in Java from node to node in a cluster.  
> > I'm not too worried about existing connectors -- we can see that each 
> > framework has "solved" the problem one at a time.  In addition to Flink, 
> > there's 
> > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroUtils.java#L29
> > and 
> > https://github.com/apache/spark/blob/3663dbe541826949cecf5e1ea205fe35c163d147/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala#L35
> > .
> >
> > Specifically, I see the advantage for user-defined distributed functions 
> > that happen to carry along an Avro Schema -- and I can personally say that 
> > I've encountered this a lot in our code!
> >
> > That being said, I think it's probably overkill to warn the user about the 
> > perils of Java serialization (not being cross-language and requiring 
> > consistent JDKs and libraries across JVMs).  If an error occurs for one of 
> > those reasons, there's a larger problem for the dev to address, and it's 
> > just as likely to occur for any Java library in the job if the environment 
> > is bad.  Related, we've encountered similar issues with logical types 
> > existing in Avro 1.8 in the driver but not in Avro 1.7 on the cluster... 
> > the solution is "make sure you don't do that".  (Looking at you, guava and 
> > jackson!)
> >
> > The patch in question delegates serialization to the string form of the 
> > schema, so it's basically doing what all of the above Avro "holders" are 
> > doing -- I wouldn't object to having a sample schema available that fully 
> > exercises what a schema can hold, but I also think that Schema.Parser (used 
> > underneath) is currently pretty well tested and mature!
> >
> > Do you think this could be a candidate for 1.9.1 as a minor improvement?  I 
> > can't think of any reason that this wouldn't be backwards compatible.
> >
> > Ryan
> >
> > side note: I wrote java.lang.Serializable earlier, which probably didn't 
> > help my search for prior discussion... :/
> >
> >> On Tue, Jul 16, 2019 at 9:59 AM Ismaël Mejía wrote:
> >>
> >> This is a good idea even if it may have some issues that we should
> >> probably document and warn users about:
> >>
> >> 1. Java based serialization is really practical for JVM based systems,
> >> but we should probably add a warning or documentation because Java
> >> serialization is not deterministic between JVMs so this could be a
> >> source for issues (usually companies use the same version of the JVM
> >> so this is less critical, but this still can happen specially now with
> >> all the different versions of Java and OpenJDK based flavors).
> >>
> >> 2. This is not cross language compatible, the String based
> >> representation (or even an Avro based representation of Schema) can be
> >> used in every language.
> >>
> >> Even with these I think 

Re: Aliases with Forward Compatibility

2019-06-16 Thread Zoltan Farkas
Fork is here: https://github.com/zolyfarkas/avro
When I have time, I try to do PRs against the official repo to make sure things
are not too far apart.
At minimum I make sure I file a JIRA so that when I have time I can work on
a PR. I appreciate any help with PRs against the official repo.

Currently, field name aliases should work the same as in the official repo.
Enum symbol aliases are something that exists only in my fork.

let me know if you have any questions.

—Z

> On Jun 15, 2019, at 3:18 PM, Aaron Dixon  wrote:
> 
> Thank you Zoltan. Is your fork publicly available, could I take a look at it?
> 
> On Sat, Jun 15, 2019 at 5:30 AM Zoltan Farkas wrote:
> I agree with your understanding of how aliases should work, and a lot of 
> developers I interact with expect that aliases should work this way.
> When I implemented https://issues.apache.org/jira/browse/AVRO-1752 in my 
> avro fork, I implemented the resolution the way you describe it.
> 
> I see no reason why this could not be implemented as part of 2.0… but I would 
> let others  with more authority chime in.
> 
> —Z
> 
> 
> 
>> On Jun 14, 2019, at 10:01 PM, Aaron Dixon wrote:
>> 
>> I asked this question on the dev list, but didn't get a response here. (My 
>> original question to the dev list: 
>> https://sematext.com/opensee/m/F2svI1cI2oW1CwdmF1?subj=readers+using+writer+s+aliases+
>> )
>> 
>> It also seems this question was asked before in late 2018, but dead-ended at 
>> https://sematext.com/opensee/m/Avro/F2svI1obxDi4WGqf1?subj=Re+Alias+with+Backward+Compatibility
>> 
>> Avro aliases are typically used by *reader* schemas to rename fields. (I.e., 
>> readers can expect "first-name" string and use an alias "firts-name" to deal 
>> with old writer's that had it mispelled in the original writer schema.) This 
>> is backwards compatibility (new readers can read old writers).
>> 
>> However we would like to not have to update reader code to deal with new 
>> writers (ie we want *forward* compatibility with aliases). It seems that 
>> this should be easy: (old) readers could look at the new writer-defined 
>> aliases and leverage them for forward compatibility while doing schema 
>> resolution.
>> 
>> Concrete example: my old schema expects "firts-name"; my new schema fixes 
>> this by introducing "first-name" with the "firts-name" as an alias. Instead 
>> of being obligated to update my old reader(s), couldn't the schema 
>> resolution logic notice this alias and *invert* the aliasing as it reads the 
>> data to give the old reader the field it expects?
>> 
>> Is there a fundamental reason that this isn't part of the avro java impl, or 
>> spec documentation? Not having to coordinate updates to readers during 
>> schema evolution (field renames) would be a huge win imo.
> 



Re: Aliases with Forward Compatibility

2019-06-15 Thread Zoltan Farkas
I agree with your understanding of how aliases should work, and a lot of 
developers I interact with expect that aliases should work this way.
When I implemented https://issues.apache.org/jira/browse/AVRO-1752 in my avro
fork, I implemented the resolution the way you describe it.

I see no reason why this could not be implemented as part of 2.0… but I would
let others with more authority chime in.
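
For reference, the backward direction (which stock Avro already supports) looks
like this; a minimal sketch (note that Avro names cannot contain '-', so the
misspelling example uses underscores):

import org.apache.avro.Schema;

public class AliasSketch {
  public static void main(String[] args) {
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":"
        + "[{\"name\":\"firts_name\",\"type\":\"string\"}]}");
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Person\",\"fields\":"
        + "[{\"name\":\"first_name\",\"aliases\":[\"firts_name\"],"
        + "\"type\":\"string\"}]}");
    // Resolution applies the READER's aliases to the writer schema, so old
    // data resolves to the new field name:
    System.out.println(Schema.applyAliases(writer, reader));
    // The forward case discussed here (an old reader honoring a NEW writer's
    // aliases) has no stock equivalent; it needs the inverse mapping.
  }
}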

—Z



> On Jun 14, 2019, at 10:01 PM, Aaron Dixon  wrote:
> 
> I asked this question on the dev list, but didn't get a response here. (My 
> original question to the dev list: 
> https://sematext.com/opensee/m/F2svI1cI2oW1CwdmF1?subj=readers+using+writer+s+aliases+
>  
> )
> 
> It also seems this question was asked before in late 2018, but dead-ended at 
> https://sematext.com/opensee/m/Avro/F2svI1obxDi4WGqf1?subj=Re+Alias+with+Backward+Compatibility
>  
> 
> 
> Avro aliases are typically used by *reader* schemas to rename fields. (I.e., 
> readers can expect a "first-name" string and use an alias "firts-name" to deal 
> with old writers that had it misspelled in the original writer schema.) This 
> is backwards compatibility (new readers can read old writers).
> 
> However we would like to not have to update reader code to deal with new 
> writers (ie we want *forward* compatibility with aliases). It seems that this 
> should be easy: (old) readers could look at the new writer-defined aliases 
> and leverage them for forward compatibility while doing schema resolution.
> 
> Concrete example: my old schema expects "firts-name"; my new schema fixes 
> this by introducing "first-name" with the "firts-name" as an alias. Instead 
> of being obligated to update my old reader(s), couldn't the schema resolution 
> logic notice this alias and *invert* the aliasing as it reads the data to 
> give the old reader the field it expects?
> 
> Is there a fundamental reason that this isn't part of the avro java impl, or 
> spec documentation? Not having to coordinate updates to readers during schema 
> evolution (field renames) would be a huge win imo.



Re: is it possible to deserialize JSON with optional field?

2019-04-16 Thread Zoltan Farkas
Please don’t insinuate any “promotion” efforts from my side. 
I am simply trying to help, and insinuations like this can only hinder 
participation and not help the project in any way.

You don’t have to use my fork, just lift the code out into your project or your 
fork. (Just look at it as a “gist”).

If you want this to be part of the Avro library, you can contribute a PR.
There is a JIRA for this already:
https://issues.apache.org/jira/browse/AVRO-1582

I can speak from experience that the community is welcoming to 
contributors/contributions.

—Z


> On Apr 16, 2019, at 5:54 AM, Martin Mucha  wrote:
> 
> Hi, thanks for responding.
> 
> I know that you promote your fork; however, considering I might not be able to 
> move away from the "official release", is there an easy way to consume this? 
> Since I cannot see it ...
> 
> Maybe a side question: official Avro seems to be dead. There are some commits 
> being made, but the last release happened 2 years ago, fatal flaws are not being 
> addressed, and almost-10-year-old valid bug reports are just ignored, ... Does 
> anyone know of any sign/confirmation that the Avro community will be moving 
> toward something more viable?
> 
> M.
> 
> On Mon, 15 Apr 2019 at 15:17, Zoltan Farkas wrote:
> It is possible to do it with a custom JsonDecoder.
> 
> I wrote one that does this at:  
> https://github.com/zolyfarkas/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/io/ExtendedJsonDecoder.java
> 
> hope it helps.
> 
> 
> —Z
> 
>> On Apr 13, 2019, at 9:24 AM, Martin Mucha wrote:
>> 
>> Hi, 
>> 
>> is it possible by design to deserialize JSON with schema having optional 
>> value?
>> Schema:
>> 
>> {
>>  "type" : "record",
>>  "name" : "UserSessionEvent",
>>  "namespace" : "events",
>>  "fields" : [ {
>>"name" : "username",
>>"type" : "string"
>>  }, {
>>"name" : "errorData",
>>"type" : [ "null", "string" ],
>>"default" : null
>>  }]
>> }
>> Value to deserialize:
>> {"username" : "2271AE67-34DE-4B43-8839-07216C5D10E1"}
>> I also tried to change the order of the types, but that changed nothing. I know I can 
>> produce ill-formatted JSON which could be deserialized, but that's not 
>> acceptable. AFAIK the given JSON with required `username` and optional 
>> `errorData` cannot be deserialized by design. Am I right?
>> 
>> thanks.
> 



Re: is it possible to deserialize JSON with optional field?

2019-04-15 Thread Zoltan Farkas
It is possible to do it with a custom JsonDecoder.

I wrote one that does this at:  
https://github.com/zolyfarkas/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/io/ExtendedJsonDecoder.java

hope it helps.


—Z

> On Apr 13, 2019, at 9:24 AM, Martin Mucha  wrote:
> 
> Hi, 
> 
> is it possible by design to deserialize JSON with schema having optional 
> value?
> Schema:
> 
> {
>   "type" : "record",
>   "name" : "UserSessionEvent",
>   "namespace" : "events",
>   "fields" : [ {
>     "name" : "username",
>     "type" : "string"
>   }, {
>     "name" : "errorData",
>     "type" : [ "null", "string" ],
>     "default" : null
>   } ]
> }
> Value to deserialize:
> {"username" : "2271AE67-34DE-4B43-8839-07216C5D10E1"}
> I also tried to change the order of the types, but that changed nothing. I know I can 
> produce ill-formatted JSON which could be deserialized, but that's not 
> acceptable. AFAIK the given JSON with required `username` and optional 
> `errorData` cannot be deserialized by design. Am I right?
> 
> thanks.



Re: new release with fix for AVRO-1723?

2018-12-02 Thread Zoltan Farkas
I am building my avro fork (https://github.com/zolyfarkas/avro) and publishing 
it to bintray: https://bintray.com/zolyfarkas/core/avro/1.8.1.25p … it is one 
option you have that is not difficult to implement (took me about 1h)...

I am taking this approach with most open source libraries…
Depending on the open source project, the turnaround for a defect/enhancement can 
be anywhere between 1 week and 1 year, and my deadlines are not that generous...

hope it helps...

—Z


> On Nov 30, 2018, at 6:20 PM, David Carlton  wrote:
> 
> I'm running into https://issues.apache.org/jira/browse/AVRO-1723 (forward 
> declarations in Avro IDL), and I'm wondering what the timing is for a release 
> that contains that fix?  I see https://issues.apache.org/jira/browse/AVRO-2163 
> for releasing 1.8.3, but
> it's not clear from that Jira what the timeline is for 1.8.3 and whether it 
> will contain a fix for AVRO-1723.  So I'm trying to figure out if I should 
> generate my own local build of Avro containing the patch for AVRO-1723, or if 
> I should just write that protocol in JSON instead of IDL and then switch it 
> over to IDL once 1.8.3 is released.
> 
> Thanks for any advice you have,
> David Carlton
> carl...@sumologic.com 


Re: Enum order change

2018-05-30 Thread Zoltan Farkas
They will be compatible as long as you use the writer and reader schemas to 
decode the message...
(The symbol names are used to map between your writer and reader schema…)

—Z

> On May 30, 2018, at 8:35 AM, Arnaud BOS  wrote:
> 
> Quick question: is reordering the values of an enum a backward compatible 
> change?
> 
> Per the Avro 1.8.2 documentation: "An enum is encoded by a int, representing 
> the zero-based position of the symbol in the schema."
> 
> If my enum is: 
> {"type": "enum", "name": "Foo", "symbols": ["A", "B", "C", "D"] }
> And I update it to
> {"type": "enum", "name": "Foo", "symbols": ["A", "B", "C", "E", "F", "D"] }
> 
> If a writer with the older schema produces messages using enum value "D" 
> encoded as int 3, then a reader with the newer schema will consume messages 
> with value 3 and decode it as enum value "E".
> 
> If this is the case, then reordering enums is just as non-backward compatible 
> as adding new values is non-forward compatible; is that correct?
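
A quick sketch of the answer above against the plain Java API, using the schemas
from the question: decoding with both the writer and reader schemas resolves "D"
by symbol name, so the reordered reader still sees "D". Decoding the same bytes
with only the new schema is what would misread index 3 as "E".

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EnumResolutionSketch {
  public static void main(String[] args) throws Exception {
    Schema oldSchema = new Schema.Parser().parse(
        "{\"type\":\"enum\",\"name\":\"Foo\",\"symbols\":[\"A\",\"B\",\"C\",\"D\"]}");
    Schema newSchema = new Schema.Parser().parse(
        "{\"type\":\"enum\",\"name\":\"Foo\",\"symbols\":[\"A\",\"B\",\"C\",\"E\",\"F\",\"D\"]}");

    // Write "D" with the old schema; it is encoded as index 3.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<Object>(oldSchema)
        .write(new GenericData.EnumSymbol(oldSchema, "D"), enc);
    enc.flush();

    // Read with both schemas: resolution matches symbols by name, not position.
    BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    Object decoded =
        new GenericDatumReader<Object>(oldSchema, newSchema).read(null, dec);
    System.out.println(decoded); // prints D, not E
  }
}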



Re: Enums limited to symbols

2018-05-30 Thread Zoltan Farkas
I view this as a current implementation limitation… 
which we eliminated in our avro fork by adding the ability to map an arbitrary 
string value to a symbol like:

@stringSymbols({"MY_SYMB": "MY NON ID COMPLIANT VALUE"})
enum MyEnum {
 A, B, MY_SYMB
}

there are probably other ways to "resolve" this...

—Z



> On May 30, 2018, at 9:53 AM, Michael A. Smith  wrote:
> 
> Hi, we recently ran into AVRO-2174 in the sense that we were using Python 
> and had enums with spaces in some symbols. Now we have to do an integration with 
> a Java avro system and have to make some hard choices because Java won't 
> accept our enums.
> 
> It would help me make some of these choices if I understood whether there is a 
> technical motivation for not allowing enum symbols to be more arbitrary 
> strings. I have read AVRO-1725, but IIUC the decision was made to have the 
> spec follow the Java behavior (introducing this break with other 
> implementations) starting in 1.8.0.
> 
> So, digging deeper, does the data or encoding need this behavior? I'm 
> thinking along these lines:
> 
> - Are we expected to only use the enum symbols to refer to other names 
> defined in the schema?
> - Do spaceless strings lend themselves to a more compact binary encoding?
> 
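
For what it's worth, a tiny sketch of the Java-side behavior under discussion (the
schema string is made up; as of 1.8.x the parser validates enum symbols as Avro
names, so a symbol containing a space is rejected at parse time):

import org.apache.avro.Schema;
import org.apache.avro.SchemaParseException;

public class EnumSymbolCheck {
  public static void main(String[] args) {
    String withSpace =
        "{\"type\":\"enum\",\"name\":\"Status\",\"symbols\":[\"NOT OK\",\"OK\"]}";
    try {
      new Schema.Parser().parse(withSpace);
      System.out.println("accepted"); // older Python implementations got here
    } catch (SchemaParseException e) {
      System.out.println("rejected by the Java parser: " + e.getMessage());
    }
  }
}

Schema.Parser#setValidate(false) relaxes this check at parse time, but generated
Java enum classes would still be constrained by Java identifier rules.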



Re: Tool for creating uml diagrams

2018-04-09 Thread Zoltan Farkas
I stumbled a while ago across: https://github.com/malisas/schema-uml

there might be more tools out there...

—Z

> On Apr 9, 2018, at 4:46 PM, David Espinosa  wrote:
> 
> Hi Mark,
> Thanks for your response. 
> StarUML is a great tool for creating uml diagrams, but what I want is to be 
> able to automatically update my data model diagram once my avro files (which 
> are the "source of truth") change. 
> 
> Thanks again!
> 
> 2018-04-09 19:23 GMT+02:00 Mark Grey:
> Don't have one specifically for avro files, but StarUML is the best tool I've 
> found for just creating the diagrams from a UI.
> 
> http://staruml.io/ 
> 
> On Mon, Apr 9, 2018 at 1:15 PM, David Espinosa wrote:
> Hi all,
> I was wondering if somebody knows about some tool to create uml diagrams from 
> avro files.
> 
> Thanks in advance!
> David
> 
> 



Re: Avro schema properties contention on multithread read

2017-07-08 Thread Zoltan Farkas
The order of attributes in JSON might matter as far as I can remember, so 
LinkedHashMap might not be replaceable with a ConcurrentHashMap.
Plus, ConcurrentHashMap is not exactly without concurrency overhead…
I wrote a util that creates an immutable schema: 
https://github.com/zolyfarkas/spf4j/blob/master/spf4j-avro/src/main/java/org/spf4j/avro/schema/Schemas.java#L26
But you would have to use it in conjunction with an unsynchronized avro 
implementation (which I do in my fork, and you can do as well).

I wonder if there is interest in merging this into the avro lib someday.

—Z


> On Jul 6, 2017, at 12:20 PM, f...@legsem.com wrote:
> 
> On 05.07.2017 21:53, Zoltan Farkas wrote:
> 
>> The synchronization in JsonProperties is currently inconsistent (see 
>> getObjectProps()), which makes the current implementation @NotThreadSafe
>>  
>> I think it would probably be best to remove synchronization from those 
>> methods... and add @NotThreadSafe to the class...
>> Utilities like Schemas.synchronizedSchema(...) and 
>> Schemas.unmodifiableSchema(...) could be added to help with various use 
>> cases...
>>  
>>  
>> —Z
>>  
> Thank you for your reply. I like your Schemas.unmodifiableSchema(...) a lot.
> 
> While what you are describing would be ideal, a simpler solution might be to 
> change the LinkedHashMap that backs jsonProperties into something like a 
> ConcurrentHashMap, avoiding the need for synchronization.
> 
> This being said, ConcurrentHashMap itself does not preserve insertion order, 
> so it's not a drop-in replacement for LinkedHashMap.
> 



Re: Avro schema properties contention on multithread read

2017-07-05 Thread Zoltan Farkas
The synchronization in JsonProperties is currently inconsistent (see 
getObjectProps()), which makes the current implementation @NotThreadSafe.

I think it would probably be best to remove synchronization from those methods… 
and add @NotThreadSafe to the class…
Utilities like Schemas.synchronizedSchema(…) and Schemas.unmodifiableSchema(…) 
could be added to help with various use cases…


—Z


> On Jun 29, 2017, at 2:21 AM, f...@legsem.com wrote:
> 
> Hello,
> 
> We are using Avro Schema properties and while running concurrent tests, we 
> noticed a lot of contention on org.apache.avro.JsonProperties#getJsonProp.
> 
> In the attached screen shot, we have 4 concurrent threads all sharing the 
> same avro schema and reading from it simultaneously.
> 
> On this screen shot each red period is a contention between threads. Most of 
> these contentions are on getJsonProp.
> 
> This is due to getJsonProp being a synchronized method.
> 
> We have tried avro 1.7.7, 1.8.1 and 1.8.2. All have this problem (getJsonProp 
> is deprecated in 1.8 but the replacement method is also synchronized).
> 
> We can work around this by not sharing the avro schemas between threads 
> (using ThreadLocal for instance) but this is ugly.
> 
> It seems that avro schemas are mostly immutable, which is great for 
> multithread read access, but it turns out Properties within these schemas are 
> mutable and, since they are stored in a LinkedHashMap, synchronization is 
> necessary.
> 
> Anyone having a similar issue?
> 
> Thank you



Re: Avro as a foundation of a JSON based system

2016-11-18 Thread Zoltan Farkas
I recall that it would fail if you have extra fields in the json that are not 
defined in the reader schema and not in the writer schema.
Let me look into it and I will get back to you.

—Z


> On Nov 18, 2016, at 7:21 AM, Josh <jof...@gmail.com> wrote:
> 
> Hi Zoltan,
> 
> Your ExtendedJsonDecoder / Encoder looks really useful for doing the 
> conversions between JSON and Avro.
> 
> I just have a quick question -  when I use the ExtendedJsonDecoder with a 
> GenericDatumReader, I get an AvroTypeException whenever the JSON doesn't 
> conform to the Avro schema (as expected). However, if the JSON has some 
> additional fields (i.e. fields that are present in the JSON, but not present 
> in the Avro schema), then the reader ignores those extra fields and converts 
> the JSON to Avro successfully. Do you know if there's a simple way to make 
> the reader detect these extra fields, and throw an exception in that case?
> 
> Thanks,
> Josh
> 
> On Thu, Aug 11, 2016 at 3:52 PM, Zoltan Farkas <zolyfar...@yahoo.com> wrote:
> We are doing the same successfully so far… here is some detail:
> 
> we do not use the standard JSON Encoders/Decoders from the avro project and 
> we have our own which provide a more “natural” JSON encoding that implements:
> 
> https://issues.apache.org/jira/browse/AVRO-1582 
> 
> For us it was also important to fix:
> 
> https://issues.apache.org/jira/browse/AVRO-1723 
> 
> We had to use our own fork to be able to fix/implement our needs faster, 
> which you can look at: https://github.com/zolyfarkas/avro 
> 
> Here is how we use the avro schemas:
> 
> We develop our avro schemas in separate projects, "schema projects".
> 
> These projects are standard maven projects, stored in version control, built 
> with CI, and publishing the following to a maven repo:
> 1) avro generated java objects, sources and javadoc.
> 2) c# generated objects. (accessible with NuGet to everybody)
> 3) zip package containing all schemas.
> 
> We use avro IDL to define the schemas in the project; the avsc json format is 
> difficult to read and maintain, so the schema json is only a wire format for us.
> 
> We see these advantages:
> 
> 1) Building/Releasing a schema project is identical with releasing any maven 
> project. (Jenkins, maven release plugin...)
> 2) Using this we can take advantage of the maven dependency system and reuse 
> schemas. It is as simple as adding a <dependency> in your pom and an import 
> statement in your idl. (C# uses NuGet)
> 3) As a side result our maven repo becomes a schema repo. And so far we see 
> no reason to use a dedicated schema repo like: 
> https://issues.apache.org/jira/browse/AVRO-1124 
> 4) the schema owner not only publishes schemas but also publishes all DTOs for 
> java and .NET; this way any team that needs to use the schema has no need to 
> generate code, all they need is to add a package dependency to their project.
> 5) During the build we also validate compatibility with the previously 
> released schemas.
> 6) During the build we also validate schema quality. (comments on fields, 
> naming…). We are planning to make this maven plugin open source.
> 7) Maven dependencies give you all the data needed to figure out what apps 
> use a schema like: group:myschema:3.0
> 8) A rest service that uses an avro object for payload can serve/accept data 
> in: application/octet-stream;fmt=avro (avro binary), 
> application/json;fmt=avro (classic json encoding), 
> application/json;fmt=enhanced(AVRO-1582) allowing us to pick the right format 
> for the right use case. (AVRO-1582 json can be significantly smaller in size 
> than binary on certain type of data)
> 9) During the build we generate improved HTML doc for the avro objects, like: 
> http://zolyfarkas.github.io/spf4j/spf4j-core/avrodoc.html#/ 
> 
> The more we leverage avro the more use cases we find like:
> 
> 1) config discovery plugin that scans code for uses of System.getProperty… 
> and generates an avro idl: 
> http://zolyfarkas.github.io/spf4j/spf4j-config-discovery-maven-plugin/index.html
> 2) generate avro idl from jdbc metadata...
> 
> hope it helps!
> 
> cheers
> 
> —Z
> 
> 
>> On Aug 11, 2016, at 6:23 AM, Elliot West <tea...@gmail.com> wrote:

Re: Avro as a foundation of a JSON based system

2016-08-11 Thread Zoltan Farkas
We are doing the same successfully so far… here is some detail:

we do not use the standard JSON Encoders/Decoders from the avro project and we 
have our own which provide a more “natural” JSON encoding that implements:

https://issues.apache.org/jira/browse/AVRO-1582

For us it was also important to fix:

https://issues.apache.org/jira/browse/AVRO-1723

We had to use our own fork to be able to fix/implement our needs faster, which 
you can look at: https://github.com/zolyfarkas/avro

Here is how we use the avro schemas:

We develop our avro schemas in separate projects, "schema projects".

These projects are standard maven projects, stored in version control, built 
with CI, and publishing the following to a maven repo:
1) avro generated java objects, sources and javadoc.
2) c# generated objects. (accessible with NuGet to everybody)
3) zip package containing all schemas.

We use avro IDL to define the schemas in the project; the avsc json format is 
difficult to read and maintain, so the schema json is only a wire format for us.

We see these advantages:

1) Building/Releasing a schema project is identical with releasing any maven 
project. (Jenkins, maven release plugin...)
2) Using this we can take advantage of the maven dependency system and reuse 
schemas. It is as simple as adding a <dependency> in your pom and an import 
statement in your idl. (C# uses NuGet)
3) As a side result our maven repo becomes a schema repo. And so far we see no 
reason to use a dedicated schema repo like: 
https://issues.apache.org/jira/browse/AVRO-1124
4) the schema owner not only publishes schemas but also publishes all DTOs for 
java and .NET; this way any team that needs to use the schema has no need to 
generate code, all they need is to add a package dependency to their project.
5) During the build we also validate compatibility with the previously released 
schemas (see the sketch after this list).
6) During the build we also validate schema quality. (comments on fields, 
naming…). We are planning to make this maven plugin open source.
7) Maven dependencies give you all the data needed to figure out what apps use 
a schema like: group:myschema:3.0
8) A rest service that uses an avro object for payload can serve/accept data 
in: application/octet-stream;fmt=avro (avro binary), application/json;fmt=avro 
(classic json encoding), application/json;fmt=enhanced(AVRO-1582) allowing us 
to pick the right format for the right use case. (AVRO-1582 json can be 
significantly smaller in size than binary on certain type of data)
9) During the build we generate improved HTML doc for the avro objects, like: 
http://zolyfarkas.github.io/spf4j/spf4j-core/avrodoc.html#/
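
For point 5, a minimal sketch of what such a build-time compatibility check can
look like with the stock org.apache.avro.SchemaValidator API; the two toy schemas
here are made up, and a real build would load the previously released ones from
the maven repo:

import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.SchemaValidationException;
import org.apache.avro.SchemaValidator;
import org.apache.avro.SchemaValidatorBuilder;

public class CompatibilityCheck {
  public static void main(String[] args) throws SchemaValidationException {
    Schema v1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"T\",\"fields\":"
        + "[{\"name\":\"a\",\"type\":\"string\"}]}");
    Schema v2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"T\",\"fields\":"
        + "[{\"name\":\"a\",\"type\":\"string\"},"
        + "{\"name\":\"b\",\"type\":\"int\",\"default\":0}]}");

    // Fail the build unless the new schema can read everything the old ones wrote.
    SchemaValidator backward =
        new SchemaValidatorBuilder().canReadStrategy().validateAll();
    List<Schema> previouslyReleased = Arrays.asList(v1);
    backward.validate(v2, previouslyReleased); // throws on an incompatible change
    System.out.println("v2 is backward compatible with v1");
  }
}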

The more we leverage avro the more use cases we find like:

1) config discovery plugin that scans code for uses of System.getProperty… and 
generates an avro idl: 
http://zolyfarkas.github.io/spf4j/spf4j-config-discovery-maven-plugin/index.html
2) generate avro idl from jdbc metadata...

hope it helps!

cheers

—Z


> On Aug 11, 2016, at 6:23 AM, Elliot West  wrote:
> 
> Hello,
> 
> We are building a data processing system that has the following required 
> properties:
> Data is produced/consumed in JSON format
> These JSON documents must always adhere to a schema
> The schema must be defined in JSON also
> It should be possible to evolve schemas and verify schema compatibility
> I initially started looking at Avro, not as a solution, but to understand how 
> its schema evolution can be managed. However, I quickly discovered that with 
> its JSON support it is able to meet all of my requirements.
> 
> I am now considering a system where data structure is defined using the Avro 
> JSON schema, data is submitted using JSON that is then internally decoded 
> into Avro records, these records are then eventually encoded back into JSON 
> at the point of consumption. It seems to me that I can then take advantage of 
> Avro’s schema evolution features, while only ever exposing JSON to consumers 
> and producers. Aside from the dependency on Avro’s JSON schema syntax, the 
> use of Avro then becomes an internal implementation detail.
> 
> As I am completely new to Avro, I was wondering if this is a credible idea, 
> or if anyone would care to share their experiences of similar systems that 
> they have built?
> 
> Many thanks,
> 
> Elliot.
>