RE: Formal spec for Avro Schema

2024-05-15 Thread Clemens Vasters via user
Hi Martin,

I am saying that the specification of the schema is currently entangled with 
the specification of the serialization framework. Avro Schema is useful and 
usable even if you never touch the Avro binaries (the framework, i.e. an
implementation of the spec).

I am indeed proposing to separate the schema spec from the specs of the Avro 
binary encoding and the Avro JSON encoding, which also avoids strange 
entanglements like the JSON encoding pointing to the schema description’s 
default values section, which is itself rather imprecise; e.g., the encoding
rule for bytes or fixed is "defined" by nothing more than the terse example:
"\u00ff"

Microsoft would like to propose Avro and Avro Schema in several standardization 
efforts, but we need a spec that works in those contexts and that can stand on 
its own. I would also like to see “application/avro” as a formal media type, 
but the route towards that only goes via formal standardization of both schema 
and encodings.

I believe the Avro project’s reach and importance is such that schema and 
encodings should have formal specs that can stand on their own as JSON and CBOR 
and AMQP and XML and OPC/Binary and other serialization schemas/formats do. I 
don’t think existence of a formal spec gets in the way of progress and Avro is 
so mature that the spec captures a fairly stable state.

Best Regards
Clemens

From: Martin Grigorov 
Sent: Wednesday, May 15, 2024 10:54 AM
To: d...@avro.apache.org
Cc: user@avro.apache.org
Subject: Re: Formal spec for Avro Schema

Hi Clemens,

On Wed, May 15, 2024 at 11:18 AM Clemens Vasters 
mailto:cleme...@microsoft.com.invalid>> wrote:
Hi Martin,

we find Avro Schema to be a great fit for describing application data 
structures in general and even independent of wire-serialization scenarios.

Therefore, I would like to have a spec that focuses specifically on the schema 
format, is grounded in the IETF RFC specs, and which follows the conventions 
set by IETF, so that folks who need a sane schema format to describe data 
structures independent of implementation can use that.

Are you saying that the specification document is implementation-dependent?
If so, maybe we should work on improving it instead of duplicating it.


The benefit for the Avro serialization framework of having such a formal spec 
that is untangled from the wire-serialization specs is that all schemas defined 
by that schema model are compatible with the framework.

What do you mean by "framework" here?


The differences are organization, scope, and language style (including keywords 
etc.). The expressed ruleset is the same.

I don't think it is a good idea to add a second document that is very similar
to the specification but uses a different language style.
To me this looks like duplication.
IMO it would be better to suggest (many) (smaller) improvements to the
existing document.



Best Regards
Clemens

-Original Message-
From: Martin Grigorov <mgrigo...@apache.org>
Sent: Wednesday, May 15, 2024 9:26 AM
To: d...@avro.apache.org
Cc: user@avro.apache.org
Subject: Re: Formal spec for Avro Schema


Hi Clemens,

What is the difference between your document and the specification [1] ?
I haven't read it completely but it looks very similar to the specification to 
me.

1. https://avro.apache.org/docs/1.11.1/specification/
2. https://github.com/apache/avro/tree/main/doc/content/en/docs/%2B%2Bversion%2B%2B/Specification
(sources of the specification)

On Wed, May 15, 2024 at 9:28 AM Clemens Vasters
<cleme...@microsoft.com.invalid> wrote:

> I wrote a formal spec for the Avro Schema format.
>
>
>
> https://gist.github.com/clemensv/498c481965c425b218ee156b38b49333
>
>
>
> Where would that go in the repo?
>
>
>
>
>
>
>
> *Clemens Vasters*
>
> Messaging Platform Architect
>
> Microsoft Azure
>
> +49 151 44063557
>
> cleme...@microsoft.com

RE: Formal spec for Avro Schema

2024-05-15 Thread Clemens Vasters via user
Hi Martin,

we find Avro Schema to be a great fit for describing application data 
structures in general and even independent of wire-serialization scenarios.

Therefore, I would like to have a spec that focuses specifically on the schema 
format, is grounded in the IETF RFC specs, and which follows the conventions 
set by IETF, so that folks who need a sane schema format to describe data 
structures independent of implementation can use that.

The benefit for the Avro serialization framework of having such a formal spec 
that is untangled from the wire-serialization specs is that all schemas defined 
by that schema model are compatible with the framework.

The differences are organization, scope, and language style (including keywords 
etc.). The expressed ruleset is the same.

Best Regards
Clemens

-Original Message-
From: Martin Grigorov 
Sent: Wednesday, May 15, 2024 9:26 AM
To: d...@avro.apache.org
Cc: user@avro.apache.org
Subject: Re: Formal spec for Avro Schema


Hi Clemens,

What is the difference between your document and the specification [1] ?
I haven't read it completely but it looks very similar to the specification to 
me.

1. https://avro.apache.org/docs/1.11.1/specification/
2. https://github.com/apache/avro/tree/main/doc/content/en/docs/%2B%2Bversion%2B%2B/Specification
(sources of the specification)

On Wed, May 15, 2024 at 9:28 AM Clemens Vasters wrote:

> I wrote a formal spec for the Avro Schema format.
>
>
>
> https://gist.github.com/clemensv/498c481965c425b218ee156b38b49333
>
>
>
> Where would that go in the repo?
>
>
>
>
>
>
>
> *Clemens Vasters*
>
> Messaging Platform Architect
>
> Microsoft Azure
>
> +49 151 44063557
>
> cleme...@microsoft.com
> European Microsoft Innovation Center GmbH | Gewürzmühlstrasse 11 | 80539
> Munich | Germany
> Geschäftsführer/General Managers: Keith Dolliver, Benjamin O. Orndorff
> Amtsgericht Aachen, HRB 12066
>
>
>
>
>


Formal spec for Avro Schema

2024-05-15 Thread Clemens Vasters via user
I wrote a formal spec for the Avro Schema format.

https://gist.github.com/clemensv/498c481965c425b218ee156b38b49333

Where would that go in the repo?


Clemens Vasters
Messaging Platform Architect
Microsoft Azure
+49 151 44063557
cleme...@microsoft.com
European Microsoft Innovation Center GmbH | Gewürzmühlstrasse 11 | 80539
Munich | Germany
Geschäftsführer/General Managers: Keith Dolliver, Benjamin O. Orndorff
Amtsgericht Aachen, HRB 12066




Re: Avro JSON Encoding

2024-04-24 Thread Clemens Vasters via user
Hi JB,

since there seems to be interest in the group, even if not full consensus on
the scope, I propose to open an umbrella issue focused more on the "what" and
"how" than on the "why" of my opening email, which can then be broken down
into individual feature issues. I can work on that early next week.

Best Regards
Clemens


From: Jean-Baptiste Onofré
Sent: Thursday, April 18, 2024 10:58 AM
To: Clemens Vasters
Cc: Jean-Baptiste Onofré; user@avro.apache.org
Subject: Re: Avro JSON Encoding

Hi Clemens,

I propose to wait a bit to give a chance to the community to review
your email and points.

Then, we will create the Jira accordingly.

Regards
JB

On Thu, Apr 18, 2024 at 9:20 AM Clemens Vasters  wrote:
>
> Hi JB,
>
>
>
> I have not done that yet. I’m happy to break that up into items once I get 
> the sense that this is a direction we can get to a consensus on.
>
>
>
> Shall I file the whole email as a “New Feature” issue first?
>
>
>
> Thanks
>
> Clemens
>
>
>
> From: Jean-Baptiste Onofré 
> Sent: Thursday, April 18, 2024 10:17 AM
> To: Clemens Vasters ; user@avro.apache.org
> Subject: Re: Avro JSON Encoding
>
>
>
> Hi Clemens
>
>
>
> Thanks for the detailed email.
>
>
>
> Quick question: did you already create a Jira for each improvement/issue?
>
>
>
> I will take the time to read asap.
>
>
>
> Thanks
>
> Regards
>
> JB
>
>
>
> On Thu, Apr 18, 2024 at 09:12, Clemens Vasters via user
> wrote:
>
> Hi everyone,
>
>
>
> the current JSON Encoding approach severely limits interoperability with 
> other JSON serialization frameworks. In my view, the JSON Encoding is only 
> really useful if it acts as a bridge into and from JSON-centric applications 
> and it currently gets in its own way.
>
>
>
> The current encoding being what it is, there should be an alternate mode that 
> emphasizes interoperability with JSON “as-is” and allows Avro Schema to 
> describe existing JSON document instances such that I can take someone’s 
> existing JSON document in on one side of a piece of software and emit Avro 
> binary on the other side while acting on the same schema.
>
>
>
> There are four specific issues:
>
>
>
> 1. Binary Values
> 2. Unions with Primitive Type Values and Enum Values
> 3. Unions with Record Values
> 4. DateTime
>
>
>
> One by one:
>
>
>
> 1. Binary values:
>
> -
>
>
>
> Binary values (fixed and bytes) are encoded as escaped Unicode literals. 
> While I appreciate the creative trick, it costs 6 bytes for each encoded 
> byte. I have a hard time finding any JSON libraries that provide a conversion 
> of such strings from/to byte arrays, so this approach appears to be 
> idiosyncratic for Avro’s JSON Encoding.
>
>
>
> The common way to encode binary in JSON is to use base64 encoding and that is 
> widely and well supported in libraries. Base64 is 33% larger than plain 
> bytes, the encoding chosen here is 500% (!) larger than plain bytes.
>
>
>
> The Avro decoder is schema-informed and it knows that a field is expected to 
> hold bytes, so it’s easy to mandate base64 for the field content in the 
> alternate mode.
>
>
>
> 2. Unions with Primitive Type Values and Enum Values
>
> -
>
>
>
> It’s common to express optionality in Avro Schema by creating a union with 
> the “null” type, e.g. [“string”, “null”]. The Avro JSON Encoding opts to 
> encode such unions, like any union, as { “{type}”: {value} } when the value 
> is non-null.
>
>
>
> This choice ignores common practice and the fact that JSON’s values are 
> dynamically typed (RFC8259 Section-3) and inherently accommodate unions. The 
> conformant way to encode a value choice of null or “string” into a JSON value 
> is plainly null and “string”.
>
>
>
> “foo” : null
>
> “foo”: “value”
>
>
>
> The “field default values” table in the Avro spec maps Avro types to the JSON 
> types null, boolean, integer, number, string, object, and array, all of which 
> can be encoded into and, more importantly, unambiguously decoded from a JSON 
> value. The only semi-ambiguous case is integer vs. number, which is a 
> convention in JSON rather than a distinct type, but any Avro serializer is 
> guided by type information and can easily make that distinction.
>
>
>
> 3. Unions with Record Values
>
> -
>
>
>
> The JSON Encoding pattern of unions also covers “record” typed values, of 
> course, and this is indeed a tricky scenario during deserialization since 
> JSON does not have any built-in notion of type hints for “object” typed 
> values.

Re: Avro JSON Encoding

2024-04-23 Thread Clemens Vasters via user
> Kind regards,
> Oscar
>
>
> --
> Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>
>
> On Thu, Apr 18, 2024 at 10:12, Clemens Vasters via user
> <user@avro.apache.org> wrote:
>>
>> Hi everyone,
>>
>>
>>
>> the current JSON Encoding approach severely limits interoperability with 
>> other JSON serialization frameworks. In my view, the JSON Encoding is only 
>> really useful if it acts as a bridge into and from JSON-centric applications 
>> and it currently gets in its own way.
>>
>>
>>
>> The current encoding being what it is, there should be an alternate mode 
>> that emphasizes interoperability with JSON “as-is” and allows Avro Schema to 
>> describe existing JSON document instances such that I can take someone’s 
>> existing JSON document in on one side of a piece of software and emit Avro 
>> binary on the other side while acting on the same schema.
>>
>>
>>
>> There are four specific issues:
>>
>>
>>
>> 1. Binary Values
>> 2. Unions with Primitive Type Values and Enum Values
>> 3. Unions with Record Values
>> 4. DateTime
>>
>>
>>
>> One by one:
>>
>>
>>
>> 1. Binary values:
>>
>> -
>>
>>
>>
>> Binary values (fixed and bytes) are encoded as escaped Unicode literals. 
>> While I appreciate the creative trick, it costs 6 bytes for each encoded 
>> byte. I have a hard time finding any JSON libraries that provide a 
>> conversion of such strings from/to byte arrays, so this approach appears to 
>> be idiosyncratic for Avro’s JSON Encoding.
>>
>>
>>
>> The common way to encode binary in JSON is to use base64 encoding and that 
>> is widely and well supported in libraries. Base64 is 33% larger than plain 
>> bytes, the encoding chosen here is 500% (!) larger than plain bytes.
>>
>>
>>
>> The Avro decoder is schema-informed and it knows that a field is expected to 
>> hold bytes, so it’s easy to mandate base64 for the field content in the 
>> alternate mode.
>>
>>
>>
>> 2. Unions with Primitive Type Values and Enum Values
>>
>> -
>>
>>
>>
>> It’s common to express optionality in Avro Schema by creating a union with 
>> the “null” type, e.g. [“string”, “null”]. The Avro JSON Encoding opts to 
>> encode such unions, like any union, as { “{type}”: {value} } when the value 
>> is non-null.
>>
>>
>>
>> This choice ignores common practice and the fact that JSON’s values are 
>> dynamically typed (RFC8259 Section-3) and inherently accommodate unions. The 
>> conformant way to encode a value choice of null or “string” into a JSON 
>> value is plainly null and “string”.
>>
>>
>>
>> “foo” : null
>>
>> “foo”: “value”
>>
>>
>>
>> The “field default values” table in the Avro spec maps Avro types to the 
>> JSON types null, boolean, integer, number, string, object, and array, all of 
>> which can be encoded into and, more importantly, unambiguously decoded from 
>> a JSON value. The only semi-ambiguous case is integer vs. number, which is a 
>> convention in JSON rather than a distinct type, but any Avro serializer is 
>> guided by type information and can easily make that distinction.
>>
>>
>>
>> 3. Unions with Record Values
>>
>> -
>>
>>
>>
>> The JSON Encoding pattern of unions also covers “record” typed values, of 
>> course, and this is indeed a tricky scenario during deserialization since 
>> JSON does not have any built-in notion of type hints for “object” typed 
>> values.
>>
>>
>>
>> The problem of having to disambiguate instances of different types in a 
>> field value is a common one also for users of JSON Schema when using the 
>> “oneOf” construct, which is equivalent to Avro unions. There are two common 
>> strategies:
>>
>>
>>
>> - “Duck Typing”:  Every conformant JSON Schema Validator determines the 
>> validity of a JSON node against a “oneOf" rule by testing the instance 
>> against all available alternative schema definitions. Validation fails if 
>> there is not exactly one valid match.
>>
>> - Discriminators: OpenAPI, for instance, mandates a “discriminator” field 
>> (see https://spec.openapis.org/oas/latest.html#discriminator-object) for 
>> disambiguating “oneOf” constructs, whereby the discriminator property is 
>> part of each instance. That approach informs numerous JSON serialization 
>> frameworks, which implement discriminators under that assumption.

Re: Avro JSON Encoding

2024-04-19 Thread Clemens Vasters via user
Thank you, Ryan. I am specifically trying to avoid JSON specific attributes 
that would not be otherwise useful (hence "const" and "displayname") and I do 
indeed imagine the alternate encoding to be selected by a new switch on the 
encoders.

Sent from Outlook for iOS <https://aka.ms/o0ukef>

From: Ryan Skraba
Sent: Friday, April 19, 2024 5:57:37 PM
To: user@avro.apache.org
Cc: Clemens Vasters
Subject: Re: Avro JSON Encoding


Hello!

A bit tongue in cheek: the one advantage of the current Avro JSON
encoding is that it drives users rapidly to prefer the binary
encoding!  In its current state, Avro isn't really a satisfactory
toolkit for JSON interoperability, while it shines for binary
interoperability. Using JSON with Avro schemas is pretty unwieldy and
a JSON data designer will almost never be entirely satisfied with the
JSON "shape" they can get... today it's useful for testing and
debugging.

That being said, it's hard to argue with improving this experience
where it can help developers that really want to use Avro JSON for
data transfer, especially for things accepting JSON where the
intention is clearly unambiguous or allowing optional attributes to be
missing.  I'd be enthusiastic to see some of these improvements,
especially if we keep the possibility of generating strict Avro JSON
for forwards and backwards compatibility.

My preference would be to avoid adding JSON-specific attributes to the
spec where possible.  Maybe we could consider implementing Avro JSON
"variants" by implementing encoder options, or alternative encorders
for an SDK. There's probably a nice balance between a rigorous and
interoperable (but less customizable) JSON encoding, and trying to
accommodate arbitrary JSON in the Avro project.

All my best and thanks for this analysis -- I'm excited to see where
this leads!  Ryan









On Thu, Apr 18, 2024 at 8:01 PM Oscar Westra van Holthe - Kind
 wrote:
>
> Thank you Clemens,
>
> This is a very detailed set of proposals, and it looks like it would work.
>
> I do, however, feel we'd need to define a way to handle unions with records.
> Your proposal lists various options, of which the discriminator option seems
> the most portable to me.
>
> You mention the "displayName" proposal. I don't like that, as it mixes data 
> with UI elements. The discriminator option can specify a fixed or 
> configurable field to hold the type of the record.
>
> Kind regards,
> Oscar
>
>
> --
> Oscar Westra van Holthe - Kind 
>
> On Thu, Apr 18, 2024 at 10:12, Clemens Vasters via user
> wrote:
>>
>> Hi everyone,
>>
>>
>>
>> the current JSON Encoding approach severely limits interoperability with 
>> other JSON serialization frameworks. In my view, the JSON Encoding is only 
>> really useful if it acts as a bridge into and from JSON-centric applications 
>> and it currently gets in its own way.
>>
>>
>>
>> The current encoding being what it is, there should be an alternate mode 
>> that emphasizes interoperability with JSON “as-is” and allows Avro Schema to 
>> describe existing JSON document instances such that I can take someone’s 
>> existing JSON document in on one side of a piece of software and emit Avro 
>> binary on the other side while acting on the same schema.
>>
>>
>>
>> There are four specific issues:
>>
>>
>>
>> 1. Binary Values
>> 2. Unions with Primitive Type Values and Enum Values
>> 3. Unions with Record Values
>> 4. DateTime
>>
>>
>>
>> One by one:
>>
>>
>>
>> 1. Binary values:
>>
>> -
>>
>>
>>
>> Binary values (fixed and bytes) are encoded as escaped Unicode literals. 
>> While I appreciate the creative trick, it costs 6 bytes for each encoded 
>> byte. I have a hard time finding any JSON libraries that provide a 
>> conversion of such strings from/to byte arrays, so this approach appears to 
>> be idiosyncratic for Avro’s JSON Encoding.
>>
>>
>>
>> The common way to encode binary in JSON is to use base64 encoding and that 
>> is widely and well supported in libraries. Base64 is 33% larger than plain 
>> bytes, the encoding chosen here is 500% (!) larger than plain bytes.
>>
>>
>>
>> The Avro decoder is schema-informed and it knows that a field is expected to 
>> hold bytes, so it’s easy to mandate base64 for the field content in the 
>> alternate mode.
>>
>>
>>
>> 2. Unions with Primitive Type Values and Enum Values

Re: Avro JSON Encoding

2024-04-18 Thread Clemens Vasters via user
The discriminator is "const".

I added "displayname" because I also have other scenarios for it and it appears
like a good workaround for alias names that do not fit the "name" constraints,
e.g. "$type". I am not passionate about "displayname", but "alias" is taken and
it's going to be a user-supplied name in other scenarios. Two attributes with
similar functions would be a bit much.

To illustrate how I imagine "const" working:

[
  {
    "type": "record",
    "fields": [
      {
        "name": "typename",
        "type": "string",
        "const": "cat"
      },
      ... cat things ...
    ]
  },
  {
    "type": "record",
    "fields": [
      {
        "name": "typename",
        "type": "string",
        "const": "dog"
      },
      ... dog things ...
    ]
  }
]

(Sorry about formatting being bad, did that on the phone)

To handle anyone's JSON the decoder will still have to support the duck typing 
that JSON Schema needs to do for oneOf, but the "const" declaration provides a 
cheap first option to test for before having to probe the whole structure. So 
even though the model is technically duck typing, the const declaration 
shortcuts it completely in an efficient implementation that looks there first.

You would use "const" on the fields that other frameworks designate as the 
discriminator and the value would be whatever is set by the publisher to 
identify the type they write.

With "displayname", assuming the publisher uses "$type" as the discriminator:

[
  {
    "type": "record",
    "fields": [
      {
        "name": "typename",
        "displayname": "$type",
        "type": "string",
        "const": "cat"
      },
      ... cat things ...
    ]
  },
  {
    "type": "record",
    "fields": [
      {
        "name": "typename",
        "displayname": "$type",
        "type": "string",
        "const": "dog"
      },
      ... dog things ...
    ]
  }
]



From: Oscar Westra van Holthe - Kind
Sent: Thursday, April 18, 2024 8:00 PM
To: user@avro.apache.org; Clemens Vasters
Subject: Re: Avro JSON Encoding

Thank you Clemens,

This is a very detailed set of proposals, and it looks like it would work.

I do, however, feel we'd need to define a way to handle unions with records.
Your proposal lists various options, of which the discriminator option seems
the most portable to me.

You mention the "displayName" proposal. I don't like that, as it mixes data 
with UI elements. The discriminator option can specify a fixed or configurable 
field to hold the type of the record.

Kind regards,
Oscar


--
Oscar Westra van Holthe - Kind <os...@westravanholthe.nl>

On Thu, Apr 18, 2024 at 10:12, Clemens Vasters via user
<user@avro.apache.org> wrote:
Hi everyone,

the current JSON Encoding approach severely limits interoperability with other 
JSON serialization frameworks. In my view, the JSON Encoding is only really 
useful if it acts as a bridge into and from JSON-centric applications and it 
currently gets in its own way.

The current encoding being what it is, there should be an alternate mode that 
emphasizes interoperability with JSON “as-is” and allows Avro Schema to 
describe existing JSON document instances such that I can take someone’s 
existing JSON document in on one side of a piece of software and emit Avro 
binary on the other side while acting on the same schema.

There are four specific issues:


  1.  Binary Values
  2.  Unions with Primitive Type Values and Enum Values
  3.  Unions with Record Values
  4.  DateTime

One by one:

1. Binary values:
-

Binary values (fixed and bytes) are encoded as escaped Unicode literals. 
While I appreciate the creative trick, it costs 6 bytes for each encoded byte. 
I have a hard time finding any JSON libraries that provide a conversion of such 
strings from/to byte arrays, so this approach appears to be idiosyncratic for 
Avro’s JSON Encoding.

The common way to encode binary in JSON is to use base64 encoding and that is 
widely and well supported in libraries. Base64 is 33% larger than plain bytes, 
the encoding chosen here is 500% (!) larger than plain bytes.

The Avro decoder is schema-informed and it knows that a field is expected to 
hold bytes, so it's easy to mandate base64 for the field content in the 
alternate mode.

RE: Avro JSON Encoding

2024-04-18 Thread Clemens Vasters via user
I literally do the “FWIW” here: 
https://github.com/clemensv/avrotize?tab=readme-ov-file#convert-json-schema-to-avro-schema

From: Andrew Otto 
Sent: Thursday, April 18, 2024 2:24 PM
To: user@avro.apache.org
Cc: Clemens Vasters 
Subject: Re: Avro JSON Encoding

This is a great proposal.  At the Wikimedia Foundation, we've explicitly chosen 
to use JSON as our streaming serialization 
format<https://techblog.wikimedia.org/2020/09/10/wikimedias-event-data-platform-or-json-is-ok-too/>.
  We considered using Avro JSON, but the need to use an Avro specific 
serialization for nullable types was the main reason we chose not to do so.  
We'd love to be able to more automatically convert between JSON and Avro 
Binary, and a proposal like this should allow us to do so!

> The conformant way to encode a value choice of null or “string” into a JSON 
> value is plainly null and “string”.
This is true, but we decided to do this in a different way.  In JSONSchema, 
'optional' fields are marked as such by not including them in the list of 
required fields.  So, instead of explicitly encoding an optional field value as 
'null', producers omit the field entirely.  When converting to different type 
systems (Flink, Spark, etc.) our converters explicitly always use the 
JSONSchema<https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia-event-utilities/+/refs/heads/master/eventutilities/src/main/java/org/wikimedia/eventutilities/core/event/types/JsonSchemaConverter.java>,
 so we know if a field should be present and nulled, even if it is omitted in 
the incoming record data.

FWIW, I believe this proposal could make JSONSchema and Avro Schemas equivalent 
(enough) to automatically generate one from the other, and use Avro libs to 
serialize/deserialize JSON directly.  Very cool!

-Andrew Otto
 Wikimedia Foundation



On Thu, Apr 18, 2024 at 4:17 AM Jean-Baptiste Onofré
<j...@nanthrax.net> wrote:
Hi Clemens

Thanks for the detailed email.

Quick question: did you already create a Jira for each improvement/issue?

I will take the time to read asap.

Thanks
Regards
JB

On Thu, Apr 18, 2024 at 09:12, Clemens Vasters via user
<user@avro.apache.org> wrote:
Hi everyone,

the current JSON Encoding approach severely limits interoperability with other 
JSON serialization frameworks. In my view, the JSON Encoding is only really 
useful if it acts as a bridge into and from JSON-centric applications and it 
currently gets in its own way.

The current encoding being what it is, there should be an alternate mode that 
emphasizes interoperability with JSON “as-is” and allows Avro Schema to 
describe existing JSON document instances such that I can take someone’s 
existing JSON document in on one side of a piece of software and emit Avro 
binary on the other side while acting on the same schema.

There are four specific issues:


  1.  Binary Values
  2.  Unions with Primitive Type Values and Enum Values
  3.  Unions with Record Values
  4.  DateTime

One by one:

1. Binary values:
-

Binary values (fixed and bytes) are encoded as escaped Unicode literals. 
While I appreciate the creative trick, it costs 6 bytes for each encoded byte. 
I have a hard time finding any JSON libraries that provide a conversion of such 
strings from/to byte arrays, so this approach appears to be idiosyncratic for 
Avro’s JSON Encoding.

The common way to encode binary in JSON is to use base64 encoding and that is 
widely and well supported in libraries. Base64 is 33% larger than plain bytes, 
the encoding chosen here is 500% (!) larger than plain bytes.

The Avro decoder is schema-informed and it knows that a field is expected to 
hold bytes, so it’s easy to mandate base64 for the field content in the 
alternate mode.

2. Unions with Primitive Type Values and Enum Values
-

It’s common to express optionality in Avro Schema by creating a union with the 
“null” type, e.g. [“string”, “null”]. The Avro JSON Encoding opts to encode 
such unions, like any union, as { “{type}”: {value} } when the value is 
non-null.

This choice ignores common practice and the fact that JSON’s values are 
dynamically typed (RFC8259 
Section-3<https://www.rfc-editor.org/rfc/rfc8259#section-3>) and inherently 
accommodate unions. The conformant way to encode a value choice of null or 
“string” into a JSON value is plainly null and “string”.

“foo” : null
“foo”: “value”

The “field default values” table in the Avro spec maps Avro types to the JSON 
types null, boolean, integer, number, string, object, and array, all of which 
can be encoded into and, more importantly, unambiguously decoded from a JSON 
value. The only semi-ambiguous case is integer vs. number, which is a 
convention in JSON rather than a distinct type, but any Avro serializer is 
guided by type information and can easily make that distinction.

"Avrotize" tool

2024-04-18 Thread Clemens Vasters via user
Hi everyone,

I'm interested in feedback on the "Avrotize" tool:

Git: https://github.com/clemensv/avrotize  
PyPI: https://pypi.org/project/avrotize/

Avrotize is a command-line tool for converting data structure definitions 
between different schema formats, using Apache Avro Schema as the integration 
schema model.

You can use the tool to convert between Avro Schema and other schema formats 
like JSON Schema, XML Schema (XSD), Protocol Buffers (Protobuf), ASN.1, and 
database schema formats like Apache Parquet files, Kusto Data Table Definition 
(KQL) and T-SQL Table Definition (MSSQL Server). I'm aiming to support more 
schemas, especially for databases. 

With this, you can also convert from JSON Schema to Protobuf going via Avro 
Schema.

You can also generate C#, Java, TypeScript, JavaScript, and Python code from 
the Avro Schema documents. The difference from the native Avro tools is that 
Avrotize can emit data classes without Avro library dependencies and, 
optionally, with annotations for JSON serialization libraries like Jackson or 
System.Text.Json. The C# code generator is furthest along in terms of 
serialization helper capabilities, but I'll bring the Java version up to that 
level next week. 

The JSON Schema to Avro Schema conversion has its own page: 
https://github.com/clemensv/avrotize/blob/master/jsonschema.md

Best Regards
Clemens


Clemens Vasters
Messaging Platform Architect
Microsoft Azure
cleme...@microsoft.com   
European Microsoft Innovation Center GmbH | Gewürzmühlstrasse 11 | 80539
Munich | Germany
Geschäftsführer/General Managers: Keith Dolliver, Benjamin O. Orndorff 
Amtsgericht Aachen, HRB 12066




RE: Avro JSON Encoding

2024-04-18 Thread Clemens Vasters via user
Hi JB,

I have not done that yet. I'm happy to break that up into items once I get the 
sense that this is a direction we can get to a consensus on.

Shall I file the whole email as a "New Feature" issue first?

Thanks
Clemens

From: Jean-Baptiste Onofré 
Sent: Thursday, April 18, 2024 10:17 AM
To: Clemens Vasters ; user@avro.apache.org
Subject: Re: Avro JSON Encoding

Hi Clemens

Thanks for the detailed email.

Quick question: did you already create a Jira for each improvement/issue?

I will take the time to read asap.

Thanks
Regards
JB

On Thu, Apr 18, 2024 at 09:12, Clemens Vasters via user
<user@avro.apache.org> wrote:
Hi everyone,

the current JSON Encoding approach severely limits interoperability with other 
JSON serialization frameworks. In my view, the JSON Encoding is only really 
useful if it acts as a bridge into and from JSON-centric applications and it 
currently gets in its own way.

The current encoding being what it is, there should be an alternate mode that 
emphasizes interoperability with JSON "as-is" and allows Avro Schema to 
describe existing JSON document instances such that I can take someone's 
existing JSON document in on one side of a piece of software and emit Avro 
binary on the other side while acting on the same schema.

There are four specific issues:


  1.  Binary Values
  2.  Unions with Primitive Type Values and Enum Values
  3.  Unions with Record Values
  4.  DateTime

One by one:

1. Binary values:
-

Binary values (fixed and bytes) are encoded as escaped Unicode literals. 
While I appreciate the creative trick, it costs 6 bytes for each encoded byte. 
I have a hard time finding any JSON libraries that provide a conversion of such 
strings from/to byte arrays, so this approach appears to be idiosyncratic for 
Avro's JSON Encoding.

The common way to encode binary in JSON is to use base64 encoding and that is 
widely and well supported in libraries. Base64 is 33% larger than plain bytes, 
the encoding chosen here is 500% (!) larger than plain bytes.

The Avro decoder is schema-informed and it knows that a field is expected to 
hold bytes, so it's easy to mandate base64 for the field content in the 
alternate mode.

2. Unions with Primitive Type Values and Enum Values
-

It's common to express optionality in Avro Schema by creating a union with the 
"null" type, e.g. ["string", "null"]. The Avro JSON Encoding opts to encode 
such unions, like any union, as { "{type}": {value} } when the value is 
non-null.

This choice ignores common practice and the fact that JSON's values are 
dynamically typed (RFC8259 
Section-3<https://www.rfc-editor.org/rfc/rfc8259#section-3>) and inherently 
accommodate unions. The conformant way to encode a value choice of null or 
"string" into a JSON value is plainly null and "string".

"foo" : null
"foo": "value"

The "field default values" table in the Avro spec maps Avro types to the JSON 
types null, boolean, integer, number, string, object, and array, all of which 
can be encoded into and, more importantly, unambiguously decoded from a JSON 
value. The only semi-ambiguous case is integer vs. number, which is a 
convention in JSON rather than a distinct type, but any Avro serializer is 
guided by type information and can easily make that distinction.

3. Unions with Record Values
-

The JSON Encoding pattern of unions also covers "record" typed values, of 
course, and this is indeed a tricky scenario during deserialization since JSON 
does not have any built-in notion of type hints for "object" typed values.

The problem of having to disambiguate instances of different types in a field 
value is a common one also for users of JSON Schema when using the "oneOf" 
construct, which is equivalent to Avro unions. There are two common strategies:

- "Duck Typing":  Every conformant JSON Schema Validator determines the 
validity of a JSON node against a "oneOf" rule by testing the instance against 
all available alternative schema definitions. Validation fails if there is not 
exactly one valid match.
- Discriminators: OpenAPI, for instance, mandates a "discriminator" field (see 
https://spec.openapis.org/oas/latest.html#discriminator-object) for 
disambiguating "oneOf" constructs, whereby the discriminator property is part 
of each instance. That approach informs numerous JSON serialization frameworks, 
which implement discriminators under that assumption.

The Java Jackson library indeed supports the Avro JSON Encoding's style of 
putting the discriminator into a wrapper field name (JsonTypeInfo annotation, 
JsonTypeInfo.As.WRAPPER_OBJECT). Many other frameworks only support the 
property approach, though, including the two dominant ones for .NET, Pydantic
for Python, and others.

Avro JSON Encoding

2024-04-18 Thread Clemens Vasters via user
Hi everyone,

the current JSON Encoding approach severely limits interoperability with other 
JSON serialization frameworks. In my view, the JSON Encoding is only really 
useful if it acts as a bridge into and from JSON-centric applications and it 
currently gets in its own way.

The current encoding being what it is, there should be an alternate mode that 
emphasizes interoperability with JSON "as-is" and allows Avro Schema to 
describe existing JSON document instances such that I can take someone's 
existing JSON document in on one side of a piece of software and emit Avro 
binary on the other side while acting on the same schema.

There are four specific issues:


  1.  Binary Values
  2.  Unions with Primitive Type Values and Enum Values
  3.  Unions with Record Values
  4.  DateTime

One by one:

1. Binary values:
-

Binary values (fixed and bytes) are encoded as escaped Unicode literals. 
While I appreciate the creative trick, it costs 6 bytes for each encoded byte. 
I have a hard time finding any JSON libraries that provide a conversion of such 
strings from/to byte arrays, so this approach appears to be idiosyncratic for 
Avro's JSON Encoding.

The common way to encode binary in JSON is to use base64 encoding and that is 
widely and well supported in libraries. Base64 is 33% larger than plain bytes, 
the encoding chosen here is 500% (!) larger than plain bytes.

The Avro decoder is schema-informed and it knows that a field is expected to 
hold bytes, so it's easy to mandate base64 for the field content in the 
alternate mode.

2. Unions with Primitive Type Values and Enum Values
-

It's common to express optionality in Avro Schema by creating a union with the 
"null" type, e.g. ["string", "null"]. The Avro JSON Encoding opts to encode 
such unions, like any union, as { "{type}": {value} } when the value is 
non-null.

This choice ignores common practice and the fact that JSON's values are 
dynamically typed (RFC8259 
Section-3) and inherently 
accommodate unions. The conformant way to encode a value choice of null or 
"string" into a JSON value is plainly null and "string".

"foo" : null
"foo": "value"

The "field default values" table in the Avro spec maps Avro types to the JSON 
types null, boolean, integer, number, string, object, and array, all of which 
can be encoded into and, more importantly, unambiguously decoded from a JSON 
value. The only semi-ambiguous case is integer vs. number, which is a 
convention in JSON rather than a distinct type, but any Avro serializer is 
guided by type information and can easily make that distinction.

3. Unions with Record Values
-

The JSON Encoding pattern of unions also covers "record" typed values, of 
course, and this is indeed a tricky scenario during deserialization since JSON 
does not have any built-in notion of type hints for "object" typed values.

The problem of having to disambiguate instances of different types in a field 
value is a common one also for users of JSON Schema when using the "oneOf" 
construct, which is equivalent to Avro unions. There are two common strategies:

- "Duck Typing":  Every conformant JSON Schema Validator determines the 
validity of a JSON node against a "oneOf" rule by testing the instance against 
all available alternative schema definitions. Validation fails if there is not 
exactly one valid match.
- Discriminators: OpenAPI, for instance, mandates a "discriminator" field (see 
https://spec.openapis.org/oas/latest.html#discriminator-object) for 
disambiguating "oneOf" constructs, whereby the discriminator property is part 
of each instance. That approach informs numerous JSON serialization frameworks, 
which implement discriminators under that assumption.

The Java Jackson library indeed supports the Avro JSON Encoding's style of 
putting the discriminator into a wrapper field name (JsonTypeInfo annotation, 
JsonTypeInfo.As.WRAPPER_OBJECT). Many other frameworks only support the 
property approach, though, including the two dominant ones for .NET, Pydantic 
for Python, and others. There's tooling like Redocly that flags that approach as 
a "mistake" (see 
https://redocly.com/docs/resources/discriminator/#property-outside-of-the-object).

What that means is that most existing JSON instances with ambiguous types will 
either use property discriminators or the implementation will rely on duck 
typing as JSON Schema does for validation. The Avro JSON Encoding approach is 
rare and is also counterintuitive for anyone comparing the declared object 
structure and the JSON structure who is not familiar with Avro's encoding 
rules. It has confused a lot of people in our house, for sure.

Proposed is the following approach:

a) add a new, optional "const" attribute that can be applied to any record 
field declaration that is of a primitive type. When present, the attribute 
causes the field to always have this constant value.