I drafted an AEP for unit metadata on schema: https://docs.google.com/document/d/1IeVAtf6YcAAn35D4jmFQJjPpEMgEu79wWVMW37KvNps/

On Tue, Jul 16, 2019 at 1:35 PM Erik Erlandson <eerla...@redhat.com> wrote:

> Hi Ryan,
> Those are all great questions. They're all issues I have ideas about but
> I'd want Avro community input for as well. For that reason I answered them
> all on AVRO-2474 <https://issues.apache.org/jira/browse/AVRO-2474>
> Cheers!
> E
>
> On Tue, Jul 16, 2019 at 3:13 AM Ryan Skraba <r...@skraba.com> wrote:
>
>> Hello! I've been thinking about this and I generally like the idea of
>> stronger types with units :D
>>
>> I have some questions about what you are thinking of when you say "first
>> class concept" in Avro:
>> - Would you expect a writer schema that wrote a Fahrenheit field and a
>>   reader schema that reads Celsius to interact transparently with generic
>>   data?
>> - What about conversions that lose precision (i.e., if the above
>>   conversion was on an INT field)?
>> - How much of "unit" support should be mandatory in the spec for cross
>>   language operation? (a unit-aware Scala writer with a Fahrenheit field
>>   and a non-unit-aware reader with a Celsius field).
>> - To what degree would a generic reader of Avro data be required to
>>   support quantity wrappers (i.e. how can we opt-in/opt-out cleanly from
>>   being unit-aware)?
>>
>> At scale, I'd be particularly keen to see the conversion detection
>> (between two schemas / fields / quantities) take place once, and then the
>> calculation reused for all of the subsequent datum passing through, but
>> I'm not sure how that would work.
>>
>> We have some experience with passing a lot of client data through Avro,
>> and we use generic data quite a bit -- I'd be tempted to think of "float
>> (metres)" as a distinct type from "float (minutes)", but it would be a
>> huge (but potentially interesting) change for the way we look at data.
>> That being said, as far as units go, we see a lot more unitless values
>> (quantity of items, percents and other ratios, ratings).
>> The most frequent numeric values with units that we see are probably
>> money or geolocation (in practice, already normalized to lat/long --
>> although I just learned about UTM!). Surprisingly, there's not as much
>> SI-type unit data as you might expect.
>>
>> I can definitely see the value of using a "unit" annotation in a generated
>> specific record for a supported language -- as proven by your scala work!
>> That might be an easy first target while working out what a first-class
>> concept in the spec would entail. I missed Berlin Buzzwords by a day, but
>> enjoyed the video, thanks!
>>
>> Ryan
>>
>> On Tue, Jul 16, 2019 at 1:24 AM Erik Erlandson <eerla...@redhat.com> wrote:
>>
>> > If I'm interpreting the situation correctly, there is an "Avro
>> > Enhancement Proposal", but none have been filed in nearly a decade:
>> > https://cwiki.apache.org/confluence/display/AVRO/Avro+Enhancement+Proposals
>> >
>> > As a start, I submitted a jira to track this idea:
>> > https://issues.apache.org/jira/browse/AVRO-2474
>> >
>> > On Mon, Jul 8, 2019 at 10:42 AM Erik Erlandson <eerla...@redhat.com> wrote:
>> >
>> > > What should I do to move this forward? Does Avro have a PIP process?
>> > >
>> > > On Sat, Jun 29, 2019 at 3:26 PM Erik Erlandson <eerla...@redhat.com> wrote:
>> > >
>> > >> Regarding schema, my proposal for fingerprints would be that units
>> > >> are fingerprinted based on their canonical form, as defined here
>> > >> <http://erikerlandson.github.io/blog/2019/05/03/algorithmic-unit-analysis/>.
>> > >> Any two unit expressions having the same canonical form (including
>> > >> the corresponding coefficients) are exactly equivalent, and so their
>> > >> fingerprints can be the same.
>> > >> Possibly the unit could be stored on the schema in canonical form by
>> > >> convention, although canonical forms are frequently not as intuitive
>> > >> to humans and so in that case the documentation value of the unit
>> > >> might be reduced for humans examining the schema.
>> > >>
>> > >> For schema evolution, a unit change such that the previous and new
>> > >> unit are convertible (also defined as at the above link) would be
>> > >> well defined, and automatic transformation would just be the correct
>> > >> unit conversion (e.g. seconds to milliseconds). If the unit changes
>> > >> to a non-convertible unit (e.g. seconds to bytes) then no automatic
>> > >> transformation exists, and attempting to resolve the old and new
>> > >> schema would be an error. Note that establishing the conversion
>> > >> assumes that both original and new schemas are available at read
>> > >> time.
>> > >>
>> > >> On Sat, Jun 29, 2019 at 11:55 AM Niels Basjes <ni...@basj.es> wrote:
>> > >>
>> > >>> I think we should approach this idea in two parts:
>> > >>>
>> > >>> 1) The schema. Things like does a different unit mean a different
>> > >>> schema fingerprint even though the bytes remain the same. What does
>> > >>> a different unit mean for schema evolution.
>> > >>>
>> > >>> 2) Language specifics. Scala has different possibilities than Java.
>> > >>>
>> > >>> On Sat, Jun 29, 2019, 18:59 Erik Erlandson <eerla...@redhat.com> wrote:
>> > >>>
>> > >>> > I've been puzzling over what can be done to support this in more
>> > >>> > widely-used languages. The dilemma relative to the current language
>> > >>> > ecosystem is that languages with "modern" type systems (Haskell,
>> > >>> > Rust, Scala, etc) capable of supporting compile-time unit checking,
>> > >>> > in the particular style I've been exploring, are not yet widely
>> > >>> > used.
>> > >>> >
>> > >>> > With respect to Java, a couple approaches are plausible.
>> > >>> > One is to enhance the language, for example with Java-8 compiler
>> > >>> > plugins. Another might be to implement a unit type system similar
>> > >>> > to squants <https://github.com/typelevel/squants>. This style of
>> > >>> > unit type system is not as flexible or intuitive as what can be
>> > >>> > done with Scala's latest type system sorcery, but it would allow
>> > >>> > the community to build out a Java native type system that supports
>> > >>> > compile-time unit analysis. And its coverage of standard units
>> > >>> > could be made very good, as squants itself demonstrates.
>> > >>> >
>> > >>> > Python would also be a high-coverage target. I'm even less sure
>> > >>> > what to do for Python, as it has no compile-time type checking, but
>> > >>> > perhaps a squants-like Python class system would add value. Maybe
>> > >>> > Python's new type-hints feature could be leveraged?
>> > >>> >
>> > >>> > Regarding unit expression representation, I'm not unhappy with what
>> > >>> > I've prototyped in `coulomb-avro`, in broad strokes. It has
>> > >>> > deficiencies that would need addressing. It doesn't yet support
>> > >>> > standard unit abbreviations, nor does it understand plurals (e.g.
>> > >>> > it can parse "second" but not "seconds"). Since its "unit" field is
>> > >>> > just a custom metadata key, there is no enforcement. Parsers are
>> > >>> > currently instantiated via explicit lists of types, which is a
>> > >>> > property I like, but that may not work well in a world where
>> > >>> > multiple language bindings must be supported in a portable manner.
>> > >>> >
>> > >>> > On Sat, Jun 29, 2019 at 1:46 AM Niels Basjes <ni...@basj.es> wrote:
>> > >>> >
>> > >>> > > Hi,
>> > >>> > >
>> > >>> > > I attended your talk in Berlin and at the end I thought "too bad
>> > >>> > > this is only Scala".
>> > >>> > >
>> > >>> > > I think it's a good idea to have this in Avro.
>> > >>> > >
>> > >>> > > The details will be tricky: How to encode the units in the schema
>> > >>> > > for example.
>> > >>> > > Especially because of the automatic conversion you spoke about.
>> > >>> > >
>> > >>> > > Niels
>> > >>> > >
>> > >>> > > On Fri, Jun 28, 2019, 23:58 Erik Erlandson <eerla...@redhat.com> wrote:
>> > >>> > >
>> > >>> > > > Hi Avro community,
>> > >>> > > >
>> > >>> > > > Recently I have been experimenting with Avro schemas that are
>> > >>> > > > extended with a "unit" field. By "unit" I mean expressions like
>> > >>> > > > "second", or "megabyte" - that is "units of measure".
>> > >>> > > >
>> > >>> > > > I delivered a short talk on my experiments at Berlin Buzzwords,
>> > >>> > > > which can be viewed here:
>> > >>> > > > https://www.youtube.com/watch?v=qrQmB2KFKE8
>> > >>> > > > I also wrote a short blog post that may be faster to ingest:
>> > >>> > > > http://erikerlandson.github.io/blog/2019/05/23/unit-types-for-avro-schema-integrating-avro-with-coulomb/
>> > >>> > > >
>> > >>> > > > I received some audience interest in making this concept "first
>> > >>> > > > class" for Avro, and so I'm writing to see what the Avro dev
>> > >>> > > > community thinks of the idea. One issue is that this kind of
>> > >>> > > > unit checking is currently only available for Scala (and
>> > >>> > > > specifically Scala 2.13+).
>> > >>> > > >
>> > >>> > > > The Scala project itself is here:
>> > >>> > > > https://github.com/erikerlandson/coulomb
>> > >>> > > >
>> > >>> > > > Cheers,
>> > >>> > > > Erik
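
The schema-evolution rule quoted in this thread (convertible units resolve to a unit conversion, non-convertible units are an error) and the request to detect the conversion once per schema pair can be sketched roughly as below. This is only an illustrative sketch, not coulomb-avro's API and not anything specified in the AEP: the `CANONICAL` table, `canonical_form`, and `resolve_conversion` are hypothetical names, and the sketch assumes purely multiplicative units, so it does not cover offset conversions such as Fahrenheit to Celsius.

```python
from fractions import Fraction

# Hypothetical unit table (an assumption for illustration only):
# unit name -> (canonical base unit, coefficient relative to that base).
# A real implementation would derive this from a unit-expression parser.
CANONICAL = {
    "second": ("second", Fraction(1)),
    "millisecond": ("second", Fraction(1, 1000)),
    "ms": ("second", Fraction(1, 1000)),  # alias with the same canonical form
    "minute": ("second", Fraction(60)),
    "byte": ("byte", Fraction(1)),
    "kilobyte": ("byte", Fraction(1000)),
}


def canonical_form(unit):
    """Return (base unit, coefficient) for a unit name in the table."""
    return CANONICAL[unit]


def resolve_conversion(writer_unit, reader_unit):
    """Resolve a writer/reader unit pair once and return a reusable converter.

    Raises ValueError when the two units are not convertible, mirroring the
    "resolving the old and new schema would be an error" case.
    """
    w_base, w_coef = canonical_form(writer_unit)
    r_base, r_coef = canonical_form(reader_unit)
    if w_base != r_base:
        raise ValueError(f"cannot convert {writer_unit!r} to {reader_unit!r}")
    factor = w_coef / r_coef  # computed once, reused for every datum
    return lambda value: value * factor


# Resolve once at schema-resolution time, then apply to each value.
to_ms = resolve_conversion("second", "millisecond")
print(to_ms(3))  # 3000
```

Units whose canonical forms compare equal (here `"ms"` and `"millisecond"`) could share a fingerprint under the proposal quoted above. Exact rationals are used for the coefficient so the factor itself introduces no rounding; how to handle precision loss on integer fields is left open, as it is in the thread.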