I drafted an AEP (Avro Enhancement Proposal) for unit metadata on schemas:
https://docs.google.com/document/d/1IeVAtf6YcAAn35D4jmFQJjPpEMgEu79wWVMW37KvNps/


On Tue, Jul 16, 2019 at 1:35 PM Erik Erlandson <eerla...@redhat.com> wrote:

> Hi Ryan,
> Those are all great questions. I have ideas about each of them, but I'd
> want Avro community input as well, so I answered them all on AVRO-2474
> <https://issues.apache.org/jira/browse/AVRO-2474>.
> Cheers!
> E
>
> On Tue, Jul 16, 2019 at 3:13 AM Ryan Skraba <r...@skraba.com> wrote:
>
>> Hello!  I've been thinking about this and I generally like the idea of
>> stronger types with units :D
>>
>> I have some questions about what you are thinking of when you say "first
>> class concept" in Avro:
>> - Would you expect a writer schema that wrote a Fahrenheit field and a
>> reader schema that reads Celsius to interact transparently with generic
>> data?
>> - What about conversions that lose precision (e.g., if the above
>> conversion was on an INT field)?
>> - How much of "unit" support should be mandatory in the spec for
>> cross-language operation (e.g., a unit-aware Scala writer with a
>> Fahrenheit field and a non-unit-aware reader with a Celsius field)?
>> - To what degree would a generic reader of Avro data be required to
>> support quantity wrappers (i.e., how can we opt in or out cleanly from
>> being unit-aware)?
>>
>> At scale, I'd be particularly keen to see the conversion detection
>> (between two schemas / fields / quantities) take place once, and then the
>> calculation reused for every subsequent datum passing through, but I'm
>> not sure how that would work.
>>
>> We have some experience with passing a lot of client data through Avro,
>> and we use generic data quite a bit -- I'd be tempted to think of "float
>> (metres)" as a distinct type from "float (minutes)", but it would be a
>> huge (but potentially interesting) change for the way we look at data.
>> That being said, as far as units go, we see a lot more unitless values
>> (quantity of items, percents and other ratios, ratings).  The most
>> frequent numeric values with units that we see are probably money or
>> geolocation (in practice, already normalized to lat/long -- although I
>> just learned about UTM!).  Surprisingly, there's not as much SI-type unit
>> data as you might expect.
>>
>> I can definitely see the value of using a "unit" annotation in a generated
>> specific record for a supported language -- as proven by your Scala work!
>> That might be an easy first target while working out what a first-class
>> concept in the spec would entail.  I missed Berlin Buzzwords by a day, but
>> enjoyed the video, thanks!
>>
>> Ryan
>>
>>
>>
>> On Tue, Jul 16, 2019 at 1:24 AM Erik Erlandson <eerla...@redhat.com>
>> wrote:
>>
>> > If I'm interpreting the situation correctly, there is an "Avro
>> > Enhancement Proposal" process, but no proposals have been filed in
>> > nearly a decade:
>> > https://cwiki.apache.org/confluence/display/AVRO/Avro+Enhancement+Proposals
>> >
>> > As a start, I submitted a jira to track this idea:
>> > https://issues.apache.org/jira/browse/AVRO-2474
>> >
>> >
>> >
>> > On Mon, Jul 8, 2019 at 10:42 AM Erik Erlandson <eerla...@redhat.com>
>> > wrote:
>> >
>> > >
>> > > What should I do to move this forward? Does Avro have a PIP process?
>> > >
>> > >
>> > > On Sat, Jun 29, 2019 at 3:26 PM Erik Erlandson <eerla...@redhat.com>
>> > > wrote:
>> > >
>> > >>
>> > >> Regarding schema, my proposal for fingerprints would be that units
>> > >> are fingerprinted based on their canonical form, as defined here
>> > >> <http://erikerlandson.github.io/blog/2019/05/03/algorithmic-unit-analysis/>.
>> > >> Any two unit expressions having the same canonical form (including
>> > >> the corresponding coefficients) are exactly equivalent, and so their
>> > >> fingerprints can be the same. Possibly the unit could be stored on
>> > >> the schema in canonical form by convention, although canonical forms
>> > >> are frequently not as intuitive to humans and so in that case the
>> > >> documentation value of the unit might be reduced for humans examining
>> > >> the schema.
>> > >>
>> > >> For schema evolution, a unit change such that the previous and new
>> > >> units are convertible (also defined at the above link) would be well
>> > >> defined, and the automatic transformation would just be the correct
>> > >> unit conversion (e.g. seconds to milliseconds). If the unit changes to
>> > >> a non-convertible unit (e.g. seconds to bytes) then no automatic
>> > >> transformation exists, and attempting to resolve the old and new
>> > >> schemas would be an error. Note that establishing the conversion
>> > >> assumes that both original and new schemas are available at read time.
>> > >>
>> > >>
>> > >> On Sat, Jun 29, 2019 at 11:55 AM Niels Basjes <ni...@basj.es> wrote:
>> > >>
>> > >>> I think we should approach this idea in two parts:
>> > >>>
>> > >>> 1) The schema. For example, does a different unit mean a different
>> > >>> schema fingerprint even though the bytes remain the same? What does a
>> > >>> different unit mean for schema evolution?
>> > >>>
>> > >>> 2) Language specifics. Scala has different possibilities than Java.
>> > >>>
>> > >>> On Sat, Jun 29, 2019, 18:59 Erik Erlandson <eerla...@redhat.com>
>> > >>> wrote:
>> > >>>
>> > >>> > I've been puzzling over what can be done to support this in more
>> > >>> > widely-used languages. The dilemma relative to the current language
>> > >>> > ecosystem is that languages with "modern" type systems (Haskell,
>> > >>> > Rust, Scala, etc) capable of supporting compile-time unit checking,
>> > >>> > in the particular style I've been exploring, are not yet widely
>> > >>> > used.
>> > >>> >
>> > >>> > With respect to Java, a couple of approaches are plausible. One is
>> > >>> > to enhance the language, for example with Java 8 compiler plugins.
>> > >>> > Another might be to implement a unit type system similar to squants
>> > >>> > <https://github.com/typelevel/squants>. This style of unit type
>> > >>> > system is not as flexible or intuitive as what can be done with
>> > >>> > Scala's latest type system sorcery, but it would allow the community
>> > >>> > to build out a Java-native type system that supports compile-time
>> > >>> > unit analysis. And its coverage of standard units could be made very
>> > >>> > good, as squants itself demonstrates.
>> > >>> >
>> > >>> > Python would also be a high-coverage target. I'm even less sure what
>> > >>> > to do for Python, as it has no compile-time type checking, but
>> > >>> > perhaps a squants-like Python class system would add value. Maybe
>> > >>> > Python's new type-hints feature could be leveraged?
>> > >>> >
>> > >>> > Regarding unit expression representation, I'm not unhappy with what
>> > >>> > I've prototyped in `coulomb-avro`, in broad strokes. It has
>> > >>> > deficiencies that would need addressing. It doesn't yet support
>> > >>> > standard unit abbreviations, nor does it understand plurals (e.g. it
>> > >>> > can parse "second" but not "seconds"). Since its "unit" field is
>> > >>> > just a custom metadata key, there is no enforcement. Parsers are
>> > >>> > currently instantiated via explicit lists of types, which is a
>> > >>> > property I like, but that may not work well in a world where
>> > >>> > multiple language bindings must be supported in a portable manner.
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> > On Sat, Jun 29, 2019 at 1:46 AM Niels Basjes <ni...@basj.es>
>> > >>> > wrote:
>> > >>> >
>> > >>> > > Hi,
>> > >>> > >
>> > >>> > > I attended your talk in Berlin and at the end I thought "too bad
>> > >>> > > this is only Scala".
>> > >>> > >
>> > >>> > > I think it's a good idea to have this in Avro.
>> > >>> > >
>> > >>> > > The details will be tricky: how to encode the units in the schema,
>> > >>> > > for example, especially because of the automatic conversion you
>> > >>> > > spoke about.
>> > >>> > >
>> > >>> > > Niels
>> > >>> > >
>> > >>> > > On Fri, Jun 28, 2019, 23:58 Erik Erlandson <eerla...@redhat.com>
>> > >>> > > wrote:
>> > >>> > >
>> > >>> > > > Hi Avro community,
>> > >>> > > >
>> > >>> > > > Recently I have been experimenting with Avro schemas that are
>> > >>> > > > extended with a "unit" field. By "unit" I mean expressions like
>> > >>> > > > "second" or "megabyte", that is, "units of measure".
>> > >>> > > >
>> > >>> > > > I delivered a short talk on my experiments at Berlin Buzzwords,
>> > >>> > > > which can be viewed here:
>> > >>> > > > https://www.youtube.com/watch?v=qrQmB2KFKE8
>> > >>> > > > I also wrote a short blog post that may be faster to ingest:
>> > >>> > > > http://erikerlandson.github.io/blog/2019/05/23/unit-types-for-avro-schema-integrating-avro-with-coulomb/
>> > >>> > > >
>> > >>> > > > I received some audience interest in making this concept "first
>> > >>> > > > class" for Avro, and so I'm writing to see what the Avro dev
>> > >>> > > > community thinks of the idea. One issue is that this kind of unit
>> > >>> > > > checking is currently only available for Scala (and specifically
>> > >>> > > > Scala 2.13+).
>> > >>> > > >
>> > >>> > > > The Scala project itself is here:
>> > >>> > > > https://github.com/erikerlandson/coulomb
>> > >>> > > >
>> > >>> > > > Cheers,
>> > >>> > > > Erik
>> > >>> > > >
>> > >>> > >
>> > >>> >
>> > >>>
>> > >>
>> >
>>
>
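
For readers following the thread, here is a rough Python sketch of the
schema-evolution rule discussed in the quoted messages: units that share a
canonical form are convertible by a coefficient ratio, units that do not are
an error at schema-resolution time, and the conversion is worked out once
and then reused for every datum. This is not code from coulomb or Avro. The
UNITS table and the names canonical and make_converter are hypothetical,
chosen only for illustration, and the canonical form is simplified to a
coefficient plus base-unit exponents.

```python
from fractions import Fraction

# Hypothetical unit table (an assumption, not part of Avro or coulomb):
# unit name -> (coefficient relative to base units, base-unit exponents).
UNITS = {
    "second": (Fraction(1), {"second": 1}),
    "millisecond": (Fraction(1, 1000), {"second": 1}),
    "minute": (Fraction(60), {"second": 1}),
    "byte": (Fraction(1), {"byte": 1}),
    "kilobyte": (Fraction(1000), {"byte": 1}),
}


def canonical(unit):
    """Return a simplified canonical form: (coefficient, sorted exponents)."""
    coef, dims = UNITS[unit]
    return coef, tuple(sorted(dims.items()))


def make_converter(writer_unit, reader_unit):
    """Resolve a writer/reader unit pair once and return a reusable function.

    Raises ValueError when the units are not convertible, mirroring the
    "resolving the old and new schemas would be an error" case.
    """
    w_coef, w_dims = canonical(writer_unit)
    r_coef, r_dims = canonical(reader_unit)
    if w_dims != r_dims:
        raise ValueError(
            f"cannot convert {writer_unit!r} to {reader_unit!r}"
        )
    factor = w_coef / r_coef  # computed once, at resolution time
    return lambda value: value * factor


# Resolve once, then apply to each datum as it is read.
to_millis = make_converter("second", "millisecond")
converted = [to_millis(v) for v in (1, 2, 3)]
```

Using exact fractions for the coefficients sidesteps rounding in the factor
itself. It does not answer the question raised in the thread about
precision loss when the field type is an integer.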
