Hello!  I've been thinking about this and I generally like the idea of
stronger types with units :D

I have some questions about what you are thinking of when you say "first
class concept" in Avro:
- Would you expect a writer schema that wrote a Fahrenheit field and a
reader schema that reads Celsius to interact transparently with generic
data?
- What about conversions that lose precision (e.g., if the above conversion
was on an INT field)?
- How much of "unit" support should be mandatory in the spec for
cross-language operation?  (For example, a unit-aware Scala writer with a
Fahrenheit field and a non-unit-aware reader with a Celsius field.)
- To what degree would a generic reader of Avro data be required to support
quantity wrappers (i.e. how can we opt-in/opt-out cleanly from being
unit-aware)?
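
To make the precision question concrete, here is a minimal Python sketch of
what I mean. It is not Avro API; the function names are made up purely for
illustration, and rounding to nearest is just one assumed policy for an INT
field.

```python
def fahrenheit_to_celsius(f: float) -> float:
    """Affine conversion on a floating-point value."""
    return (f - 32.0) * 5.0 / 9.0


def fahrenheit_to_celsius_int(f: int) -> int:
    """Same conversion forced back into an INT field (assumed: round to nearest)."""
    return round(fahrenheit_to_celsius(f))


def celsius_to_fahrenheit_int(c: int) -> int:
    """Inverse conversion, also rounded back to an INT."""
    return round(c * 9.0 / 5.0 + 32.0)


# 69 F and 70 F are distinct values on the writer side...
a = fahrenheit_to_celsius_int(69)
b = fahrenheit_to_celsius_int(70)
# ...but both land on the same INT Celsius value on the reader side,
collapsed = (a == b)
# and converting back does not recover the original 69 F.
round_trip = celsius_to_fahrenheit_int(a)
```

With a float field the conversion is reversible up to floating-point error;
with an INT field two different writer values can become indistinguishable
to the reader, so the spec would probably need to say whether that is
allowed, an error, or a promotion to a wider type.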

At scale, I'd be particularly keen to see the conversion detection (between
two schemas / fields / quantities) take place once, with the resulting
calculation reused for every subsequent datum passing through, but I'm
not sure how that would work.
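
One possible shape for this, as a rough Python sketch under my own
assumptions: the UNITS table, resolve_conversion and read_all are all
hypothetical names, not anything in Avro or coulomb. The idea is to compare
the writer and reader units once, precompute a single scale and offset, and
then apply only that cheap function to each datum.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical unit table: name -> (dimension, scale, offset), where
# base_value = scale * value + offset (base units: kelvin, second).
UNITS: Dict[str, Tuple[str, float, float]] = {
    "kelvin": ("temperature", 1.0, 0.0),
    "celsius": ("temperature", 1.0, 273.15),
    "fahrenheit": ("temperature", 5.0 / 9.0, 459.67 * 5.0 / 9.0),
    "second": ("time", 1.0, 0.0),
    "millisecond": ("time", 0.001, 0.0),
}


def resolve_conversion(writer_unit: str, reader_unit: str) -> Callable[[float], float]:
    """Done once per (writer, reader) pair: check compatibility, fold constants."""
    w_dim, w_scale, w_offset = UNITS[writer_unit]
    r_dim, r_scale, r_offset = UNITS[reader_unit]
    if w_dim != r_dim:
        # e.g. seconds to celsius: no automatic transformation exists
        raise ValueError(f"cannot convert {writer_unit} to {reader_unit}")
    scale = w_scale / r_scale
    offset = (w_offset - r_offset) / r_scale
    if scale == 1.0 and offset == 0.0:
        return lambda v: v  # same unit: pass values through untouched
    return lambda v: scale * v + offset


def read_all(values: Iterable[float], writer_unit: str, reader_unit: str) -> Iterator[float]:
    """Resolve the conversion once, then reuse it for every datum."""
    convert = resolve_conversion(writer_unit, reader_unit)
    for v in values:
        yield convert(v)
```

For example, list(read_all(durations, "second", "millisecond")) would do the
unit lookup and compatibility check a single time, however many values pass
through. Whether this fits into the existing schema resolution step is
exactly the part I'm unsure about.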

We have some experience with passing a lot of client data through Avro, and
we use generic data quite a bit -- I'd be tempted to think of "float
(metres)" as a distinct type from "float (minutes)", but it would be a huge
(but potentially interesting) change for the way we look at data.  That
being said, as far as units go, we see a lot more unitless values (quantity
of items, percents and other ratios, ratings).  The most frequent numeric
values with units that we see are probably money or geolocation (in
practice, already normalized to lat/long -- although I just learned about
UTM!).  Surprisingly, there's not as much SI-type unit data as you might
expect.

I can definitely see the value of using a "unit" annotation in a generated
specific record for a supported language -- as proven by your Scala work!
That might be an easy first target while working out what a first-class
concept in the spec would entail.  I missed Berlin Buzzwords by a day, but
enjoyed the video, thanks!

Ryan



On Tue, Jul 16, 2019 at 1:24 AM Erik Erlandson <eerla...@redhat.com> wrote:

> If I'm interpreting the situation correctly, there is an "Avro Enhancement
> Proposal", but none have been filed in nearly a decade:
> https://cwiki.apache.org/confluence/display/AVRO/Avro+Enhancement+Proposals
>
> As a start, I submitted a jira to track this idea:
> https://issues.apache.org/jira/browse/AVRO-2474
>
>
>
> On Mon, Jul 8, 2019 at 10:42 AM Erik Erlandson <eerla...@redhat.com>
> wrote:
>
> >
> > What should I do to move this forward? Does Avro have a PIP process?
> >
> >
> > On Sat, Jun 29, 2019 at 3:26 PM Erik Erlandson <eerla...@redhat.com>
> > wrote:
> >
> >>
> >> Regarding schema, my proposal for fingerprints would be that units are
> >> fingerprinted based on their canonical form, as defined here
> >> <
> http://erikerlandson.github.io/blog/2019/05/03/algorithmic-unit-analysis/
> >.
> >> Any two unit expressions having the same canonical form (including the
> >> corresponding coefficients) are exactly equivalent, and so their
> >> fingerprints can be the same. Possibly the unit could be stored on the
> >> schema in canonical form by convention, although canonical forms are
> >> frequently not as intuitive to humans and so in that case the
> documentation
> >> value of the unit might be reduced for humans examining the schema.
> >>
> >> For schema evolution, a unit change such that the previous and new unit
> >> are convertible (as also defined at the above link) would be well
> defined,
> >> and automatic transformation would just be the correct unit conversion
> >> (e.g. seconds to milliseconds). If the unit changes to a non-convertible
> >> unit (e.g. seconds to bytes) then no automatic transformation exists,
> and
> >> attempting to resolve the old and new schema would be an error. Note
> that
> >> establishing the conversion assumes that both original and new schemas
> are
> >> available at read time.
> >>
> >>
> >> On Sat, Jun 29, 2019 at 11:55 AM Niels Basjes <ni...@basj.es> wrote:
> >>
> >>> I think we should approach this idea in two parts:
> >>>
> >>> 1) The schema. Things like does a different unit mean a different
> schema
> >>> fingerprint even though the bytes remain the same. What does a
> different
> >>> unit mean for schema evolution.
> >>>
> >>> 2) Language specifics. Scala has different possibilities than Java.
> >>>
> >>> On Sat, Jun 29, 2019, 18:59 Erik Erlandson <eerla...@redhat.com>
> wrote:
> >>>
> >>> > I've been puzzling over what can be done to support this in more
> >>> > widely-used languages. The dilemma relative to the current language
> >>> > ecosystem is that languages with "modern" type systems (Haskell,
> Rust,
> >>> > Scala, etc) capable of supporting compile-time unit checking, in the
> >>> > particular style I've been exploring, are not yet widely used.
> >>> >
> >>> > With respect to Java, a couple approaches are plausible. One is to
> >>> enhance
> >>> > the language, for example with Java-8 compiler plugins. Another might
> >>> be to
> >>> > implement a unit type system similar to squants
> >>> > <https://github.com/typelevel/squants>. This style of unit type
> >>> system is
> >>> > not as flexible or intuitive as what can be done with Scala's latest
> >>> type
> >>> > system sorcery, but it would allow the community to build out a Java
> >>> native
> >>> > type system that supports compile-time unit analysis. And its
> coverage
> >>> of
> >>> > standard units could be made very good, as squants itself
> demonstrates.
> >>> >
> >>> > Python would also be a high-coverage target. I'm even less sure what
> >>> to do
> >>> > for python, as it has no compile-time type checking, but perhaps a
> >>> > squants-like python class system would add value. Maybe python's new
> >>> > type-hints feature could be leveraged?
> >>> >
> >>> > Regarding unit expression representation, I'm not unhappy with what
> >>> I've
> >>> > prototyped in `coulomb-avro`, in broad strokes. It has deficiencies
> >>> that
> >>> > would need addressing. It doesn't yet support standard unit
> >>> abbreviations,
> >>> > nor does it understand plurals (e.g. it can parse "second" but not
> >>> > "seconds"). Since its "unit" field is just a custom metadata key,
> >>> there is
> >>> > no enforcement. Parsers are currently instantiated via explicit lists
> >>> of
> >>> > types, which is a property I like, but that may not work well in a
> >>> world
> >>> > where multiple language bindings must be supported in a portable
> >>> manner.
> >>> >
> >>> >
> >>> >
> >>> > On Sat, Jun 29, 2019 at 1:46 AM Niels Basjes <ni...@basj.es> wrote:
> >>> >
> >>> > > Hi,
> >>> > >
> >>> > > I attended your talk in Berlin and at the end I thought "too bad
> >>> this is
> >>> > > only Scala".
> >>> > >
> >>> > > I think it's a good idea to have this in Avro.
> >>> > >
> >>> > > The details will be tricky: How to encode the units in the schema
> for
> >>> > > example.
> >>> > > Especially because of the automatic conversion you spoke about.
> >>> > >
> >>> > > Niels
> >>> > >
> >>> > > On Fri, Jun 28, 2019, 23:58 Erik Erlandson <eerla...@redhat.com>
> >>> wrote:
> >>> > >
> >>> > > > Hi Avro community,
> >>> > > >
> >>> > > > Recently I have been experimenting with avro schema that are
> >>> extended
> >>> > > with
> >>> > > > a "unit" field. By "unit" I mean expressions like "second", or
> >>> > > "megabyte" -
> >>> > > > that is "units of measure".
> >>> > > >
> >>> > > > I delivered a short talk on my experiments at Berlin Buzzwords,
> >>> which
> >>> > can
> >>> > > > be viewed here:
> >>> > > > https://www.youtube.com/watch?v=qrQmB2KFKE8
> >>> > > > I also wrote a short blog post that may be faster to ingest:
> >>> > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> http://erikerlandson.github.io/blog/2019/05/23/unit-types-for-avro-schema-integrating-avro-with-coulomb/
> >>> > > >
> >>> > > > I received some audience interest in making this concept "first
> >>> class"
> >>> > > for
> >>> > > > avro, and so I'm writing to see what the avro dev community
> thinks
> >>> of
> >>> > the
> >>> > > > idea. One issue is that this kind of unit checking is currently
> >>> only
> >>> > > > available for Scala (and specifically Scala 2.13+).
> >>> > > >
> >>> > > > The Scala project itself is here:
> >>> > > > https://github.com/erikerlandson/coulomb
> >>> > > >
> >>> > > > Cheers,
> >>> > > > Erik
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
>
