[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886436#comment-16886436
 ] 

Erik Erlandson commented on AVRO-2474:
--------------------------------------

Questions & comments copied from Ryan Skraba, via email thread:

 
{quote}Would you expect a writer schema that wrote a Fahrenheit field and a
 reader schema that reads Celsius to interact transparently with generic
 data?
{quote}
(With the caveat that there is a lot I don't know about Avro) IIUC, the writer 
schema is saved with written data. So if a writer schema had 
"unit":"fahrenheit", and on input a reader schema had "unit":"celsius", then I 
would expect this to be detected and correctly converted, transparently.

Tangentially: "fahrenheit", "celsius" & "kelvin" are interesting because they 
might denote either a "unit" (a quantity of degrees) or an actual "temperature" 
(having a particular offset). In general, "temperature" is not the same thing 
as a "unit" of degrees, see 
[here|https://github.com/erikerlandson/coulomb#temperature-values]. The upshot 
is that there will be "temperature" attributes on a schema as well as "unit" of 
degrees. Similarly there will be "timestamp" and/or "date", as well as "unit" 
of time, although that is something I haven't added to coulomb yet.
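
To make the unit-versus-temperature distinction concrete, here is a minimal Python sketch. The function names are hypothetical and are not part of coulomb or Avro. It assumes only the standard Fahrenheit/Celsius relationship: a quantity of degrees converts by scale alone, while an actual temperature also needs the 32 degree offset.

```python
from fractions import Fraction

# Hypothetical helpers for illustration only; not coulomb or Avro APIs.

def fahrenheit_degrees_to_celsius_degrees(delta_f):
    """Convert a quantity (difference) of degrees: scale only, no offset."""
    return Fraction(delta_f) * Fraction(5, 9)

def fahrenheit_temperature_to_celsius(temp_f):
    """Convert an actual temperature: remove the 32 degree offset, then scale."""
    return (Fraction(temp_f) - 32) * Fraction(5, 9)

# A rise of 18 Fahrenheit degrees is a rise of 10 Celsius degrees...
print(float(fahrenheit_degrees_to_celsius_degrees(18)))  # 10.0
# ...but a temperature of 212 F is a temperature of 100 C.
print(float(fahrenheit_temperature_to_celsius(212)))     # 100.0
```

The same number on a field therefore converts differently depending on whether the schema marks it as a "unit" of degrees or as a "temperature".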
{quote}What about conversions that lose precision (i.e., if the above conversion
 was on an INT field)
{quote}
This is a great question; there is not one obvious policy. In coulomb, my 
default conversion policy is "best effort", which I 
[define|https://github.com/erikerlandson/coulomb/blob/develop/coulomb/src/main/scala/coulomb/unitops/unitops.scala#L183]
 as: translate the input and conversion factor to Rational, multiply, and then 
convert to Integer (or Long, etc). The tradeoff here is some compute cost. 
Other policies could be defined that are faster (and maybe even more aligned 
with standard float to int interactions). I do not have a strong opinion on 
this. I think I'd want to do whatever is most intuitive for most members of the 
community.
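
A rough Python sketch of the "best effort" idea described above, not coulomb's actual Scala code: the value and the conversion factor are both lifted to exact rationals, multiplied, and only then narrowed back to an integer. The function name is hypothetical, and truncation toward zero as the final step is an assumption made for this example.

```python
from fractions import Fraction

def convert_int_best_effort(value, factor):
    """Multiply exactly as rationals, then narrow to int.

    Assumption for this sketch: the final narrowing truncates toward zero,
    which is what int() does on a Fraction.
    """
    exact = Fraction(value) * Fraction(factor)
    return int(exact)

# Fahrenheit degrees to Celsius degrees uses the factor 5/9.
factor = Fraction(5, 9)
print(convert_int_best_effort(9, factor))   # 5  (exact)
print(convert_int_best_effort(7, factor))   # 3  (35/9 truncated)
```

The rational multiply avoids intermediate floating-point rounding at the price of extra arithmetic per value, which is the compute tradeoff mentioned above.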
{quote}How much of "unit" support should be mandatory in the spec for cross 
language operation?  (a unit-aware Scala writer with a Fahrenheit field and a 
non-unit-aware reader with a Celsius field) To what degree would a generic 
reader of Avro data be required to support quantity wrappers (i.e. how can we 
opt-in/opt-out cleanly from being unit-aware)?
{quote}
If the necessary "unit" information is present on both the writer schema and 
reader schema, then I believe this might be made "mandatory" across languages. 
The values themselves (in the code) might not have any unit types attached (as 
I support with coulomb), but the unit fields on the schema could be checked for 
compatibility and converted. In that sense, we might make actual unit-types in 
a language optional. This might be a way to provide meaningful support for 
languages that either can't or don't yet support a concept of unit type in the 
code itself. My tentative idea for a policy on this is: if data is written 
using a schema with units, then either the reader-schema must also provide a 
compatible unit, and/or the code must somehow specify the requested unit. 
Otherwise, a read error will be raised.
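
The tentative policy above could be sketched as follows in Python. Everything here is hypothetical: the `DIMENSIONS` table, the `UnitReadError` type and the function name are invented for illustration and are not part of Avro or coulomb. The choice to let a unit requested in code take precedence over the reader schema's unit is also an assumption.

```python
# Hypothetical table mapping each known unit to the dimension it measures.
DIMENSIONS = {
    "second": "time", "minute": "time",
    "meter": "length",
    "byte": "information", "megabyte": "information",
}

class UnitReadError(Exception):
    """Raised when unit-tagged data is read without a compatible unit."""

def compatible(a, b):
    dim = DIMENSIONS.get(a)
    return dim is not None and dim == DIMENSIONS.get(b)

def resolve_target_unit(writer_unit, reader_unit=None, requested_unit=None):
    """Pick the unit to convert into, or raise if the policy is violated."""
    if writer_unit is None:
        return None  # data was written without units; nothing to enforce
    candidates = [u for u in (reader_unit, requested_unit) if u is not None]
    if not candidates:
        raise UnitReadError(f"data has unit {writer_unit!r} but none was given on read")
    for unit in candidates:
        if not compatible(writer_unit, unit):
            raise UnitReadError(f"{unit!r} is not compatible with {writer_unit!r}")
    return candidates[-1]  # assumption: the requested unit wins if both are given

print(resolve_target_unit("second", reader_unit="minute"))  # minute
```

A reader that is not unit-aware in its own type system could still run this check purely on schema properties.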
{quote}At scale, I'd be particularly keen to see the conversion detection 
(between two schemas / fields / quantities) take place once, and then the 
calculation reused for all of the subsequent datum passing through, but I'm 
not sure how that would work
{quote}
In my current implementation, a unit conversion factor is computed once, and 
then it is cached on the schema itself, and detected via key lookup on 
subsequent reads. This actually was not nearly as slow as I'd feared, but it is 
still an extra key lookup per read. When the unit is coming from the read call 
(as I currently do it), I am not very sure how to do better. If the writer and 
reader schemas are being resolved in the Avro system itself, I can imagine better 
performance, equivalent to just checking a boolean per read. You can see what 
the current code does 
[here|https://github.com/erikerlandson/coulomb/blob/develop/coulomb-avro/src/main/scala/coulomb/avro/package.scala#L51].
 I'm optimistic that the Avro schema dev community might have good ideas here.
{quote}(quantity of items, percents and other ratios, ratings)
{quote}
My possibly-controversial position is that "items", percents and ratings *do* 
have implied units. Ratios of course are likely to be truly unitless, although 
measures such as angular degrees, radians, etc, are useful units that are secretly 
derived from "Unitless".

That is, we deal every day with many kinds of units, but because they are not 
SI-units, we are trained to not think of them as unit quantities. As a simple 
example, the length of a vector of Products is an integer value with unit 
"Product", and the length of a vector of People is an integer having unit 
"People", etc. In my talk, I made the same observation (very briefly) about 
kube objects like Node or Pod.

We always treat these values as unitless, but I'm increasingly convinced this 
is leaving useful information on the table. We don't do it because it hasn't 
been possible.

 

> Support a "unit" property of schema fields
> ------------------------------------------
>
>                 Key: AVRO-2474
>                 URL: https://issues.apache.org/jira/browse/AVRO-2474
>             Project: Apache Avro
>          Issue Type: Improvement
>          Components: spec
>    Affects Versions: 1.9.0
>            Reporter: Erik Erlandson
>            Priority: Major
>
> Recently I have been experimenting with avro schema that are extended with a 
> "unit" field. By "unit" I mean expressions like "second", or "megabyte" - 
> that is "units of measure".
>  
> I received some community interest in making this concept "first class" for 
> avro; I'm filing this JIRA to track the idea. 
>  
> I delivered a short talk on my experiments at Berlin Buzzwords, which can be 
> viewed here:
> [https://www.youtube.com/watch?v=qrQmB2KFKE8]
>  
> I also wrote a short blog post that may be faster to ingest:
> [http://erikerlandson.github.io/blog/2019/05/23/unit-types-for-avro-schema-integrating-avro-with-coulomb/]
>  
> The project itself is here:
> [https://github.com/erikerlandson/coulomb]
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
