[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2021-03-10 Thread Ryan Skraba (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298622#comment-17298622
 ] 

Ryan Skraba commented on AVRO-2474:
---

I'll bring this up on the mailing list again -- there was an interesting 
discussion on reviving the AEP process [back in 
April|https://lists.apache.org/thread.html/r9ec7d8801186d3242e6d535adb547ba5068f5a4e0202ec1bd5d8912a%40%3Cdev.avro.apache.org%3E]
 and, given the timing, we obviously should have linked the two together.

> Support a "unit" property of schema fields
> --
>
> Key: AVRO-2474
> URL: https://issues.apache.org/jira/browse/AVRO-2474
> Project: Apache Avro
>  Issue Type: Improvement
>  Components: spec
>Affects Versions: 1.9.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: aep
>
> Recently I have been experimenting with Avro schemas that are extended with a 
> "unit" field. By "unit" I mean expressions like "second" or "megabyte", 
> that is, "units of measure".
>  
> I received some community interest in making this concept "first class" for 
> avro; I'm filing this JIRA to track the idea. 
>  
> I delivered a short talk on my experiments at Berlin Buzzwords, which can be 
> viewed here:
> [https://www.youtube.com/watch?v=qrQmB2KFKE8]
>  
> I also wrote a short blog post that may be faster to ingest:
> [http://erikerlandson.github.io/blog/2019/05/23/unit-types-for-avro-schema-integrating-avro-with-coulomb/]
>  
> The project itself is here:
> [https://github.com/erikerlandson/coulomb]
>  
>  
>  
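As a rough illustration of the idea in the description, the sketch below parses a record schema whose fields carry a "unit" property and collects the declared units. The record name, the field names, and the placement of "unit" directly on the field object are assumptions made for this example; the thread does not fix the exact shape.

```python
import json

# Hypothetical schema: the "unit" placement on each field is an assumption.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Download",
  "fields": [
    {"name": "size", "type": "long", "unit": "megabyte"},
    {"name": "duration", "type": "double", "unit": "second"},
    {"name": "url", "type": "string"}
  ]
}
"""

def field_units(schema):
    """Map each field name to its declared unit, skipping fields without one."""
    return {f["name"]: f["unit"] for f in schema["fields"] if "unit" in f}

units = field_units(json.loads(SCHEMA_JSON))
print(units)  # {'size': 'megabyte', 'duration': 'second'}
```

Fields without a "unit" property (here, "url") are simply left out, which matches the idea of units being optional metadata.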



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2021-03-08 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297494#comment-17297494
 ] 

Erik Erlandson commented on AVRO-2474:
--

[~rskraba]  I am definitely continuing to work on coulomb.  The next big 
development push will be the move to scala-3.

I submitted a draft AEP: 
[https://docs.google.com/document/d/1IeVAtf6YcAAn35D4jmFQJjPpEMgEu79wWVMW37KvNps/edit]

As far as I know, there was no feedback on it, or voting.

I like the idea of being able to consume support for unit analysis as a 
plug-in.  If people are interested, I can look into it.



[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2021-03-08 Thread Ryan Skraba (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297252#comment-17297252
 ] 

Ryan Skraba commented on AVRO-2474:
---

Time in source status: 599d 16h 35m  :/

I want to say that I appreciate the engineering work that went into the 
implementation and the proposal, and I watched the video.  Is development work 
continuing on [Coulomb|https://github.com/erikerlandson/coulomb]?  It's a 
project that deserves a shout out!

On our end, and in my experience, unit analysis and metadata in the schema 
don't fit well with our company's use of Avro for persistence and data 
transfer.  We do things like "semantic typing" (some units and other 
categories) and filtering, and I've talked about your proposal internally as 
future work, but as it stands, we just do all of our work in our toolkit on top 
of Avro.

Do you think we could put this work in the wiki as [AEP 
104|https://cwiki.apache.org/confluence/display/AVRO/Avro+Enhancement+Proposals]
 with links to Coulomb and this JIRA for future reference?  Was it ever voted 
on?  I could only find this [original 
discussion|https://lists.apache.org/x/list.html?dev@avro.apache.org:gte=1d:units].

Another idea -- there's a different JIRA AVRO-2952 that also adds a lot of 
custom processing (for DI-like annotations).  It might be worthwhile taking a 
look to see what we would need to be able to specify things like "units" and 
"di-annotations" as an opt-in part of the spec with some sort of extension 
framework or entrypoint.  If it were "pluggable" instead of part of core, it 
would be easier to adopt and innovate.



[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2020-07-23 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163837#comment-17163837
 ] 

Erik Erlandson commented on AVRO-2474:
--

[~iemejia]  [~rskraba]  how should this proceed?

 



[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2020-05-26 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116783#comment-17116783
 ] 

Erik Erlandson commented on AVRO-2474:
--

[~rskraba]  [~iemejia]   I agree that this feature ought to be voted on, or 
otherwise formally discussed by the community. It involves a few hundred lines 
of new code (per language).  I submitted pr #841 so that the community could 
see how this feature would work, and what the implementation entails, etc, so 
people have something "real" to make a decision with, instead of voting on an 
abstract AEP doc.

It involves more than just code - as you can see in the PR I need to define a 
JSON (sub)schema for expressing units, and while I'm happy with the shape of 
the current proposal, there are multiple design choices that might be made here.

Lastly, adopting this implies committing to multiple language implementations, 
not just python. The good news is that it can easily enough be implemented on a 
per-language basis, but eventually implementations would be needed for at least 
some popular subset of the avro language bindings (I'm guessing at least 
python, java and c++).

I do not know what the timeline for 1.10 branch-cut is.  In theory I could 
massage #841 into a merge-able state fairly quickly, but it is a high-impact 
feature and I don't really want to rush it.



[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2020-05-26 Thread Ismaël Mejía (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116572#comment-17116572
 ] 

Ismaël Mejía commented on AVRO-2474:


I think that since this is an Avro Enhancement Proposal (aep label), we also 
require consensus on the feature, so it is worth discussing (and voting on) 
this on the mailing list.



[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2020-05-26 Thread Ryan Skraba (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116571#comment-17116571
 ] 

Ryan Skraba commented on AVRO-2474:
---

Hello!  I've removed the 1.10.0 fix target for this new feature.  Is that OK?  
This is a pretty major (and neat) feature to add and it doesn't look like it's 
going to be ready for when we cut the branch...

I think we'd all love to see new and interesting features move forward.  Maybe 
it would be a good idea to create some actionable subtasks that can be 
completed progressively?



[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2019-09-17 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931958#comment-16931958
 ] 

Erik Erlandson commented on AVRO-2474:
--

I have written up a draft of an Avro Enhancement Proposal that describes a 
roadmap where unit expression parsing and conversion are done as part of 
resolving reader schemas with writer schemas:

[Avro Enhancement Proposal (AEP): Unit 
Metadata|https://docs.google.com/document/d/1IeVAtf6YcAAn35D4jmFQJjPpEMgEu79wWVMW37KvNps/]

 



[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2019-07-16 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886436#comment-16886436
 ] 

Erik Erlandson commented on AVRO-2474:
--

Questions & comments copied from Ryan Skraba, via email thread:

 
{quote}Would you expect a writer schema that wrote a Fahrenheit field and a
 reader schema that reads Celsius to interact transparently with generic
 data?
{quote}
(With the caveat that there is a lot I don't know about Avro) IIUC, the writer 
schema is saved with written data. So if a writer schema had 
"unit":"fahrenheit", and on input a reader schema had "unit":"celsius", then I 
would expect this to be detected and correctly converted, transparently.

Tangentially: "fahrenheit", "celsius" & "kelvin" are interesting because they 
might denote either a "unit" (a quantity of degrees) or an actual "temperature" 
(having a particular offset). In general, "temperature" is not the same thing 
as a "unit" of degrees, see 
[here|https://github.com/erikerlandson/coulomb#temperature-values]. The upshot 
is that there will be "temperature" attributes on a schema as well as "unit" of 
degrees. Similarly there will be "timestamp" and/or "date", as well as "unit" 
of time, although that is something I haven't added to coulomb yet.
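To make the unit-versus-temperature distinction concrete, here is a minimal sketch. The function names are hypothetical and this is not coulomb's API; it only shows that a quantity of degrees converts by scale alone, while a temperature reading also needs the offset.

```python
from fractions import Fraction

# Hypothetical helpers for illustration; not part of Avro or coulomb.
F_PER_C = Fraction(9, 5)  # one Celsius degree spans 9/5 Fahrenheit degrees

def fahrenheit_degrees_to_celsius(delta_f):
    """Convert a quantity of degrees (a difference): scale only."""
    return Fraction(delta_f) / F_PER_C

def fahrenheit_temperature_to_celsius(temp_f):
    """Convert a temperature (a point on the scale): remove the offset, then scale."""
    return (Fraction(temp_f) - 32) / F_PER_C

print(fahrenheit_degrees_to_celsius(18))       # 10
print(fahrenheit_temperature_to_celsius(212))  # 100
print(fahrenheit_temperature_to_celsius(32))   # 0
```

The same input value gives different results depending on which of the two meanings the schema declares, which is why the two attributes need to be kept apart.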
{quote}What about conversions that lose precision (i.e., if the above conversion
 was on an INT field)
{quote}
This is a great question; there is not one obvious policy. In coulomb, my 
default conversion policy is "best effort", which I 
[define|https://github.com/erikerlandson/coulomb/blob/develop/coulomb/src/main/scala/coulomb/unitops/unitops.scala#L183]
 as: translate the input and conversion factor to Rational, multiply, and then 
convert to Integer (or Long, etc). The tradeoff here is some compute cost. 
Other policies could be defined that are faster (and maybe even more aligned 
with standard float to int interactions). I do not have a strong opinion on 
this. I think I'd want to do whatever is most intuitive for most members of the 
community.
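A small sketch of the "best effort" policy described above, using Python's exact rationals. The function name is hypothetical, and truncating toward zero in the final step is an assumption of this example; coulomb's actual rounding behaviour should be checked in the linked source.

```python
from fractions import Fraction

def convert_int_best_effort(value, factor):
    """Multiply an integer by an exact rational factor, then convert back to int.

    Assumption: the final step truncates toward zero.
    """
    exact = Fraction(value) * Fraction(factor)
    return int(exact)  # int() on a Fraction truncates toward zero

# seconds -> minutes on an integer field: the factor is exactly 1/60
print(convert_int_best_effort(150, Fraction(1, 60)))  # 2
# minutes -> seconds loses nothing
print(convert_int_best_effort(2, 60))  # 120
```

The rational arithmetic keeps the intermediate result exact, so the only precision loss is the single, explicit conversion back to an integer.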
{quote}How much of "unit" support should be mandatory in the spec for cross 
language operation?  (a unit-aware Scala writer with a Fahrenheit field and a 
non-unit-aware reader with a Celsius field) To what degree would a generic 
reader of Avro data be required to support quantity wrappers (i.e. how can we 
opt-in/opt-out cleanly from being unit-aware)?
{quote}
If the necessary "unit" information is present on both the writer schema and 
reader schema, then I believe this might be made "mandatory" across languages. 
The values themselves (in the code) might not have any unit types attached (as 
I support with coulomb), but the unit fields on the schema could be checked for 
compatibility and converted. In that sense, we might make actual unit-types in 
a language optional. This might be a way to provide meaningful support for 
languages that either can't or don't yet support a concept of unit type in the 
code itself. My tentative idea for a policy on this is: if data is written 
using a schema with units, then either the reader-schema must also provide a 
compatible unit, and/or the code must somehow specify the requested unit. 
Otherwise, a read error will be raised.
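The tentative policy above could be sketched roughly as follows. Everything here is hypothetical: the error class, the tiny dimension table, and the choice to let a unit requested in code take precedence over the reader schema's unit are assumptions made for this example only.

```python
class UnitReadError(Exception):
    """Raised when no usable unit is available for a unit-annotated field."""

# Hypothetical compatibility table: unit name -> dimension it measures.
DIMENSION = {"second": "time", "millisecond": "time", "byte": "information"}

def resolve_read_unit(writer_unit, reader_unit=None, requested_unit=None):
    """Pick the unit to read into, or raise if the policy is not satisfied."""
    if writer_unit is None:
        # Data was written without a unit: nothing to check.
        return requested_unit or reader_unit
    # Assumption: a unit requested in code wins over the reader schema's unit.
    target = requested_unit or reader_unit
    if target is None:
        raise UnitReadError(f"written with unit {writer_unit!r}, but none requested")
    writer_dim = DIMENSION.get(writer_unit)
    if writer_dim is None or writer_dim != DIMENSION.get(target):
        raise UnitReadError(f"cannot convert {writer_unit!r} to {target!r}")
    return target

print(resolve_read_unit("second", reader_unit="millisecond"))  # millisecond
try:
    resolve_read_unit("second")
except UnitReadError as err:
    print("read error:", err)
```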
{quote}At scale, I'd be particularly keen to see the conversion detection 
(between
 two schemas / fields / quantities) take place once, and then the
 calculation reused for all of the subsequent datum passing through, but I'm
 not sure how that would work
{quote}
In my current implementation, a unit conversion factor is computed once, and 
then it is cached on the schema itself, and detected via key lookup on 
subsequent reads. This actually was not nearly as slow as I'd feared, but it is 
still an extra key lookup per read. When the unit is coming from the read call 
(as I currently do it), I am not sure how to do better. If the writer and 
reader schemas are being resolved in the Avro system itself, I can imagine better 
performance, equivalent to just checking a boolean per read. You can see what 
the current code does 
[here|https://github.com/erikerlandson/coulomb/blob/develop/coulomb-avro/src/main/scala/coulomb/avro/package.scala#L51].
 I'm optimistic that the Avro schema dev community might have good ideas here.
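One way to picture the "compute once, check a boolean per read" idea is the sketch below. It is not the coulomb-avro code linked above; the class, the unit table, and the identity flag are assumptions for illustration.

```python
from fractions import Fraction

# Hypothetical table: unit name -> (dimension, size in base units).
UNITS = {
    "second": ("time", Fraction(1)),
    "millisecond": ("time", Fraction(1, 1000)),
    "minute": ("time", Fraction(60)),
}

class ResolvedField:
    """Pairs a writer unit with a reader unit and computes the factor once."""

    def __init__(self, writer_unit, reader_unit):
        w_dim, w_size = UNITS[writer_unit]
        r_dim, r_size = UNITS[reader_unit]
        if w_dim != r_dim:
            raise ValueError(f"{writer_unit} is not convertible to {reader_unit}")
        self.factor = w_size / r_size        # computed once, at resolution time
        self.is_identity = self.factor == 1  # lets each read skip the multiply

    def read(self, raw):
        # Per-datum cost is one boolean check, plus a multiply when needed.
        return raw if self.is_identity else raw * self.factor

field = ResolvedField("minute", "millisecond")
print([int(field.read(v)) for v in (1, 2)])  # [60000, 120000]
```

Doing the work in the constructor mirrors resolving writer and reader schemas up front, so no per-read key lookup is needed.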
{quote}(quantity of items, percents and other ratios, ratings)
{quote}
My possibly controversial position is that "items", percents and ratings *do* 
have implied units. Ratios of course are likely to be truly unitless, although 
measures such as angular degrees, radians, etc., are useful units that are 
secretly derived from "Unitless".

That is, we deal every day with many kinds of units, but because they are not 
SI-units, we are trained to not think of them as unit quantities. As a simple 
example, the length of a vector of Products is an integer value with unit 
"Product", and the length of a vector of People is an integer having unit 
"People", etc. In my 

[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2019-07-15 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885691#comment-16885691
 ] 

Erik Erlandson commented on AVRO-2474:
--

(copied from email thread)

 
Regarding schema, my proposal for fingerprints would be that units are 
fingerprinted based on their canonical form, as [defined 
here|http://erikerlandson.github.io/blog/2019/05/03/algorithmic-unit-analysis/].
 Any two unit expressions having the same canonical form (including the 
corresponding coefficients) are exactly equivalent, and so their fingerprints 
can be the same. Possibly the unit could be stored on the schema in canonical 
form by convention, although canonical forms are frequently not as intuitive to 
humans and so in that case the documentation value of the unit might be reduced 
for humans examining the schema.
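A simplified sketch of fingerprinting by canonical form follows. The representation used here (a rational coefficient plus sorted base-unit exponents), the unit definitions, and the choice of SHA-256 are all assumptions for illustration; the linked post gives the actual definition of canonical form.

```python
import hashlib
from fractions import Fraction

# Hypothetical definitions: name -> (coefficient, {base unit: exponent}).
DEFS = {
    "second": (Fraction(1), {"second": 1}),
    "millisecond": (Fraction(1, 1000), {"second": 1}),
    "kilohertz": (Fraction(1000), {"second": -1}),
}

def canonical(terms):
    """Reduce [(unit, exponent), ...] to (coefficient, sorted base-unit exponents)."""
    coef, exps = Fraction(1), {}
    for name, power in terms:
        c, base = DEFS[name]
        coef *= c ** power
        for b, e in base.items():
            exps[b] = exps.get(b, 0) + e * power
    return coef, tuple(sorted((b, e) for b, e in exps.items() if e != 0))

def unit_fingerprint(terms):
    """Hash the canonical form, so equivalent unit expressions hash alike."""
    coef, exps = canonical(terms)
    text = f"{coef}|" + "*".join(f"{b}^{e}" for b, e in exps)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# "kilohertz" and "millisecond^-1" share one canonical form: 1000 * second^-1
print(unit_fingerprint([("kilohertz", 1)]) == unit_fingerprint([("millisecond", -1)]))
```

Because the coefficient is part of the hashed text, "second^-1" and "millisecond^-1" get different fingerprints even though they are convertible.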
 
For schema evolution, a unit change such that the previous and new units are 
convertible (also defined at the above link) would be well defined, and the 
automatic transformation would just be the correct unit conversion (e.g. 
seconds to milliseconds). If the unit changes to a non-convertible unit (e.g. 
seconds to bytes) then no automatic transformation exists, and attempting to 
resolve the old and new schemas would be an error. Note that establishing the 
conversion assumes that both the original and new schemas are available at read 
time.
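The two evolution cases can be illustrated with a short sketch. The table of canonical forms and the function name are hypothetical; the point is only that convertible units yield a factor and non-convertible units yield an error at resolution time.

```python
from fractions import Fraction

# Hypothetical canonical forms: unit -> (coefficient, base unit).
CANONICAL = {
    "second": (Fraction(1), "second"),
    "millisecond": (Fraction(1, 1000), "second"),
    "byte": (Fraction(1), "byte"),
}

def evolution_factor(old_unit, new_unit):
    """Factor that rewrites values stored in old_unit as values in new_unit."""
    old_coef, old_base = CANONICAL[old_unit]
    new_coef, new_base = CANONICAL[new_unit]
    if old_base != new_base:
        raise ValueError(f"no conversion from {old_unit} to {new_unit}")
    return old_coef / new_coef

print(evolution_factor("second", "millisecond"))  # 1000
try:
    evolution_factor("second", "byte")
except ValueError as err:
    print("schema resolution error:", err)
```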
 
