[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields
[ https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298622#comment-17298622 ]

Ryan Skraba commented on AVRO-2474:
-----------------------------------

I'll bring this up on the mailing list again -- there was an interesting discussion on reviving the AEP process [back in April|https://lists.apache.org/thread.html/r9ec7d8801186d3242e6d535adb547ba5068f5a4e0202ec1bd5d8912a%40%3Cdev.avro.apache.org%3E] and, given the timing, we obviously should have linked the two together.

> Support a "unit" property of schema fields
> ------------------------------------------
>
>                 Key: AVRO-2474
>                 URL: https://issues.apache.org/jira/browse/AVRO-2474
>             Project: Apache Avro
>          Issue Type: Improvement
>          Components: spec
>    Affects Versions: 1.9.0
>            Reporter: Erik Erlandson
>            Priority: Major
>              Labels: aep
>
> Recently I have been experimenting with avro schema that are extended with a
> "unit" field. By "unit" I mean expressions like "second", or "megabyte" -
> that is "units of measure".
>
> I received some community interest in making this concept "first class" for
> avro; I'm filing this JIRA to track the idea.
>
> I delivered a short talk on my experiments at Berlin Buzzwords, which can be
> viewed here:
> [https://www.youtube.com/watch?v=qrQmB2KFKE8]
>
> I also wrote a short blog post that may be faster to ingest:
> [http://erikerlandson.github.io/blog/2019/05/23/unit-types-for-avro-schema-integrating-avro-with-coulomb/]
>
> The project itself is here:
> [https://github.com/erikerlandson/coulomb]

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields
[ https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297494#comment-17297494 ]

Erik Erlandson commented on AVRO-2474:
--------------------------------------

[~rskraba] I am definitely continuing to work on coulomb. The next big development push will be the move to Scala 3.

I submitted a draft AEP: [https://docs.google.com/document/d/1IeVAtf6YcAAn35D4jmFQJjPpEMgEu79wWVMW37KvNps/edit]

As far as I know, it received no feedback and was never voted on.

I like the idea of being able to consume support for unit analysis as a plug-in. If people are interested, I can look into it.
[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields
[ https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297252#comment-17297252 ]

Ryan Skraba commented on AVRO-2474:
-----------------------------------

Time in source status: 599d 16h 35m :/

I want to say that I appreciate the engineering work that went into the implementation and the proposal, and I watched the video. Is development work continuing on [Coulomb|https://github.com/erikerlandson/coulomb]? It's a project that deserves a shout-out!

On our end, and in my experience, unit analysis and metadata in the schema don't fit well with our company's use of Avro for persistence and data transfer. We do things like "semantic typing" (some units and other categories) and filtering, and I've talked about your proposal internally as future work, but as it stands, we just do all of our work in our toolkit on top of Avro.

Do you think we could put this work in the wiki as [AEP 104|https://cwiki.apache.org/confluence/display/AVRO/Avro+Enhancement+Proposals] with links to Coulomb and this JIRA for future reference? Was it ever voted on? I could only find this [original discussion|https://lists.apache.org/x/list.html?dev@avro.apache.org:gte=1d:units].

Another idea -- there's a different JIRA, AVRO-2952, that also adds a lot of custom processing (for DI-like annotations). It might be worthwhile to look at what we would need in order to specify things like "units" and "di-annotations" as an opt-in part of the spec, with some sort of extension framework or entrypoint. If it were "pluggable" instead of part of core, it would be easier to adopt and innovate.
[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields
[ https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163837#comment-17163837 ]

Erik Erlandson commented on AVRO-2474:
--------------------------------------

[~iemejia] [~rskraba] how should this proceed?
[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields
[ https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116783#comment-17116783 ]

Erik Erlandson commented on AVRO-2474:
--------------------------------------

[~rskraba] [~iemejia] I agree that this feature ought to be voted on, or otherwise formally discussed by the community. It involves a few hundred lines of new code (per language).

I submitted PR #841 so that the community could see how this feature would work and what the implementation entails, so that people have something "real" to base a decision on, instead of voting on an abstract AEP doc.

It involves more than just code: as you can see in the PR, I need to define a JSON (sub)schema for expressing units, and while I'm happy with the shape of the current proposal, there are multiple design choices that might be made here.

Lastly, adopting this implies committing to multiple language implementations, not just Python. The good news is that it can easily enough be implemented on a per-language basis, but eventually implementations would be needed for at least some popular subset of the Avro language bindings (I'm guessing at least Python, Java and C++).

I do not know what the timeline for the 1.10 branch cut is. In theory I could massage #841 into a mergeable state fairly quickly, but it is a high-impact feature and I don't really want to rush it.
[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields
[ https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116572#comment-17116572 ]

Ismaël Mejía commented on AVRO-2474:
------------------------------------

I think that since this is an Avro Enhancement Proposal (aep label), we also require consensus on the feature, so it is worth discussing (and voting on) it on the mailing list.
[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields
[ https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116571#comment-17116571 ]

Ryan Skraba commented on AVRO-2474:
-----------------------------------

Hello! I've removed the 1.10.0 fix target for this new feature. Is that OK? This is a pretty major (and neat) feature to add and it doesn't look like it's going to be ready for when we cut the branch...

I think we'd all love to see new and interesting features move forward. Maybe it would be a good idea to create some actionable subtasks that can be completed progressively?
[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields
[ https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931958#comment-16931958 ]

Erik Erlandson commented on AVRO-2474:
--------------------------------------

I have written up a draft of an Avro Enhancement Proposal that describes a roadmap where unit expression parsing and conversion are done as part of resolving reader schemas with writer schemas:

[Avro Enhancement Proposal (AEP): Unit Metadata|https://docs.google.com/document/d/1IeVAtf6YcAAn35D4jmFQJjPpEMgEu79wWVMW37KvNps/]
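To make the idea of unit-aware schema resolution concrete, here is a minimal Python sketch. It is an illustration only, not the AEP's design and not an Avro API: the `UNITS` table, the `resolve_unit_factors` helper, and the sample schemas are all hypothetical, and the schemas are handled as plain dicts rather than through an Avro library. What to do when only one side carries a unit is deliberately left out.

```python
from fractions import Fraction

# Hypothetical unit table: name -> (dimension, size in that dimension's base unit).
UNITS = {
    "second": ("time", Fraction(1)),
    "millisecond": ("time", Fraction(1, 1000)),
    "minute": ("time", Fraction(60)),
    "byte": ("information", Fraction(1)),
    "megabyte": ("information", Fraction(10**6)),
}


def resolve_unit_factors(writer_schema, reader_schema):
    """Return {field name: factor} that maps writer values to reader units.

    Raises ValueError when the two units measure different dimensions.
    """
    writer_units = {f["name"]: f.get("unit") for f in writer_schema["fields"]}
    factors = {}
    for field in reader_schema["fields"]:
        name, r_unit = field["name"], field.get("unit")
        w_unit = writer_units.get(name)
        if w_unit is None or r_unit is None:
            continue  # unit missing on one side: policy not covered by this sketch
        w_dim, w_size = UNITS[w_unit]
        r_dim, r_size = UNITS[r_unit]
        if w_dim != r_dim:
            raise ValueError(f"{name}: cannot convert {w_unit} to {r_unit}")
        factors[name] = w_size / r_size
    return factors


# Hypothetical writer and reader schemas that differ only in their units.
writer = {"type": "record", "name": "Job", "fields": [
    {"name": "latency", "type": "double", "unit": "second"},
    {"name": "size", "type": "long", "unit": "megabyte"},
]}
reader = {"type": "record", "name": "Job", "fields": [
    {"name": "latency", "type": "double", "unit": "millisecond"},
    {"name": "size", "type": "long", "unit": "byte"},
]}

# The factors are computed once per schema pair, then applied to each record.
factors = resolve_unit_factors(writer, reader)
record = {"latency": 1.5, "size": 2}
converted = {k: v * factors.get(k, 1) for k, v in record.items()}
```

The point of the sketch is that the conversion is derived from the two schemas alone, so application code never has to mention units.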
[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields
[ https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886436#comment-16886436 ]

Erik Erlandson commented on AVRO-2474:
--------------------------------------

Questions and comments copied from Ryan Skraba, via email thread:

{quote}Would you expect a writer schema that wrote a Fahrenheit field and a reader schema that reads Celsius to interact transparently with generic data? {quote}

(With the caveat that there is a lot I don't know about Avro) IIUC, the writer schema is saved with written data. So if a writer schema had "unit":"fahrenheit", and on input a reader schema had "unit":"celsius", then I would expect this to be detected and correctly converted, transparently.

Tangentially: "fahrenheit", "celsius" and "kelvin" are interesting because they might denote either a "unit" (a quantity of degrees) or an actual "temperature" (having a particular offset). In general, "temperature" is not the same thing as a "unit" of degrees; see [here|https://github.com/erikerlandson/coulomb#temperature-values]. The upshot is that there will be "temperature" attributes on a schema as well as "unit" of degrees. Similarly there will be "timestamp" and/or "date", as well as "unit" of time, although that is something I haven't added to coulomb yet.

{quote}What about conversions that lose precision (i.e., if the above conversion was on an INT field) {quote}

This is a great question; there is not one obvious policy. In coulomb, my default conversion policy is "best effort", which I [define|https://github.com/erikerlandson/coulomb/blob/develop/coulomb/src/main/scala/coulomb/unitops/unitops.scala#L183] as: translate the input and conversion factor to Rational, multiply, and then convert to Integer (or Long, etc). The tradeoff here is some compute cost. Other policies could be defined that are faster (and maybe even more aligned with standard float to int interactions). I do not have a strong opinion on this.
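The "best effort" policy and the unit-versus-temperature distinction described above can be illustrated in Python, with the standard library's `fractions.Fraction` standing in for a Scala Rational. This is a sketch of the idea only, not coulomb's code: both function names are hypothetical, and truncation toward zero is just one assumed choice for the final conversion to an integer.

```python
from fractions import Fraction

# As a unit (a size of degree), Fahrenheit to Celsius is a pure scale factor.
F_TO_C_DEGREES = Fraction(5, 9)


def convert_int_best_effort(value: int, factor: Fraction) -> int:
    """Multiply exactly as rationals, then drop the fractional part."""
    exact = Fraction(value) * factor  # exact arithmetic, no floating-point error
    return int(exact)  # int() truncates toward zero (assumed policy)


def fahrenheit_to_celsius_temperature(value: int) -> int:
    """Convert a temperature reading, which needs the 32-degree offset too."""
    exact = (Fraction(value) - 32) * F_TO_C_DEGREES
    return int(exact)


# A difference of 18 Fahrenheit degrees is exactly 10 Celsius degrees.
delta = convert_int_best_effort(18, F_TO_C_DEGREES)

# 20 Fahrenheit degrees is 100/9 Celsius degrees; the fractional part is lost.
lossy = convert_int_best_effort(20, F_TO_C_DEGREES)

# A temperature of 212 Fahrenheit is 100 Celsius once the offset is applied.
boiling = fahrenheit_to_celsius_temperature(212)
```

A faster policy could multiply by a floating-point factor instead, at the cost of rounding behavior that depends on the factor's binary representation.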
I think I'd want to do whatever is most intuitive for most members of the community.

{quote}How much of "unit" support should be mandatory in the spec for cross language operation? (a unit-aware Scala writer with a Fahrenheit field and a non-unit-aware reader with a Celsius field) To what degree would a generic reader of Avro data be required to support quantity wrappers (i.e. how can we opt-in/opt-out cleanly from being unit-aware)? {quote}

If the necessary "unit" information is present on both the writer schema and the reader schema, then I believe this might be made "mandatory" across languages. The values themselves (in the code) might not have any unit types attached (as I support with coulomb), but the unit fields on the schema could be checked for compatibility and converted. In that sense, we might make actual unit types in a language optional. This might be a way to provide meaningful support for languages that either can't or don't yet support a concept of unit type in the code itself.

My tentative idea for a policy on this is: if data is written using a schema with units, then either the reader schema must also provide a compatible unit, and/or the code must somehow specify the requested unit. Otherwise, a read error will be raised.

{quote}At scale, I'd be particularly keen to see the conversion detection (between two schemas / fields / quantities) take place once, and then the calculation reused for all of the subsequent datum passing through, but I'm not sure how that would work {quote}

In my current implementation, a unit conversion factor is computed once, and then it is cached on the schema itself, and detected via key lookup on subsequent reads. This actually was not nearly as slow as I'd feared, but it is still an extra key lookup per read. When the unit is coming from the read call (as I currently do it), I am not very sure how to do better.
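A rough Python sketch of the two ideas above: the tentative read policy (data written with a unit requires a compatible unit from the reader schema or from the caller, otherwise a read error) and computing the conversion factor once and reusing it on later reads. Everything here is hypothetical, including `UnitReader`, the `FACTORS` table, and the error type; it does not reflect coulomb-avro or PR #841.

```python
from fractions import Fraction

# Hypothetical table of supported conversions: (writer unit, reader unit) -> factor.
FACTORS = {
    ("second", "millisecond"): Fraction(1000),
    ("millisecond", "second"): Fraction(1, 1000),
}


class UnitReadError(Exception):
    """Raised when written units cannot be reconciled with the requested ones."""


class UnitReader:
    def __init__(self, writer_unit, reader_unit=None):
        self.writer_unit = writer_unit
        self.reader_unit = reader_unit
        self._cache = {}  # requested unit -> factor, filled on first use
        self.computed = 0  # how many times a factor was actually worked out

    def _factor(self, unit):
        if unit not in self._cache:
            self.computed += 1
            if unit == self.writer_unit:
                self._cache[unit] = Fraction(1)
            elif (self.writer_unit, unit) in FACTORS:
                self._cache[unit] = FACTORS[(self.writer_unit, unit)]
            else:
                raise UnitReadError(
                    f"{self.writer_unit} is not convertible to {unit}")
        return self._cache[unit]  # later reads cost one dict lookup

    def read(self, raw_value, unit=None):
        requested = unit or self.reader_unit
        if self.writer_unit is None:
            return raw_value  # nothing was declared by the writer: pass through
        if requested is None:
            raise UnitReadError(
                "data was written with a unit; the reader must supply one")
        return raw_value * self._factor(requested)


reader = UnitReader(writer_unit="second", reader_unit="millisecond")
values = [reader.read(v) for v in (1, 2, 3)]
```

After the three reads, `reader.computed` is 1: the factor was resolved on the first read and fetched from the cache afterwards, which mirrors the "extra key lookup per read" cost mentioned above.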
If the writer and reader schemas are being resolved in the Avro system itself, I can imagine better performance, equivalent to just checking a boolean per read. You can see what the current code does [here|https://github.com/erikerlandson/coulomb/blob/develop/coulomb-avro/src/main/scala/coulomb/avro/package.scala#L51]. I'm optimistic that the Avro schema dev community might have good ideas here.

{quote}(quantity of items, percents and other ratios, ratings) {quote}

My possibly controversial position is that "items", percents and ratings *do* have implied units. Ratios of course are likely to be truly unitless, although measures such as angular degrees, radians, etc, are useful units that are secretly derived from "Unitless". That is, we deal every day with many kinds of units, but because they are not SI units, we are trained to not think of them as unit quantities. As a simple example, the length of a vector of Products is an integer value with unit "Product", and the length of a vector of People is an integer having unit "People", etc. In my
[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields
[ https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885691#comment-16885691 ]

Erik Erlandson commented on AVRO-2474:
--------------------------------------

(copied from email thread)

Regarding schema, my proposal for fingerprints would be that units are fingerprinted based on their canonical form, as [defined here|http://erikerlandson.github.io/blog/2019/05/03/algorithmic-unit-analysis/]. Any two unit expressions having the same canonical form (including the corresponding coefficients) are exactly equivalent, and so their fingerprints can be the same. Possibly the unit could be stored on the schema in canonical form by convention, although canonical forms are frequently less intuitive, so in that case the documentation value of the unit might be reduced for humans examining the schema.

For schema evolution, a unit change such that the previous and new unit are convertible (also defined at the above link) would be well defined, and the automatic transformation would just be the correct unit conversion (e.g. seconds to milliseconds). If the unit changes to a non-convertible unit (e.g. seconds to bytes) then no automatic transformation exists, and attempting to resolve the old and new schema would be an error. Note that establishing the conversion assumes that both original and new schemas are available at read time.
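The fingerprinting and evolution rules above can be sketched in Python under one assumed representation of a canonical form: a rational coefficient plus a map from base units to integer exponents. The `DEFS` table and the `canonical`, `fingerprint`, and `conversion_factor` helpers are hypothetical, and SHA-256 over a small text encoding is only a stand-in for whatever fingerprint scheme would actually be adopted.

```python
import hashlib
from fractions import Fraction

# Hypothetical definitions: unit name -> (coefficient, {base unit: exponent}).
DEFS = {
    "second": (Fraction(1), {"second": 1}),
    "millisecond": (Fraction(1, 1000), {"second": 1}),
    "byte": (Fraction(1), {"byte": 1}),
    "kilobyte": (Fraction(1000), {"byte": 1}),
}


def canonical(expr):
    """Reduce [(unit, exponent), ...] to (coefficient, sorted base exponents)."""
    coef, dims = Fraction(1), {}
    for unit, exp in expr:
        c, base = DEFS[unit]
        coef *= c ** exp
        for b, e in base.items():
            dims[b] = dims.get(b, 0) + e * exp
    return coef, tuple(sorted((b, e) for b, e in dims.items() if e != 0))


def fingerprint(expr):
    """Hash the canonical form, so equivalent expressions hash the same."""
    coef, dims = canonical(expr)
    text = f"{coef.numerator}/{coef.denominator}|{dims}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def conversion_factor(old, new):
    """Factor taking values in `old` units to `new` units, if convertible."""
    (c_old, d_old), (c_new, d_new) = canonical(old), canonical(new)
    if d_old != d_new:
        raise ValueError("units are not convertible")
    return c_old / c_new


# Two spellings of the same rate: 1000 bytes per second either way.
kb_per_s = [("kilobyte", 1), ("second", -1)]
b_per_ms = [("byte", 1), ("millisecond", -1)]
same = fingerprint(kb_per_s) == fingerprint(b_per_ms)

# Schema evolution from seconds to milliseconds multiplies values by 1000.
factor = conversion_factor([("second", 1)], [("millisecond", 1)])
```

Here `same` is true because both expressions reduce to the same coefficient and base exponents, while `conversion_factor([("second", 1)], [("byte", 1)])` raises `ValueError`, matching the seconds-to-bytes error case.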