[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2021-03-08 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297494#comment-17297494
 ] 

Erik Erlandson commented on AVRO-2474:
--

[~rskraba] I am definitely continuing to work on coulomb. The next big 
development push will be the move to Scala 3.

I submitted a draft AEP: 
[https://docs.google.com/document/d/1IeVAtf6YcAAn35D4jmFQJjPpEMgEu79wWVMW37KvNps/edit]

As far as I know, it received no feedback and was never put to a vote.

I like the idea of being able to consume support for unit analysis as a 
plug-in.  If people are interested, I can look into it.

> Support a "unit" property of schema fields
> --
>
> Key: AVRO-2474
> URL: https://issues.apache.org/jira/browse/AVRO-2474
> Project: Apache Avro
>  Issue Type: Improvement
>  Components: spec
>Affects Versions: 1.9.0
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: aep
>
> Recently I have been experimenting with Avro schemas that are extended with a 
> "unit" field. By "unit" I mean expressions like "second" or "megabyte", 
> that is, "units of measure".
>  
> I received some community interest in making this concept "first class" for 
> Avro; I'm filing this JIRA to track the idea. 
>  
> I delivered a short talk on my experiments at Berlin Buzzwords, which can be 
> viewed here:
> [https://www.youtube.com/watch?v=qrQmB2KFKE8]
>  
> I also wrote a short blog post that may be faster to ingest:
> [http://erikerlandson.github.io/blog/2019/05/23/unit-types-for-avro-schema-integrating-avro-with-coulomb/]
>  
> The project itself is here:
> [https://github.com/erikerlandson/coulomb]
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (AVRO-2474) Support a "unit" property of schema fields

2021-03-06 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved AVRO-2474.
--
Resolution: Abandoned



[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2020-07-23 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163837#comment-17163837
 ] 

Erik Erlandson commented on AVRO-2474:
--

[~iemejia]  [~rskraba]  how should this proceed?

 



[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2020-05-26 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116783#comment-17116783
 ] 

Erik Erlandson commented on AVRO-2474:
--

[~rskraba]  [~iemejia]   I agree that this feature ought to be voted on, or 
otherwise formally discussed by the community. It involves a few hundred lines 
of new code (per language).  I submitted pr #841 so that the community could 
see how this feature would work, and what the implementation entails, etc, so 
people have something "real" to make a decision with, instead of voting on an 
abstract AEP doc.

It involves more than just code - as you can see in the PR I need to define a 
JSON (sub)schema for expressing units, and while I'm happy with the shape of 
the current proposal, there are multiple design choices that might be made here.

Lastly, adopting this implies committing to multiple language implementations, 
not just Python. The good news is that it can easily enough be implemented on a 
per-language basis, but eventually implementations would be needed for at least 
some popular subset of the Avro language bindings (I'm guessing at least 
Python, Java and C++).

I do not know what the timeline for the 1.10 branch cut is. In theory I could 
massage #841 into a mergeable state fairly quickly, but it is a high-impact 
feature and I don't really want to rush it.



[jira] [Commented] (AVRO-2748) python schema resolution occurs on every read

2020-02-28 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047663#comment-17047663
 ] 

Erik Erlandson commented on AVRO-2748:
--

In a sense, resolving "int" against ["int", "string"] is not a type-safe match. 
 I can see why someone might want to allow it, but I can also imagine not 
wanting it to succeed, for exactly the reason you showed - it can fail partway 
through a data set.

It makes me wonder if there should be two modes of schema resolution. The mode 
that exists, which is sort of like "runtime type checking" and another mode 
that is closer to "compile-time type checking" in the sense that it (1) happens 
once, up front, and (2) if it does succeed, you can safely assume all your data 
reads will succeed.
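
A minimal sketch of the two modes, under stated assumptions: schemas are 
modelled as plain Python values (a string for a primitive type, a list for a 
union), the promotion table follows the primitive promotions listed in the Avro 
specification, and the names lenient_match and strict_match are hypothetical, 
not part of the avro package. It also assumes the case under discussion has the 
union on the writer side and "int" on the reader side.

```python
# Illustrative model only: a str is a primitive type, a list is a union.
# Writer type -> reader types it may be promoted to (per the Avro spec).
PROMOTIONS = {
    "int": {"int", "long", "float", "double"},
    "long": {"long", "float", "double"},
    "float": {"float", "double"},
    "double": {"double"},
    "string": {"string", "bytes"},
    "bytes": {"bytes", "string"},
}


def _branches(schema):
    """Treat a non-union schema as a union with a single branch."""
    return schema if isinstance(schema, list) else [schema]


def _primitive_ok(writer, reader):
    """True if a value written as `writer` can be read as `reader`."""
    return reader in PROMOTIONS.get(writer, {writer})


def lenient_match(writer, reader):
    """Runtime-style check: at least one writer branch could be read."""
    return any(
        _primitive_ok(w, r) for w in _branches(writer) for r in _branches(reader)
    )


def strict_match(writer, reader):
    """Up-front check: every writer branch can be read by some reader branch."""
    return all(
        any(_primitive_ok(w, r) for r in _branches(reader))
        for w in _branches(writer)
    )
```

With a writer schema of ["int", "string"] and a reader schema of "int", 
lenient_match returns True while strict_match returns False. Only the strict 
result guarantees that no individual datum will fail later.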

> python schema resolution occurs on every read
> -
>
> Key: AVRO-2748
> URL: https://issues.apache.org/jira/browse/AVRO-2748
> Project: Apache Avro
>  Issue Type: Bug
>  Components: python
>Affects Versions: 1.9.2
>Reporter: Erik Erlandson
>Priority: Minor
>
> In python, the schema resolution appears to be happening on each read 
> operation. I'm not an avro expert but in my perusing through the python io 
> code I haven't yet noticed a reason that the schema resolution couldn't 
> happen once up front, during the construction of DataFileReader, when it 
> first loads the write_schema.





[jira] [Commented] (AVRO-2748) python schema resolution occurs on every read

2020-02-24 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043533#comment-17043533
 ] 

Erik Erlandson commented on AVRO-2748:
--

Yes, matching schemas once, during DatumReader construction, is exactly what I 
am thinking.  And I think you hit on the case I was confused about - resolving 
"union" types, where union options might or might not be compatible.

One idea I was toying with was doing once-up-front schema matching IF such 
matches are unambiguous - i.e. if no union types are in play. Possibly I am 
still missing some subtleties, but if neither the write nor the read schema has 
unions, then it still seems possible to either match or fail up front and not 
have to do it again. Schemas with no union types seem like a pretty relevant 
use case.
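
As a rough illustration of the "no union types are in play" test, the sketch 
below walks a schema in its parsed-JSON form (dicts, lists and strings) and 
reports whether a union appears anywhere. The function name contains_union is 
hypothetical and this is not the avro package API. Named-type references given 
as bare strings are treated as leaves, which is a simplification.

```python
def contains_union(schema):
    """Return True if a JSON-style Avro schema contains a union anywhere."""
    if isinstance(schema, list):
        # In Avro's JSON encoding, a list is a union.
        return True
    if isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            return any(contains_union(f["type"]) for f in schema.get("fields", []))
        if kind == "array":
            return contains_union(schema["items"])
        if kind == "map":
            return contains_union(schema["values"])
        # Covers wrappers such as {"type": "int"} or {"type": ["null", "int"]}.
        return contains_union(kind)
    # A bare string: a primitive or a reference to a named type.
    return False
```

If this returns False for both the write schema and the read schema, a reader 
could attempt to resolve the pair a single time at construction and either 
succeed or fail there.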



[jira] [Commented] (AVRO-2748) python schema resolution occurs on every read

2020-02-22 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042660#comment-17042660
 ] 

Erik Erlandson commented on AVRO-2748:
--

Currently the 'match_schemas' function is flat: it doesn't recursively check 
schema structures, but instead allows the recursion on data reading to drive 
structural recursion.  Schema matching itself should be made recursive, driven 
by read-schema so it can ignore any write-schema structures that aren't being 
requested for read.
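
A minimal sketch of read-schema-driven recursive matching, assuming schemas in 
parsed-JSON form. The name match_schemas_recursive is hypothetical (it is not 
the existing 'match_schemas' function), and unions, enums, fixed types and 
record-name checks are deliberately left out to keep the idea visible.

```python
# Writer primitive -> reader primitives it may be promoted to (per the Avro spec).
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def match_schemas_recursive(writer, reader):
    """Recursively check that `reader` can be satisfied from `writer`.

    The walk follows the reader schema, so writer fields that the reader
    does not request are never inspected.
    """
    if isinstance(reader, str) and isinstance(writer, str):
        return reader == writer or reader in PROMOTIONS.get(writer, ())
    if not (isinstance(reader, dict) and isinstance(writer, dict)):
        return False
    if reader["type"] != writer["type"]:
        return False
    kind = reader["type"]
    if kind == "record":
        writer_fields = {f["name"]: f for f in writer["fields"]}
        for rf in reader["fields"]:
            wf = writer_fields.get(rf["name"])
            if wf is None:
                # A field missing from the writer needs a reader default.
                if "default" not in rf:
                    return False
                continue
            if not match_schemas_recursive(wf["type"], rf["type"]):
                return False
        return True
    if kind == "array":
        return match_schemas_recursive(writer["items"], reader["items"])
    if kind == "map":
        return match_schemas_recursive(writer["values"], reader["values"])
    return False  # other kinds are out of scope for this sketch
```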



[jira] [Updated] (AVRO-2748) python schema resolution occurs on every read

2020-02-22 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/AVRO-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated AVRO-2748:
-
Description: In python, the schema resolution appears to be happening on 
each read operation. I'm not an avro expert but in my perusing through the 
python io code I haven't yet noticed a reason that the schema resolution 
couldn't happen once up front, during the construction of DataFileReader, when 
it first loads the write_schema.  (was: In python, the schema resolution 
appears to be happening on each read operation. I'm not an avro expert but in 
my perusing through the py3 io code I haven't yet noticed a reason that the 
schema resolution couldn't happen once up front, during the construction of 
DataFileReader, when it first loads the write_schema.)



[jira] [Updated] (AVRO-2748) python schema resolution occurs on every read

2020-02-22 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/AVRO-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated AVRO-2748:
-
Summary: python schema resolution occurs on every read  (was: py3 schema 
resolution occurs on every read)



[jira] [Commented] (AVRO-2748) py3 schema resolution occurs on every read

2020-02-22 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042646#comment-17042646
 ] 

Erik Erlandson commented on AVRO-2748:
--

[~kojiromike] thanks for the heads up about py3!  I'll have to move my dev on 
AVRO-2474 to 'py'.

I'm not sure how to visibly reproduce besides adding print statements, but you 
can see that the 'match_schemas' method is called in 'read_data' here:

[https://github.com/apache/avro/blob/master/lang/py/avro/io.py#L669]

And that's called, for example, on each iteration of '__next__':

[https://github.com/apache/avro/blob/master/lang/py/avro/datafile.py#L336]

 



[jira] [Created] (AVRO-2748) py3 schema resolution occurs on every read

2020-02-18 Thread Erik Erlandson (Jira)
Erik Erlandson created AVRO-2748:


 Summary: py3 schema resolution occurs on every read
 Key: AVRO-2748
 URL: https://issues.apache.org/jira/browse/AVRO-2748
 Project: Apache Avro
  Issue Type: Bug
  Components: python
Affects Versions: 1.9.2
Reporter: Erik Erlandson


In python, the schema resolution appears to be happening on each read 
operation. I'm not an avro expert but in my perusing through the py3 io code I 
haven't yet noticed a reason that the schema resolution couldn't happen once up 
front, during the construction of DataFileReader, when it first loads the 
write_schema.





[jira] [Updated] (AVRO-2474) Support a "unit" property of schema fields

2019-09-18 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated AVRO-2474:
-
Fix Version/s: 1.10.0



[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2019-09-17 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931958#comment-16931958
 ] 

Erik Erlandson commented on AVRO-2474:
--

I have written up a draft of an Avro Enhancement Proposal that describes a 
roadmap where unit expression parsing and conversion are done as part of 
resolving reader schemas with writer schemas:

[Avro Enhancement Proposal (AEP): Unit 
Metadata|https://docs.google.com/document/d/1IeVAtf6YcAAn35D4jmFQJjPpEMgEu79wWVMW37KvNps/]

 



[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2019-07-16 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886436#comment-16886436
 ] 

Erik Erlandson commented on AVRO-2474:
--

Questions & comments copied from Ryan Skraba, via email thread:

 
{quote}Would you expect a writer schema that wrote a Fahrenheit field and a
reader schema that reads Celsius to interact transparently with generic
data?
{quote}
(With the caveat that there is a lot I don't know about Avro) IIUC, the writer 
schema is saved with written data. So if a writer schema had 
"unit":"fahrenheit", and on input a reader schema had "unit":"celsius", then I 
would expect this to be detected and correctly converted, transparently.

Tangentially: "fahrenheit", "celsius" & "kelvin" are interesting because they 
might denote either a "unit" (a quantity of degrees) or an actual "temperature" 
(having a particular offset). In general, "temperature" is not the same thing 
as a "unit" of degrees, see 
[here|https://github.com/erikerlandson/coulomb#temperature-values]. The upshot 
is that there will be "temperature" attributes on a schema as well as "unit" of 
degrees. Similarly there will be "timestamp" and/or "date", as well as "unit" 
of time, although that is something I haven't added to coulomb yet.
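
To make the distinction concrete, here is a small sketch using the standard 
Fahrenheit and Celsius relationships. The function names are hypothetical and 
are not taken from coulomb. A quantity of degrees converts by scale only, while 
a temperature also needs the 32-degree offset.

```python
from fractions import Fraction

SCALE = Fraction(5, 9)  # one Fahrenheit degree is 5/9 of a Celsius degree


def fahrenheit_degrees_to_celsius_degrees(delta_f):
    """Convert a quantity of degrees (a difference): scale only, no offset."""
    return Fraction(delta_f) * SCALE


def fahrenheit_temperature_to_celsius(temp_f):
    """Convert a point on the temperature scale: remove the offset, then scale."""
    return (Fraction(temp_f) - 32) * SCALE
```

The same number therefore converts differently depending on its meaning: 32 as 
a temperature becomes 0 Celsius, while 32 as a quantity of degrees becomes 
160/9 Celsius degrees.
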
{quote}What about conversions that lose precision (i.e., if the above conversion
was on an INT field)
{quote}
This is a great question; there is not one obvious policy. In coulomb, my 
default conversion policy is "best effort", which I 
[define|https://github.com/erikerlandson/coulomb/blob/develop/coulomb/src/main/scala/coulomb/unitops/unitops.scala#L183]
as: translate the input and conversion factor to Rational, multiply, and then 
convert to Integer (or Long, etc). The tradeoff here is some compute cost. 
Other policies could be defined that are faster (and maybe even more aligned 
with standard float to int interactions). I do not have a strong opinion on 
this. I think I'd want to do whatever is most intuitive for most members of the 
community.
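
A sketch of what such a "best effort" policy could look like in Python, using 
fractions.Fraction as the rational type. The function name is hypothetical, and 
truncation toward zero in the final step is an assumption of this sketch, not 
something confirmed from the coulomb source.

```python
from fractions import Fraction


def convert_best_effort(value: int, factor: Fraction) -> int:
    """Multiply exactly as a rational, then truncate toward zero to an int.

    The multiplication itself loses nothing; precision is lost only in the
    final conversion back to an integer.
    """
    return int(Fraction(value) * factor)
```

For example, converting 150 seconds to whole minutes with a factor of 1/60 
gives 2, and converting 10 Fahrenheit degrees to Celsius degrees with a factor 
of 5/9 gives 5.
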
{quote}How much of "unit" support should be mandatory in the spec for cross 
language operation?  (a unit-aware Scala writer with a Fahrenheit field and a 
non-unit-aware reader with a Celsius field) To what degree would a generic 
reader of Avro data be required to support quantity wrappers (i.e. how can we 
opt-in/opt-out cleanly from being unit-aware)?
{quote}
If the necessary "unit" information is present on both the writer schema and 
reader schema, then I believe this might be made "mandatory" across languages. 
The values themselves (in the code) might not have any unit types attached (as 
I support with coulomb), but the unit fields on the schema could be checked for 
compatibility and converted. In that sense, we might make actual unit-types in 
a language optional. This might be a way to provide meaningful support for 
languages that either can't or don't yet support a concept of unit type in the 
code itself. My tentative idea for a policy on this is: if data is written 
using a schema with units, then either the reader-schema must also provide a 
compatible unit, and/or the code must somehow specify the requested unit. 
Otherwise, a read error will be raised.
{quote}At scale, I'd be particularly keen to see the conversion detection 
(between two schemas / fields / quantities) take place once, and then the
calculation reused for all of the subsequent datum passing through, but I'm
not sure how that would work
{quote}
In my current implementation, a unit conversion factor is computed once, and 
then it is cached on the schema itself, and detected via key lookup on 
subsequent reads. This actually was not nearly as slow as I'd feared, but it is 
still an extra key lookup per read. When the unit is coming from the read call 
(as I currently do it), I am not very sure how to do better. If the write and 
read schema are being resolved in the avro system itself, I can imagine better 
performance, equivalent to just checking a boolean per read. You can see what 
the current code does 
[here|https://github.com/erikerlandson/coulomb/blob/develop/coulomb-avro/src/main/scala/coulomb/avro/package.scala#L51].
I'm optimistic that the Avro schema dev community might have good ideas here.
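
The compute-once-then-look-up pattern can be sketched in Python as follows. 
This is not the coulomb code linked above: the UNITS table, the UnitResolver 
class and its counter are all hypothetical, and the cache lives on a resolver 
object here, not on the schema.

```python
from fractions import Fraction

# Hypothetical unit table: name -> (dimension, factor to that dimension's base unit).
UNITS = {
    "second": ("time", Fraction(1)),
    "millisecond": ("time", Fraction(1, 1000)),
    "minute": ("time", Fraction(60)),
    "byte": ("information", Fraction(1)),
    "megabyte": ("information", Fraction(10**6)),
}


class UnitResolver:
    """Caches the writer-to-reader conversion factor per unit pair."""

    def __init__(self):
        self._cache = {}
        self.computed = 0  # how many factors were actually computed

    def factor(self, writer_unit, reader_unit):
        key = (writer_unit, reader_unit)
        if key not in self._cache:
            w_dim, w_factor = UNITS[writer_unit]
            r_dim, r_factor = UNITS[reader_unit]
            if w_dim != r_dim:
                raise ValueError(f"cannot convert {writer_unit} to {reader_unit}")
            self._cache[key] = w_factor / r_factor
            self.computed += 1
        # Every later call for the same pair is a single dictionary lookup.
        return self._cache[key]
```

Each read then pays one dictionary lookup, which corresponds to the extra key 
lookup per read described above.
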
{quote}(quantity of items, percents and other ratios, ratings)
{quote}
My possibly-controversial position is that "items", percents and ratings *do* 
have implied units. Ratios of course are likely to be truly unitless, although 
measures such as angular degrees, radians, etc, are useful units that are 
secretly derived from "Unitless".

That is, we deal every day with many kinds of units, but because they are not 
SI-units, we are trained to not think of them as unit quantities. As a simple 
example, the length of a vector of Products is an integer value with unit 
"Product", and the length of a vector of People is an integer having unit 
"People

[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2019-07-15 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885691#comment-16885691
 ] 

Erik Erlandson commented on AVRO-2474:
--

(copied from email thread)

 
Regarding schema, my proposal for fingerprints would be that units are 
fingerprinted based on their canonical form, as [defined 
here|http://erikerlandson.github.io/blog/2019/05/03/algorithmic-unit-analysis/].
 Any two unit expressions having the same canonical form (including the 
corresponding coefficients) are exactly equivalent, and so their fingerprints 
can be the same. Possibly the unit could be stored on the schema in canonical 
form by convention, although canonical forms are frequently not as intuitive to 
humans and so in that case the documentation value of the unit might be reduced 
for humans examining the schema.
 
For schema evolution, a unit change such that the previous and new unit are 
convertible (also defined at the above link) would be well defined, and the 
automatic transformation would just be the correct unit conversion (e.g. 
seconds to milliseconds). If the unit changes to a non-convertible unit (e.g. 
seconds to bytes) then no automatic transformation exists, and attempting to 
resolve the old and new schema would be an error. Note that establishing the 
conversion assumes that both original and new schemas are available at read 
time.
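
A simplified stand-in for these ideas, written as a Python sketch. It assumes a 
canonical form made of a coefficient plus a sorted list of base-unit exponents. 
The linked post defines the real canonical form, which may differ in detail. 
The BASE table and both function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical definitions: unit -> (coefficient, {base unit: exponent}).
BASE = {
    "meter": (Fraction(1), {"meter": 1}),
    "kilometer": (Fraction(1000), {"meter": 1}),
    "second": (Fraction(1), {"second": 1}),
    "millisecond": (Fraction(1, 1000), {"second": 1}),
    "hour": (Fraction(3600), {"second": 1}),
    "byte": (Fraction(1), {"byte": 1}),
}


def canonical(terms):
    """Reduce [(unit name, integer exponent), ...] to (coefficient, signature)."""
    coef = Fraction(1)
    exps = {}
    for name, power in terms:
        unit_coef, bases = BASE[name]
        coef *= unit_coef ** power
        for base, exp in bases.items():
            exps[base] = exps.get(base, 0) + exp * power
    signature = tuple(sorted((b, e) for b, e in exps.items() if e != 0))
    return coef, signature


def conversion_factor(old, new):
    """Factor that turns a value in `old` units into `new` units."""
    old_coef, old_sig = canonical(old)
    new_coef, new_sig = canonical(new)
    if old_sig != new_sig:
        raise ValueError("units are not convertible")
    return old_coef / new_coef
```

Two expressions whose canonical forms are equal, coefficient included, could 
share a fingerprint. Equal signatures with different coefficients mean the 
units are convertible, as with seconds to milliseconds (factor 1000). Different 
signatures, as with seconds and bytes, make resolution an error.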
 




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (AVRO-2474) Support a "unit" property of schema fields

2019-07-15 Thread Erik Erlandson (JIRA)
Erik Erlandson created AVRO-2474:


 Summary: Support a "unit" property of schema fields
 Key: AVRO-2474
 URL: https://issues.apache.org/jira/browse/AVRO-2474
 Project: Apache Avro
  Issue Type: Improvement
  Components: spec
Affects Versions: 1.9.0
Reporter: Erik Erlandson


Recently I have been experimenting with Avro schemas that are extended with a 
"unit" field. By "unit" I mean expressions like "second" or "megabyte", that 
is, "units of measure".
 
I received some community interest in making this concept "first class" for 
Avro; I'm filing this JIRA to track the idea. 
 
I delivered a short talk on my experiments at Berlin Buzzwords, which can be 
viewed here:
[https://www.youtube.com/watch?v=qrQmB2KFKE8]
 
I also wrote a short blog post that may be faster to ingest:
[http://erikerlandson.github.io/blog/2019/05/23/unit-types-for-avro-schema-integrating-avro-with-coulomb/]
 
The project itself is here:
[https://github.com/erikerlandson/coulomb]
 
 
 


