Re: supporting a "unit" field for avro schema

2019-07-16 Thread Erik Erlandson
Hi Ryan,
Those are all great questions. They're all issues I have ideas about but
I'd want Avro community input for as well. For that reason I answered them
all on AVRO-2474 
Cheers!
E

On Tue, Jul 16, 2019 at 3:13 AM Ryan Skraba  wrote:

> Hello!  I've been thinking about this and I generally like the idea of
> stronger types with units :D
>
> I have some questions about what you are thinking of when you say "first
> class concept" in Avro:
> - Would you expect a writer schema that wrote a Fahrenheit field and a
> reader schema that reads Celsius to interact transparently with generic
> data?
> - What about conversions that lose precision (i.e., if the above conversion
> was on an INT field)?
> - How much of "unit" support should be mandatory in the spec for cross
> language operation?  (a unit-aware Scala writer with a Fahrenheit field and
> a non-unit-aware reader with a Celsius field).
> - To what degree would a generic reader of Avro data be required to support
> quantity wrappers (i.e. how can we opt-in/opt-out cleanly from being
> unit-aware)?
>
> At scale, I'd be particularly keen to see the conversion detection (between
> two schemas / fields / quantities) take place once, and then the
> calculation reused for all of the subsequent datum passing through, but I'm
> not sure how that would work.
>
> We have some experience with passing a lot of client data through Avro, and
> we use generic data quite a bit -- I'd be tempted to think of "float
> (metres)" as a distinct type from "float (minutes)", but it would be a huge
> (but potentially interesting) change for the way we look at data.  That
> being said, as far as units go, we see a lot more unitless values (quantity
> of items, percents and other ratios, ratings).  The most frequent numeric
> values with units that we see are probably money or geolocation (in
> practice, already normalized to lat/long -- although I just learned about
> UTM!).  Surprisingly, there's not as much SI-type unit data as you might
> expect.
>
> I can definitely see the value of using a "unit" annotation in a generated
> specific record for a supported language -- as proven by your scala work!
> That might be an easy first target while working out what a first-class
> concept in the spec would entail.  I missed Berlin Buzzwords by a day, but
> enjoyed the video, thanks!
>
> Ryan
>
>
>
> On Tue, Jul 16, 2019 at 1:24 AM Erik Erlandson 
> wrote:
>
> > If I'm interpreting the situation correctly, there is an "Avro
> Enhancement
> > Proposal" process, but none have been filed in nearly a decade:
> >
> https://cwiki.apache.org/confluence/display/AVRO/Avro+Enhancement+Proposals
> >
> > As a start, I submitted a jira to track this idea:
> > https://issues.apache.org/jira/browse/AVRO-2474
> >
> >
> >
> > On Mon, Jul 8, 2019 at 10:42 AM Erik Erlandson 
> > wrote:
> >
> > >
> > > What should I do to move this forward? Does Avro have a PIP process?
> > >
> > >
> > > On Sat, Jun 29, 2019 at 3:26 PM Erik Erlandson 
> > > wrote:
> > >
> > >>
> > >> Regarding schema, my proposal for fingerprints would be that units are
> > >> fingerprinted based on their canonical form, as defined here
> > >> <
> >
> http://erikerlandson.github.io/blog/2019/05/03/algorithmic-unit-analysis/
> > >.
> > >> Any two unit expressions having the same canonical form (including the
> > >> corresponding coefficients) are exactly equivalent, and so their
> > >> fingerprints can be the same. Possibly the unit could be stored on the
> > >> schema in canonical form by convention, although canonical forms are
> > >> frequently not as intuitive to humans and so in that case the
> > documentation
> > >> value of the unit might be reduced for humans examining the schema.
> > >>
> > >> For schema evolution, a unit change such that the previous and new
> unit
> > >> are convertible (also defined at the above link) would be well
> > defined,
> > >> and automatic transformation would just be the correct unit conversion
> > >> (e.g. seconds to milliseconds). If the unit changes to a
> non-convertible
> > >> unit (e.g. seconds to bytes) then no automatic transformation exists,
> > and
> > >> attempting to resolve the old and new schema would be an error. Note
> > that
> > >> establishing the conversion assumes that both original and new schemas
> > are
> > >> available at read time.
> > >>
> > >>
> > >> On Sat, Jun 29, 2019 at 11:55 AM Niels Basjes  wrote:
> > >>
> > >>> I think we should approach this idea in two parts:
> > >>>
> > >>> 1) The schema. Things like does a different unit mean a different
> > schema
> > >>> fingerprint even though the bytes remain the same. What does a
> > different
> > >>> unit mean for schema evolution.
> > >>>
> > >>> 2) Language specifics. Scala has different possibilities than Java.
> > >>>
> > >>> On Sat, Jun 29, 2019, 18:59 Erik Erlandson 
> > wrote:
> > >>>
> > >>> > I've been puzzling over what can be done to support this in more
> >>> > widely-used languages.

[jira] [Commented] (AVRO-2474) Support a "unit" property of schema fields

2019-07-16 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/AVRO-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886436#comment-16886436
 ] 

Erik Erlandson commented on AVRO-2474:
--

Questions & comments copied from Ryan Skraba, via email thread:

 
{quote}Would you expect a writer schema that wrote a Fahrenheit field and a
 reader schema that reads Celsius to interact transparently with generic
 data?
{quote}
(With the caveat that there is a lot I don't know about Avro) IIUC, the writer 
schema is saved with written data. So if a writer schema had 
"unit":"fahrenheit", and on input a reader schema had "unit":"celsius", then I 
would expect this to be detected and correctly converted, transparently.

Tangentially: "fahrenheit", "celsius" & "kelvin" are interesting because they 
might denote either a "unit" (a quantity of degrees) or an actual "temperature" 
(having a particular offset). In general, "temperature" is not the same thing 
as a "unit" of degrees, see 
[here|https://github.com/erikerlandson/coulomb#temperature-values]. The upshot 
is that there will be "temperature" attributes on a schema as well as "unit" of 
degrees. Similarly there will be "timestamp" and/or "date", as well as "unit" 
of time, although that is something I haven't added to coulomb yet.
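As a small illustration of the scale-vs-offset distinction above (plain Python, nothing Avro-specific; the function names are mine):

```python
# A plain unit change is a rescaling: converting only multiplies by a factor.
def seconds_to_millis(x):
    return x * 1000

# A temperature is affine: it has an offset as well as a scale, so
# fahrenheit -> celsius is not a pure unit conversion.
def fahrenheit_to_celsius(t):
    return (t - 32.0) * 5.0 / 9.0

assert seconds_to_millis(2) == 2000
assert fahrenheit_to_celsius(212.0) == 100.0
assert fahrenheit_to_celsius(32.0) == 0.0
```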
{quote}What about conversions that lose precision (i.e., if the above conversion
 was on an INT field)
{quote}
This is a great question; there is not one obvious policy. In coulomb, my 
default conversion policy is "best effort", which I 
[define|https://github.com/erikerlandson/coulomb/blob/develop/coulomb/src/main/scala/coulomb/unitops/unitops.scala#L183]
 as: translate the input and conversion factor to Rational, multiply, and then 
convert to Integer (or Long, etc). The tradeoff here is some compute cost. 
Other policies could be defined that are faster (and maybe even more aligned 
with standard float to int interactions). I do not have a strong opinion on 
this. I think I'd want to do whatever is most intuitive for most members of the 
community.
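A sketch of that "best effort" policy using Python's Fraction in place of coulomb's Rational (the function name is illustrative, not coulomb's API):

```python
from fractions import Fraction

def convert_integral(value, factor):
    """Translate the input and conversion factor to rationals,
    multiply exactly, then round back to an integer type."""
    return round(Fraction(value) * Fraction(factor))

# Exact: 3 minutes -> 180 seconds
assert convert_integral(3, 60) == 180
# Lossy: 10 inches -> feet rounds 10/12 up to 1
assert convert_integral(10, Fraction(1, 12)) == 1
```

The exact rational multiply is what costs extra compute relative to a plain float multiply-and-truncate.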
{quote}How much of "unit" support should be mandatory in the spec for cross 
language operation?  (a unit-aware Scala writer with a Fahrenheit field and a 
non-unit-aware reader with a Celsius field) To what degree would a generic 
reader of Avro data be required to support quantity wrappers (i.e. how can we 
opt-in/opt-out cleanly from being unit-aware)?
{quote}
If the necessary "unit" information is present on both the write schema and 
reader schema, then I believe this might be made "mandatory" across languages. 
The values themselves (in the code) might not have any unit types attached (as 
I support with coulomb), but the unit fields on the schema could be checked for 
compatibility and converted. In that sense, we might make actual unit-types in 
a language optional. This might be a way to provide meaningful support for 
languages that either can't or don't yet support a concept of unit type in the 
code itself. My tentative idea for a policy on this is: if data is written 
using a schema with units, then either the reader schema must also provide a 
compatible unit, or the code must somehow specify the requested unit. 
Otherwise, a read error will be raised.
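That tentative policy might look like the following sketch (all names are hypothetical, not part of Avro):

```python
def resolve_unit(writer_unit, reader_unit=None, requested_unit=None,
                 convertible=lambda a, b: a == b):
    """If data was written with a unit, a compatible unit must come either
    from the reader schema or from the reading code; otherwise the read
    fails with an error."""
    if writer_unit is None:
        return None  # unit-less data: nothing to enforce
    target = reader_unit if reader_unit is not None else requested_unit
    if target is None or not convertible(writer_unit, target):
        raise ValueError(f"cannot read {writer_unit!r} data without a compatible unit")
    return target

assert resolve_unit(None) is None
assert resolve_unit("second", reader_unit="second") == "second"
```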
{quote}At scale, I'd be particularly keen to see the conversion detection 
(between
 two schemas / fields / quantities) take place once, and then the
 calculation reused for all of the subsequent datum passing through, but I'm
 not sure how that would work
{quote}
In my current implementation, a unit conversion factor is computed once, and 
then it is cached on the schema itself, and detected via key lookup on 
subsequent reads. This actually was not nearly as slow as I'd feared, but it is 
still an extra key lookup per read. When the unit is coming from the read call 
(as I currently do it), I am not very sure how to do better. If the write and 
read schema are being resolved in the avro system itself, I can imagine better 
performance, equivalent to just checking a boolean per read. You can see what 
the current code does 
[here|https://github.com/erikerlandson/coulomb/blob/develop/coulomb-avro/src/main/scala/coulomb/avro/package.scala#L51].
 I'm optimistic that the Avro schema dev community might have good ideas here.
{quote}(quantity of items, percents and other ratios, ratings)
{quote}
My possibly-controversial position is that "items", percents and ratings *do* 
have implied units. Ratios of course are likely to be truly unitless, although 
measures such as angular degrees, radians, etc., are useful units that are 
secretly derived from "Unitless".

That is, we deal every day with many kinds of units, but because they are not 
SI-units, we are trained to not think of them as unit quantities. As a simple 
example, the length of a vector of Products is an integer value with unit 
"Product", and the length of a vector of People is an integer having unit 
"People".

[jira] [Commented] (AVRO-2469) Add data interop test to the Python3 bindings

2019-07-16 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/AVRO-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886195#comment-16886195
 ] 

Hudson commented on AVRO-2469:
--

SUCCESS: Integrated in Jenkins build AvroJava #701 (See 
[https://builds.apache.org/job/AvroJava/701/])
AVRO-2469: Add data interop test to the Python3 bindings (#581) (fokko: 
[https://github.com/apache/avro/commit/fcb4764468cd1d70b3341c1488a394bb8f20929b])
* (edit) lang/py3/avro/tests/test_datafile_interop.py
* (edit) lang/py3/avro/tests/gen_interop_data.py
* (edit) build.sh
* (edit) lang/py3/setup.py


> Add data interop test to the Python3 bindings
> -
>
> Key: AVRO-2469
> URL: https://issues.apache.org/jira/browse/AVRO-2469
> Project: Apache Avro
>  Issue Type: Test
>  Components: interop, python
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Major
>
> Currently, the Python3 bindings have a test called "TestDataFileInterop", but 
> it's not a real data interop test because it only checks read/write operation 
> within Python3 and doesn't read files generated by other languages.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [jira] [Created] (AVRO-2473) C#: Fix documentation warnings

2019-07-16 Thread Patrick Farry
Happy to review. 

I checked the changes into the current master. I’m not sure how to rebase 
(trying to get up to speed on GitHub) but it had the effect of squashing the 
changes which is probably a good thing. I created a new PR. Github is claiming 
it is mergeable so I’m hoping it is good to go.

> On Jul 15, 2019, at 5:47 PM, Brian Lachniet  wrote:
> 
> Hey Patrick, thank you! I actually have a draft PR up for this now:
> https://github.com/apache/avro/pull/586. I could certainly use a second
> pair of eyes on my changes, if you're willing to review them.
> 
> I want to get your Reflect changes in before we try to merge these changes
> in, though. I started to merge your reflect changes this past weekend but
> screwed up the rebase. Check out my latest comments on your PR
>  if you
> haven't seen them already.
> 
> On Sun, Jul 14, 2019 at 7:14 PM Patrick Farry 
> wrote:
> 
>> want some help with this?
>> 
>> On Sun, Jul 14, 2019, 4:56 AM Brian Lachniet (JIRA) 
>> wrote:
>> 
>>> Brian Lachniet created AVRO-2473:
>>> 
>>> 
>>> Summary: C#: Fix documentation warnings
>>> Key: AVRO-2473
>>> URL: https://issues.apache.org/jira/browse/AVRO-2473
>>> Project: Apache Avro
>>>  Issue Type: Improvement
>>>  Components: csharp
>>>Affects Versions: 1.9.0
>>>Reporter: Brian Lachniet
>>>Assignee: Brian Lachniet
>>> Fix For: 1.10.0, 1.9.1
>>> 
>>> 
>>> Fix the hundreds of documentation warnings in the C# project. These
>>> warnings include malformed documentation as well as missing documentation
>>> on public members.
>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
> -- 
> 
> 
> Brian Lachniet
> 
> Software Engineer
> 
> E: blachn...@gmail.com | blachniet.com 
> 
>  



[jira] [Updated] (AVRO-2469) Add data interop test to the Python3 bindings

2019-07-16 Thread Fokko Driesprong (JIRA)


 [ 
https://issues.apache.org/jira/browse/AVRO-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated AVRO-2469:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Add data interop test to the Python3 bindings
> -
>
> Key: AVRO-2469
> URL: https://issues.apache.org/jira/browse/AVRO-2469
> Project: Apache Avro
>  Issue Type: Test
>  Components: interop, python
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Major
>
> Currently, the Python3 bindings have a test called "TestDataFileInterop", but 
> it's not a real data interop test because it only checks read/write operation 
> within Python3 and doesn't read files generated by other languages.





[jira] [Commented] (AVRO-2469) Add data interop test to the Python3 bindings

2019-07-16 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/AVRO-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886124#comment-16886124
 ] 

ASF subversion and git services commented on AVRO-2469:
---

Commit fcb4764468cd1d70b3341c1488a394bb8f20929b in avro's branch 
refs/heads/master from Kengo Seki
[ https://gitbox.apache.org/repos/asf?p=avro.git;h=fcb4764 ]

AVRO-2469: Add data interop test to the Python3 bindings (#581)

* AVRO-2469: Add data interop test to the Python3 bindings

* Introduce with statement and pathlib to make the syntax clean

* Use DataFileReader and DataFileWriter with "with" statements


> Add data interop test to the Python3 bindings
> -
>
> Key: AVRO-2469
> URL: https://issues.apache.org/jira/browse/AVRO-2469
> Project: Apache Avro
>  Issue Type: Test
>  Components: interop, python
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Major
>
> Currently, the Python3 bindings have a test called "TestDataFileInterop", but 
> it's not a real data interop test because it only checks read/write operation 
> within Python3 and doesn't read files generated by other languages.







Re: supporting a "unit" field for avro schema

2019-07-16 Thread Ryan Skraba
Hello!  I've been thinking about this and I generally like the idea of
stronger types with units :D

I have some questions about what you are thinking of when you say "first
class concept" in Avro:
- Would you expect a writer schema that wrote a Fahrenheit field and a
reader schema that reads Celsius to interact transparently with generic
data?
- What about conversions that lose precision (i.e., if the above conversion
was on an INT field)?
- How much of "unit" support should be mandatory in the spec for cross
language operation?  (a unit-aware Scala writer with a Fahrenheit field and
a non-unit-aware reader with a Celsius field).
- To what degree would a generic reader of Avro data be required to support
quantity wrappers (i.e. how can we opt-in/opt-out cleanly from being
unit-aware)?

At scale, I'd be particularly keen to see the conversion detection (between
two schemas / fields / quantities) take place once, and then the
calculation reused for all of the subsequent datum passing through, but I'm
not sure how that would work.

We have some experience with passing a lot of client data through Avro, and
we use generic data quite a bit -- I'd be tempted to think of "float
(metres)" as a distinct type from "float (minutes)", but it would be a huge
(but potentially interesting) change for the way we look at data.  That
being said, as far as units go, we see a lot more unitless values (quantity
of items, percents and other ratios, ratings).  The most frequent numeric
values with units that we see are probably money or geolocation (in
practice, already normalized to lat/long -- although I just learned about
UTM!).  Surprisingly, there's not as much SI-type unit data as you might
expect.

I can definitely see the value of using a "unit" annotation in a generated
specific record for a supported language -- as proven by your scala work!
That might be an easy first target while working out what a first-class
concept in the spec would entail.  I missed Berlin Buzzwords by a day, but
enjoyed the video, thanks!

Ryan



On Tue, Jul 16, 2019 at 1:24 AM Erik Erlandson  wrote:

> If I'm interpreting the situation correctly, there is an "Avro Enhancement
> Proposal" process, but none have been filed in nearly a decade:
> https://cwiki.apache.org/confluence/display/AVRO/Avro+Enhancement+Proposals
>
> As a start, I submitted a jira to track this idea:
> https://issues.apache.org/jira/browse/AVRO-2474
>
>
>
> On Mon, Jul 8, 2019 at 10:42 AM Erik Erlandson 
> wrote:
>
> >
> > What should I do to move this forward? Does Avro have a PIP process?
> >
> >
> > On Sat, Jun 29, 2019 at 3:26 PM Erik Erlandson 
> > wrote:
> >
> >>
> >> Regarding schema, my proposal for fingerprints would be that units are
> >> fingerprinted based on their canonical form, as defined here
> >> <
> http://erikerlandson.github.io/blog/2019/05/03/algorithmic-unit-analysis/
> >.
> >> Any two unit expressions having the same canonical form (including the
> >> corresponding coefficients) are exactly equivalent, and so their
> >> fingerprints can be the same. Possibly the unit could be stored on the
> >> schema in canonical form by convention, although canonical forms are
> >> frequently not as intuitive to humans and so in that case the
> documentation
> >> value of the unit might be reduced for humans examining the schema.
> >>
> >> For schema evolution, a unit change such that the previous and new unit
> >> are convertible (also defined at the above link) would be well
> defined,
> >> and automatic transformation would just be the correct unit conversion
> >> (e.g. seconds to milliseconds). If the unit changes to a non-convertible
> >> unit (e.g. seconds to bytes) then no automatic transformation exists,
> and
> >> attempting to resolve the old and new schema would be an error. Note
> that
> >> establishing the conversion assumes that both original and new schemas
> are
> >> available at read time.
> >>
> >>
> >> On Sat, Jun 29, 2019 at 11:55 AM Niels Basjes  wrote:
> >>
> >>> I think we should approach this idea in two parts:
> >>>
> >>> 1) The schema. Things like does a different unit mean a different
> schema
> >>> fingerprint even though the bytes remain the same. What does a
> different
> >>> unit mean for schema evolution.
> >>>
> >>> 2) Language specifics. Scala has different possibilities than Java.
> >>>
> >>> On Sat, Jun 29, 2019, 18:59 Erik Erlandson 
> wrote:
> >>>
> >>> > I've been puzzling over what can be done to support this in more
> >>> > widely-used languages. The dilemma relative to the current language
> >>> > ecosystem is that languages with "modern" type systems (Haskell,
> Rust,
> >>> > Scala, etc) capable of supporting compile-time unit checking, in the
> >>> > particular style I've been exploring, are not yet widely used.
> >>> >
> >>> > With respect to Java, a couple approaches are plausible. One is to
> >>> enhance
> >>> > the language, for example with Java-8 compiler plugins. Another might
> >>> be to
> >>> > implement a unit type 

Re: Should a Schema be serializable in Java?

2019-07-16 Thread Ryan Skraba
Hello!  Thanks for the reference to AVRO-1852. It's exactly what I was
looking for.

I agree that Java serialization shouldn't be used for anything
cross-platform, or (in my opinion) used for any *data* persistence at all.
Especially not for an Avro container file or sending binary data through a
messaging system...

But Java serialization is definitely useful and used for sending instances
of "distributed work" implemented in Java from node to node in a cluster.
I'm not too worried about existing connectors -- we can see that each
framework has "solved" the problem one at a time.  In addition to Flink,
there's
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroUtils.java#L29
 and
https://github.com/apache/spark/blob/3663dbe541826949cecf5e1ea205fe35c163d147/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala#L35
.

Specifically, I see the advantage for user-defined distributed functions
that happen to carry along an Avro Schema -- and I can personally say that
I've encountered this a lot in our code!

That being said, I think it's probably overkill to warn the user about the
perils of Java serialization (not being cross-language and requiring
consistent JDKs and libraries across JVMs).  If an error occurs for one of
those reasons, there's a larger problem for the dev to address, and it's
just as likely to occur for any Java library in the job if the environment
is bad.  Related, we've encountered similar issues with logical types
existing in Avro 1.8 in the driver but not in Avro 1.7 on the cluster...
the solution is "make sure you don't do that".  (Looking at you, guava and
jackson!)

The patch in question delegates serialization to the string form of the
schema, so it's basically doing what all of the above Avro "holders" are
doing -- I wouldn't object to having a sample schema available that fully
exercises what a schema can hold, but I also think that Schema.Parser (used
underneath) is currently pretty well tested and mature!
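The delegate-to-the-string-form idea is language-agnostic; here is a toy Python sketch of the same shape (this is not Avro's API, and the actual patch is in Java):

```python
import json
import pickle

class ToySchema:
    """Stand-in for an Avro Schema holding a parsed JSON definition."""
    def __init__(self, definition):
        self.definition = definition

    def __str__(self):
        return json.dumps(self.definition)

    # Delegate serialization to the JSON text, re-parsing on the way
    # back in -- the same trick as the ad-hoc "Avro holder" wrappers.
    def __reduce__(self):
        return (parse, (str(self),))

def parse(text):
    return ToySchema(json.loads(text))

s = ToySchema({"type": "record", "name": "R", "fields": []})
s2 = pickle.loads(pickle.dumps(s))
assert s2.definition == s.definition
```

Only the schema text crosses the wire, so the wire format stays identical to what the existing framework-specific wrappers produce.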

Do you think this could be a candidate for 1.9.1 as a minor improvement?  I
can't think of any reason that this wouldn't be backwards compatible.

Ryan

side note: I wrote java.lang.Serializable earlier, which probably didn't
help my search for prior discussion... :/

On Tue, Jul 16, 2019 at 9:59 AM Ismaël Mejía  wrote:

> This is a good idea even if it may have some issues that we should
> probably document and warn users about:
>
> 1. Java based serialization is really practical for JVM based systems,
> but we should probably add a warning or documentation because Java
> serialization is not deterministic between JVMs so this could be a
> source for issues (usually companies use the same version of the JVM
> so this is less critical, but this still can happen especially now with
> all the different versions of Java and OpenJDK based flavors).
>
> 2. This is not cross language compatible, the String based
> representation (or even an Avro based representation of Schema) can be
> used in every language.
>
> Even with these I think just for ease of use it is worth making
> Schema Serializable. Is the plan to fully serialize it, or just to
> make it a String and serialize the String as done in the issue Doug
> mentioned?
> If we take the first approach we need to properly test with a Schema
> that has elements of the full specification that (de)-serialization
> works correctly. Does anyone know if we have already a test schema
> that covers the full ‘schema’ specification to reuse it if so?
>
> On Mon, Jul 15, 2019 at 11:46 PM Driesprong, Fokko 
> wrote:
> >
> > Correct me if I'm wrong here. But as far as I understood the way of
> > serializing the schema is using Avro, as it is part of the file. To avoid
> > confusion there should be one way of serializing.
> >
> > However, I'm not sure if this is worth the hassle of not simply
> > implementing serializable. Also in Flink there is a rather far from optimal
> > implementation:
> >
> https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
> > This converts it to JSON and back while distributing the schema to the
> > executors.
> >
> > Cheers, Fokko
> >
> > Op ma 15 jul. 2019 om 23:03 schreef Doug Cutting :
> >
> > > I can't think of a reason Schema should not implement Serializable.
> > >
> > > There's actually already an issue & patch for this:
> > >
> > > https://issues.apache.org/jira/browse/AVRO-1852
> > >
> > > Doug
> > >
> > > On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía 
> wrote:
> > >
> > > > +dev@avro.apache.org
> > > >
> > > > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba  wrote:
> > > > >
> > > > > Hello!
> > > > >
> > > > > I'm looking for any discussion or reference why the Schema object
> isn't
> > > > serializable -- I'm pretty sure this must have already been discussed
> > > (but
> > > > the keywords +avro +serializable +schema have MANY results in 

Re: Should a Schema be serializable in Java?

2019-07-16 Thread Ismaël Mejía
This is a good idea even if it may have some issues that we should
probably document and warn users about:

1. Java based serialization is really practical for JVM based systems,
but we should probably add a warning or documentation because Java
serialization is not deterministic between JVMs so this could be a
source for issues (usually companies use the same version of the JVM
so this is less critical, but this still can happen especially now with
all the different versions of Java and OpenJDK based flavors).

2. This is not cross language compatible, the String based
representation (or even an Avro based representation of Schema) can be
used in every language.

Even with these I think just for ease of use it is worth making
Schema Serializable. Is the plan to fully serialize it, or just to
make it a String and serialize the String as done in the issue Doug
mentioned?
If we take the first approach we need to properly test with a Schema
that has elements of the full specification that (de)-serialization
works correctly. Does anyone know if we have already a test schema
that covers the full ‘schema’ specification to reuse it if so?

On Mon, Jul 15, 2019 at 11:46 PM Driesprong, Fokko  wrote:
>
> Correct me if I'm wrong here. But as far as I understood the way of
> serializing the schema is using Avro, as it is part of the file. To avoid
> confusion there should be one way of serializing.
>
> However, I'm not sure if this is worth the hassle of not simply
> implementing serializable. Also in Flink there is a rather far from optimal
> implementation:
> https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
> This converts it to JSON and back while distributing the schema to the
> executors.
>
> Cheers, Fokko
>
> Op ma 15 jul. 2019 om 23:03 schreef Doug Cutting :
>
> > I can't think of a reason Schema should not implement Serializable.
> >
> > There's actually already an issue & patch for this:
> >
> > https://issues.apache.org/jira/browse/AVRO-1852
> >
> > Doug
> >
> > On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía  wrote:
> >
> > > +dev@avro.apache.org
> > >
> > > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba  wrote:
> > > >
> > > > Hello!
> > > >
> > > > I'm looking for any discussion or reference why the Schema object isn't
> > > serializable -- I'm pretty sure this must have already been discussed
> > (but
> > > the keywords +avro +serializable +schema have MANY results in all the
> > > searches I did: JIRA, stack overflow, mailing list, web)
> > > >
> > > > In particular, I was at a demo today where we were asked why Schemas
> > > needed to be passed as strings to run in distributed tasks.  I remember
> > > running into this problem years ago with MapReduce, and again in Spark,
> > and
> > > again in Beam...
> > > >
> > > > Is there any downside to making a Schema implement
> > > java.lang.Serializable?  The only thing I can think of is that the schema
> > > _should not_ be serialized with the data, and making it non-serializable
> > > loosely enforces this (at the cost of continually writing different
> > > flavours of "Avro holders" for when you really do want to serialize it).
> > > >
> > > > Willing to create a JIRA and work on the implementation, of course!
> > > >
> > > > All my best, Ryan
> > >
> >