Hello!  Thanks for the reference to AVRO-1852. It's exactly what I was
looking for.

I agree that Java serialization shouldn't be used for anything
cross-platform, or (in my opinion) for any *data* persistence at all --
especially not for an Avro container file, or for sending binary data
through a messaging system...

But Java serialization is definitely useful, and widely used, for sending
instances of "distributed work" implemented in Java from node to node in a
cluster.
I'm not too worried about existing connectors -- we can see that each
framework has "solved" the problem one at a time.  In addition to Flink,
there's
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroUtils.java#L29
 and
https://github.com/apache/spark/blob/3663dbe541826949cecf5e1ea205fe35c163d147/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriterFactory.scala#L35
.
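
For reference, the workaround typically looks something like this sketch
(the class and method names here are mine, purely illustrative, not taken
from any of those projects -- but it captures what those holders do):

    import java.io.Serializable;
    import org.apache.avro.Schema;

    /** Wraps a Schema, serializing its JSON string form instead. */
    public class SerializableSchemaHolder implements Serializable {
      private static final long serialVersionUID = 1L;

      // The JSON form travels through Java serialization...
      private final String schemaJson;

      // ...and the Schema itself is rebuilt lazily on the receiving JVM.
      private transient Schema schema;

      public SerializableSchemaHolder(Schema schema) {
        this.schemaJson = schema.toString();
        this.schema = schema;
      }

      public Schema getSchema() {
        if (schema == null) {
          schema = new Schema.Parser().parse(schemaJson);
        }
        return schema;
      }
    }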

Specifically, I see the advantage for user-defined distributed functions
that happen to carry along an Avro Schema -- and I can personally say that
I've encountered this a lot in our code!
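
As a hypothetical example of what I mean (the names are mine, not from our
code): once Schema implements Serializable, a user-defined function like
this can carry its schema directly, with no holder or String round trip:

    import java.io.Serializable;
    import java.util.function.Function;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;

    /** A function shipped to workers that carries its Avro Schema along. */
    public class ExtractField
        implements Function<GenericRecord, Object>, Serializable {
      private static final long serialVersionUID = 1L;

      // Today this field forces the String workaround; with a
      // Serializable Schema it can be held directly.
      private final Schema schema;  // assumed to be a record schema
      private final String fieldName;

      public ExtractField(Schema schema, String fieldName) {
        this.schema = schema;
        this.fieldName = fieldName;
      }

      @Override
      public Object apply(GenericRecord record) {
        // Validate against the schema carried with the function.
        if (schema.getField(fieldName) == null) {
          throw new IllegalArgumentException(
              fieldName + " not in " + schema.getFullName());
        }
        return record.get(fieldName);
      }
    }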

That being said, I think it's probably overkill to warn the user about the
perils of Java serialization (not being cross-language and requiring
consistent JDKs and libraries across JVMs).  If an error occurs for one of
those reasons, there's a larger problem for the dev to address, and it's
just as likely to occur for any Java library in the job if the environment
is bad.  Relatedly, we've encountered similar issues with logical types
existing in Avro 1.8 in the driver but not in Avro 1.7 on the cluster...
the solution is "make sure you don't do that".  (Looking at you, guava and
jackson!)

The patch in question delegates serialization to the string form of the
schema, so it's basically doing what all of the above Avro "holders" are
doing -- I wouldn't object to having a sample schema available that fully
exercises what a schema can hold, but I also think that Schema.Parser (used
underneath) is currently pretty well tested and mature!
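
For what it's worth, I'd expect the test to be a simple round trip,
something like this sketch (the sample schema here is trivial, not the
full-spec one you're describing, and of course it only runs once Schema
actually implements Serializable):

    import java.io.*;
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class SchemaRoundTrip {
      public static void main(String[] args) throws Exception {
        Schema original = SchemaBuilder.record("Example").fields()
            .requiredString("name")
            .optionalInt("age")
            .endRecord();

        // Serialize: with the patch, this delegates to original.toString().
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
          out.writeObject(original);
        }

        // Deserialize: the string form is re-parsed with Schema.Parser.
        Schema copy;
        try (ObjectInputStream in = new ObjectInputStream(
            new ByteArrayInputStream(bytes.toByteArray()))) {
          copy = (Schema) in.readObject();
        }

        // Schema equality is structural, so the round trip should be exact.
        if (!original.equals(copy)) {
          throw new AssertionError("Round trip changed the schema");
        }
      }
    }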

Do you think this could be a candidate for 1.9.1 as a minor improvement?  I
can't think of any reason that this wouldn't be backwards compatible.

Ryan

Side note: I wrote java.lang.Serializable (instead of java.io.Serializable)
earlier, which probably didn't help my search for prior discussion... :/

On Tue, Jul 16, 2019 at 9:59 AM Ismaël Mejía <ieme...@gmail.com> wrote:

> This is a good idea even if it may have some issues that we should
> probably document and warn users about:
>
> 1. Java-based serialization is really practical for JVM-based systems,
> but we should probably add a warning or documentation, because Java
> serialization is not deterministic between JVMs and so could be a
> source of issues (usually companies use the same version of the JVM,
> so this is less critical, but it can still happen, especially now with
> all the different versions of Java and OpenJDK-based flavors).
>
> 2. This is not cross-language compatible; the String-based
> representation (or even an Avro-based representation of Schema) can be
> used in every language.
>
> Even with these caveats, I think it is worth making Schema
> Serializable just for ease of use. Is the plan to fully serialize it,
> or just to convert it to a String and serialize the String, as done in
> the issue Doug mentioned?
> If we take the first approach, we need to properly test that
> (de)serialization works correctly with a Schema that has elements of
> the full specification. Does anyone know if we already have a test
> schema that covers the full ‘schema’ specification that we could reuse?
>
> On Mon, Jul 15, 2019 at 11:46 PM Driesprong, Fokko <fo...@driesprong.frl>
> wrote:
> >
> > Correct me if I'm wrong here, but as far as I understand, the way of
> > serializing the schema is using Avro, as it is part of the file. To
> > avoid confusion, there should be one way of serializing.
> >
> > However, I'm not sure if this is worth the hassle of not simply
> > implementing Serializable. Also, in Flink there is a rather
> > far-from-optimal implementation:
> > https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/avro/ParquetAvroWriters.java#L72
> > This converts it to JSON and back while distributing the schema to the
> > executors.
> >
> > Cheers, Fokko
> >
> > Op ma 15 jul. 2019 om 23:03 schreef Doug Cutting <cutt...@gmail.com>:
> >
> > > I can't think of a reason Schema should not implement Serializable.
> > >
> > > There's actually already an issue & patch for this:
> > >
> > > https://issues.apache.org/jira/browse/AVRO-1852
> > >
> > > Doug
> > >
> > > On Mon, Jul 15, 2019 at 6:49 AM Ismaël Mejía <ieme...@gmail.com>
> > > wrote:
> > >
> > > > +d...@avro.apache.org
> > > >
> > > > On Mon, Jul 15, 2019 at 3:30 PM Ryan Skraba <r...@skraba.com> wrote:
> > > > >
> > > > > Hello!
> > > > >
> > > > > I'm looking for any discussion or reference why the Schema
> > > > > object isn't serializable -- I'm pretty sure this must have
> > > > > already been discussed (but the keywords +avro +serializable
> > > > > +schema have MANY results in all the searches I did: JIRA, stack
> > > > > overflow, mailing list, web)
> > > > >
> > > > > In particular, I was at a demo today where we were asked why
> > > > > Schemas needed to be passed as strings to run in distributed
> > > > > tasks.  I remember running into this problem years ago with
> > > > > MapReduce, and again in Spark, and again in Beam...
> > > > >
> > > > > Is there any downside to making a Schema implement
> > > > > java.lang.Serializable?  The only thing I can think of is that
> > > > > the schema _should not_ be serialized with the data, and making
> > > > > it non-serializable loosely enforces this (at the cost of
> > > > > continually writing different flavours of "Avro holders" for
> > > > > when you really do want to serialize it).
> > > > >
> > > > > Willing to create a JIRA and work on the implementation, of course!
> > > > >
> > > > > All my best, Ryan
> > > >
> > >
>
