Re: Schema-Aware PCollections revisited

Reuven Lax Sun, 04 Feb 2018 11:24:06 -0800

Cool, let's chat about this on Slack for a bit (which I realized I've been
signed out of for some time).


Reuven

On Sun, Feb 4, 2018 at 9:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Sorry guys, I was off today. Happy to be part of the party too ;)
>
> Regards
> JB
>
> On 02/04/2018 06:19 PM, Reuven Lax wrote:
> > Romain, since you're interested maybe the two of us should put together a
> > proposal for how to set these things (hints, schema) on PCollections? I don't
> > think it'll be hard - the previous list thread on hints already agreed on a
> > general approach, and we would just need to flesh it out.
> >
> > BTW in the past when I looked, JSON schemas seemed to have some odd limitations
> > inherited from JavaScript (e.g. no distinction between integer and
> > floating-point types). Is that still true?
> >
> > Reuven
> >
> > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <rmannibu...@gmail.com
> > <mailto:rmannibu...@gmail.com>> wrote:
> >
> >
> >
> >     2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com
> >     <mailto:re...@google.com>>:
> >
> >
> >
> >         On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau
> >         <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
> >
> >
> >             2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com
> >             <mailto:re...@google.com>>:
> >
> >                 I'm not sure where proto comes from here. Proto is one example
> >                 of a type that has a schema, but only one example.
> >
> >                 1. In the initial prototype I want to avoid modifying the
> >                 PCollection API. So I think it's best to create a special
> >                 SchemaCoder, and pass the schema into this coder. Later we might
> >                 add targeted APIs for this instead of going through a coder.
> >                 1.a I don't see what hints have to do with this?
> >
> >
> >             Hints are a way to replace the new API and unify the way to pass
> >             metadata in Beam instead of adding a new custom way each time.
> >
> >         I don't think schema is a hint. But I hear what you're saying - a hint is a
> >         type of PCollection metadata, as is schema, and we should have a unified
> >         API for setting such metadata.
> >
> >
> >     :), Ismael pointed out to me earlier this week that "hint" had an old meaning
> >     in Beam. My usage is purely the one found in most EE specs (your "metadata" in
> >     the previous answer). But I guess we are aligned on the meaning now, just wanted
> >     to be sure.
> >
> >
> >
> >
> >
> >
> >
> >                 2. BeamSQL already has a generic record type which fits this use
> >                 case very well (though we might modify it). However as mentioned
> >                 in the doc, the user is never forced to use this generic record
> >                 type.
> >
> >
> >             Well yes and no. A type already exists but 1. it is very strictly
> >             limited (flat/columns only, which is a small part of what big data SQL
> >             can do) and 2. it must be aligned with the convergence on generic data
> >             that the schema will bring (really read "aligned" as "dropped in favor
> >             of" - deprecation being a smooth way to do it).
> >
> >
> >         As I said the existing class needs to be modified and extended, and not
> >         just for this schema use case. It was meant to represent Calcite SQL rows,
> >         but doesn't quite even do that yet (Calcite supports nested rows).
> >         However I think it's the right basis to start from.
> >
> >
> >     Agree on the state. The current impl issues I hit (in addition to the nested
> >     support, which would by itself require a kind of visitor solution) are that
> >     the record owns the schema and that serialization is handled field by field
> >     instead of as a whole, which is how it would be handled with a
> >     schema IMHO.
> >
> >     Concretely what I don't want is to do a PoC which works - they all work,
> >     right? - and integrate it into Beam without thinking about a global solution
> >     for this generic record issue and its schema standardization. This is where
> >     Json(-P) has a lot of value IMHO, but it requires a bit more love than just
> >     adding schema in the model.
> >
> >
> >
> >
> >
> >             So long story short the main work of this schema track is not only
> >             on using schema in runners and other ways but also starting to make
> >             Beam consistent with itself, which is probably the most important
> >             outcome since it is the user-facing side of this work.
> >
> >
> >
> >                 On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau
> >                 <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
> >
> >                     @Reuven: is the proto only about passing schema or also the
> >                     generic type?
> >
> >                     There are 2.5 topics to solve this issue:
> >
> >                     1. How to pass schema
> >                     1.a. hints?
> >                     2. What is the generic record type associated with a schema
> >                     and how to express a schema relative to it
> >
> >                     I would be happy to help on 1.a and 2 somehow if you need.
> >
> >                     On Feb 4, 2018 03:30, "Reuven Lax" <re...@google.com
> >                     <mailto:re...@google.com>> wrote:
> >
> >                         One more thing. If anyone here has experience with
> >                         various OSS metadata stores (Kafka Schema Registry
> >                         is one example), would you like to collaborate on
> >                         implementation? I want to make sure that source schemas
> >                         can be stored in a variety of OSS metadata stores, and
> >                         be easily pulled into a Beam pipeline.
> >
> >                         Reuven
> >
> >                         On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax
> >                         <re...@google.com <mailto:re...@google.com>> wrote:
> >
> >                             Hi all,
> >
> >                             If there are no concerns, I would like to start
> >                             working on a prototype. It's just a prototype, so I
> >                             don't think it will have the final API (e.g. for the
> >                             prototype I'm going to avoid changing the API of
> >                             PCollection, and use a "special" Coder instead).
> >                             Also even once we go beyond the prototype, it will be
> >                             @Experimental for some time, so the API will not be
> >                             set in stone.
> >
> >                             Any more comments on this approach before we start
> >                             implementing a prototype?
> >
> >                             Reuven
> >
> >                             On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau
> >                             <rmannibu...@gmail.com
> >                             <mailto:rmannibu...@gmail.com>> wrote:
> >
> >                                 If you need help on the JSON part I'm happy to
> >                                 help. To give a few hints on what is very
> >                                 doable: we can add an Avro module to Johnzon
> >                                 (the ASF JSON-P/JSON-B impl) to back JSON-P by
> >                                 Avro (I guess it will be one of the first to be
> >                                 asked for), for instance.
> >
> >
> >                                 Romain Manni-Bucau
> >
> >
> >                                 2018-01-31 22:06 GMT+01:00 Reuven Lax
> >                                 <re...@google.com <mailto:re...@google.com>>:
> >
> >                                     Agree. The initial implementation will be a
> >                                     prototype.
> >
> >                                     On Wed, Jan 31, 2018 at 12:21 PM,
> >                                     Jean-Baptiste Onofré <j...@nanthrax.net
> >                                     <mailto:j...@nanthrax.net>> wrote:
> >
> >                                         Hi Reuven,
> >
> >                                         Agree on being able to describe the schema
> >                                         with different formats. The good point
> >                                         about JSON schemas is that they are
> >                                         described by a spec. My point is also to
> >                                         avoid reinventing the wheel. Just an
> >                                         abstraction to be able to use Avro, Json,
> >                                         Calcite, custom schema descriptors would
> >                                         be great.
> >
> >                                         Using a coder to describe a schema sounds
> >                                         like a smart move to implement quickly.
> >                                         However, it has to be clear in terms of
> >                                         documentation to avoid "side effects". I
> >                                         still think PCollection.setSchema() is
> >                                         better: it should be metadata (or hint
> >                                         ;))) on the PCollection.
> >
> >                                         Regards
> >                                         JB
> >
> >                                         On 31/01/2018 20:16, Reuven Lax wrote:
> >
> >                                             As to the question of how a schema
> >                                             should be specified, I want to
> >                                             support several common schema
> >                                             formats. So if a user has a Json
> >                                             schema, or an Avro schema, or a
> >                                             Calcite schema, etc. there should be
> >                                             adapters that allow setting a schema
> >                                             from any of them. I don't think we
> >                                             should prefer one over the other.
> >                                             While Romain is right that many
> >                                             people know Json, I think far fewer
> >                                             people know Json schemas.
> >
> >                                             Agree, schemas should not be
> >                                             enforced (for one thing, that
> >                                             wouldn't be backwards compatible!).
> >                                             I think for the initial prototype I
> >                                             will probably use a special coder to
> >                                             represent the schema (with setSchema
> >                                             an option on the coder), largely
> >                                             because it doesn't require modifying
> >                                             PCollection. However I think longer
> >                                             term a schema should be an optional
> >                                             piece of metadata on the PCollection
> >                                             object. Similar to the previous
> >                                             discussion about "hints," I think
> >                                             this can be set on the producing
> >                                             PTransform, and a SetSchema
> >                                             PTransform will allow attaching a
> >                                             schema to any PCollection (i.e.
> >                                             pc.apply(SetSchema.of(schema))).
> >                                             This part isn't designed yet, but I
> >                                             think schema should be similar to
> >                                             hints, it's just another piece of
> >                                             metadata on the PCollection (though
> >                                             something interpreted by the model,
> >                                             where hints are interpreted by the
> >                                             runner)
> >
> >                                             Reuven
> >
> >                                             On Tue, Jan 30, 2018 at 1:37 AM,
> >                                             Jean-Baptiste Onofré
> >                                             <j...@nanthrax.net
> >                                             <mailto:j...@nanthrax.net>> wrote:
> >
> >                                                 Hi,
> >
> >                                                 I think we should avoid mixing
> >                                                 two things in the discussion
> >                                                 (and so the document):
> >
> >                                                 1. The element of the collection
> >                                                 and the schema itself are two
> >                                                 different things.
> >                                                 By essence, Beam should not
> >                                                 enforce any schema. That's why I
> >                                                 think it's a good idea to set the
> >                                                 schema optionally on the
> >                                                 PCollection
> >                                                 (pcollection.setSchema()).
> >
> >                                                 2. From point 1 come two
> >                                                 questions: how do we represent a
> >                                                 schema? How can we leverage the
> >                                                 schema to simplify the
> >                                                 serialization of the element in
> >                                                 the PCollection and query? These
> >                                                 two questions are not directly
> >                                                 related.
> >
> >                                                   2.1 How do we represent the schema
> >                                                 Json Schema is a very interesting
> >                                                 idea. It could be an abstraction,
> >                                                 and other providers, like Avro,
> >                                                 can be bound to it. It's part of
> >                                                 the json processing spec (javax).
> >
> >                                                   2.2. How do we leverage the
> >                                                   schema for query and serialization
> >                                                 Also in the spec, json pointer is
> >                                                 interesting for the querying.
> >                                                 Regarding the serialization,
> >                                                 jackson or another data binder
> >                                                 can be used.
> >
> >                                                 These are still rough ideas in my
> >                                                 mind, but I like Romain's idea
> >                                                 about json-p usage.
> >
> >                                                 Once the 2.3.0 release is out, I
> >                                                 will start to update the document
> >                                                 with those ideas, and a PoC.
> >
> >                                                 Thanks !
> >                                                 Regards
> >                                                 JB
> >
> >                                                 On 01/30/2018 08:42 AM, Romain
> >                                                 Manni-Bucau wrote:
> >                                                 >
> >                                                 >
> >                                                 > On Jan 30, 2018 01:09,
> >                                                 > "Reuven Lax" <re...@google.com
> >                                                 > <mailto:re...@google.com>> wrote:
> >                                                 >
> >                                                 >
> >                                                 >
> >                                                 >     On Mon, Jan 29, 2018 at
> >                                                 >     12:17 PM, Romain Manni-Bucau
> >                                                 >     <rmannibu...@gmail.com
> >                                                 >     <mailto:rmannibu...@gmail.com>> wrote:
> >                                                  >
> >                                                  >         Hi
> >                                                  >
> >                                                  >         I have some questions
> >                                                  >         on this: how would
> >                                                  >         hierarchic schemas work?
> >                                                  >         It seems they are not
> >                                                  >         really supported by the
> >                                                  >         ecosystem (outside of
> >                                                  >         custom stuff) :(.
> >                                                  >         How would it integrate
> >                                                  >         smoothly with other
> >                                                  >         generic record types -
> >                                                  >         N bridges?
> >                                                  >
> >                                                  >
> >                                                  >     Do you mean nested
> >                                                  >     schemas? What do you mean
> >                                                  >     here?
> >                                                  >
> >                                                  >
> >                                                  > Yes, sorry - wrote the mail
> >                                                  > too late ;). Was hierarchic
> >                                                  > data and nested schemas.
> >                                                  >
> >                                                  >
> >                                                  >         Concretely I wonder if
> >                                                  >         using the json API
> >                                                  >         couldn't be beneficial:
> >                                                  >         json-p is a nice generic
> >                                                  >         abstraction with a
> >                                                  >         built-in querying
> >                                                  >         mechanism (jsonpointer)
> >                                                  >         but no actual
> >                                                  >         serialization (even if
> >                                                  >         json and binary json are
> >                                                  >         very natural). The big
> >                                                  >         advantage is to have a
> >                                                  >         well-known ecosystem -
> >                                                  >         who doesn't know json
> >                                                  >         today? - that Beam can
> >                                                  >         reuse for free:
> >                                                  >         JsonObject (guess we
> >                                                  >         don't want the JsonValue
> >                                                  >         abstraction) for the
> >                                                  >         record type, the
> >                                                  >         jsonschema standard for
> >                                                  >         the schema, jsonpointer
> >                                                  >         for the
> >                                                  >         selection/projection
> >                                                  >         etc... It doesn't
> >                                                  >         enforce the actual
> >                                                  >         serialization (json,
> >                                                  >         smile, avro, ...) but
> >                                                  >         provides an expressive
> >                                                  >         and already known API,
> >                                                  >         so I see it as a big
> >                                                  >         win-win for users (no
> >                                                  >         need to learn a new API
> >                                                  >         and use N bridges in
> >                                                  >         all ways) and Beam
> >                                                  >         (impls are here and the
> >                                                  >         API design is already
> >                                                  >         thought out).
> >                                                  >
> >                                                  >
> >                                                  >     I assume you're talking
> >                                                  >     about the API for setting
> >                                                  >     schemas, not using them.
> >                                                  >     Json has many downsides
> >                                                  >     and I'm not sure it's true
> >                                                  >     that everyone knows it;
> >                                                  >     there are also competing
> >                                                  >     schema APIs, such as Avro
> >                                                  >     etc. However I think we
> >                                                  >     should give Json a fair
> >                                                  >     evaluation before
> >                                                  >     dismissing it.
> >                                                  >
> >                                                  >
> >                                                  > It is a wider topic than
> >                                                  > schema. Actually schemas are
> >                                                  > not the first-class citizen,
> >                                                  > but a generic data
> >                                                  > representation is. That is
> >                                                  > where json beats almost any
> >                                                  > other API.
> >                                                  > Then, when it comes to schema,
> >                                                  > json has a standard for that
> >                                                  > so we are all good.
> >                                                  >
> >                                                  > Also json has a good indexing
> >                                                  > API compared to alternatives
> >                                                  > which are sometimes a bit
> >                                                  > faster - for noop transforms -
> >                                                  > but are hardly usable or make
> >                                                  > the code not that readable.
> >                                                  >
> >                                                  > Avro is a nice competitor but
> >                                                  > it is compatible - actually
> >                                                  > avro is json driven by design -
> >                                                  > but its API is far from being
> >                                                  > that easy due to its schema
> >                                                  > enforcement, which is
> >                                                  > heavvvyyy, and worse, you
> >                                                  > can't work with avro without a
> >                                                  > schema. Json would allow us to
> >                                                  > reconcile the dynamic and
> >                                                  > static cases since the job
> >                                                  > wouldn't change except for the
> >                                                  > setSchema.
> >                                                  >
> >                                                  > That is why I think json is a
> >                                                  > good compromise, and having a
> >                                                  > standard API for it allows
> >                                                  > fully customizing the impl at
> >                                                  > will if needed - even using
> >                                                  > avro or protobuf.
> >                                                  >
> >                                                  > Side note on the Beam API: I
> >                                                  > don't think it is good to use
> >                                                  > a main API for runner
> >                                                  > optimization. It enforces
> >                                                  > something to be shared across
> >                                                  > all runners but not widely
> >                                                  > usable. It is also misleading
> >                                                  > for users. Would you set a
> >                                                  > Flink pipeline option with
> >                                                  > Dataflow? My proposal here is
> >                                                  > to use hints - properties -
> >                                                  > instead of something rigidly
> >                                                  > defined in the API, then
> >                                                  > standardize it if all runners
> >                                                  > support it.
> >                                                  >
> >                                                  >
> >                                                  >
> >                                                  >         Wdyt?
> >                                                  >
> >                                                  >         On Jan 29, 2018 06:24,
> >                                                  >         "Jean-Baptiste Onofré"
> >                                                  >         <j...@nanthrax.net
> >                                                  >         <mailto:j...@nanthrax.net>> wrote:
> >
> >                                                  >
> >                                                  >             Hi Reuven,
> >                                                  >
> >                                                  >             Thanks for the
> >                                                  >             update! As I'm
> >                                                  >             working with you on
> >                                                  >             this, I fully agree,
> >                                                  >             and it's a great doc
> >                                                  >             gathering the ideas.
> >                                                  >
> >                                                  >             It's clearly
> >                                                  >             something we have to
> >                                                  >             add asap in Beam,
> >                                                  >             because it would
> >                                                  >             allow new use cases
> >                                                  >             for our users (in a
> >                                                  >             simple way) and open
> >                                                  >             new areas for the
> >                                                  >             runners (for instance
> >                                                  >             dataframe support in
> >                                                  >             the Spark runner).
> >                                                  >
> >                                                  >             By the way, a while
> >                                                  >             ago, I created
> >                                                  >             BEAM-3437 to track
> >                                                  >             the PoC/PR around
> >                                                  >             this.
> >                                                  >
> >                                                  >             Thanks !
> >                                                  >
> >                                                  >             Regards
> >                                                  >             JB
> >                                                  >
> >                                                  >             On 01/29/2018 02:08 AM, Reuven Lax wrote:
> >                                                  >             > Previously I submitted a proposal for adding schemas as a first-class
> >                                                  >             > concept on Beam PCollections. The proposal engendered quite a bit of
> >                                                  >             > discussion from the community - more discussion than I've seen from
> >                                                  >             > almost any of our proposals to date!
> >                                                  >             >
> >                                                  >             > Based on the feedback and comments, I reworked the proposal document
> >                                                  >             > quite a bit. It now talks more explicitly about the difference between
> >                                                  >             > dynamic schemas (where the schema is not fully known at graph-creation
> >                                                  >             > time) and static schemas (which are fully known at graph-creation
> >                                                  >             > time). Proposed APIs are more fleshed out now (again thanks to feedback
> >                                                  >             > from community members), and the document talks in more detail about
> >                                                  >             > evolving schemas in long-running streaming pipelines.
> >                                                  >             >
> >                                                  >             > Please take a look. I think this will be very valuable to Beam, and I
> >                                                  >             > welcome any feedback.
> >                                                  >             >
> >                                                  >             >
> >                                                  >
> >
> >                                                  >             > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
> >                                                  >             >
> >                                                  >             > Reuven
> >                                                  >
> >                                                  >             --
> >                                                  >             Jean-Baptiste Onofré
> >                                                  >             jbono...@apache.org
> >                                                  >             http://blog.nanthrax.net
> >                                                  >             Talend - http://www.talend.com
> >                                                  >
> >                                                  >
> >                                                  >
> >
> >                                                 --
> >                                                 Jean-Baptiste Onofré
> >                                                 jbono...@apache.org
> >                                                 http://blog.nanthrax.net
> >                                                 Talend - http://www.talend.com
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
