Re: Schema-Aware PCollections revisited

Jean-Baptiste Onofré Sun, 04 Feb 2018 09:21:33 -0800

Sorry guys, I was off today. Happy to be part of the party too ;)

Regards
JB


On 02/04/2018 06:19 PM, Reuven Lax wrote:
> Romain, since you're interested maybe the two of us should put together a
> proposal for how to set these things (hints, schema) on PCollections? I don't
> think it'll be hard - the previous list thread on hints already agreed on a
> general approach, and we would just need to flesh it out.
> 
> BTW in the past when I looked, Json schemas seemed to have some odd limitations
> inherited from Javascript (e.g. no distinction between integer and
> floating-point types). Is that still true?
> 
> Reuven
> 
> On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <rmannibu...@gmail.com
> <mailto:rmannibu...@gmail.com>> wrote:
> 
> 
> 
>     2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com
>     <mailto:re...@google.com>>:
> 
> 
> 
>         On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau
>         <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
> 
> 
>             2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com
>             <mailto:re...@google.com>>:
> 
>                 I'm not sure where proto comes from here. Proto is one example
>                 of a type that has a schema, but only one example.
> 
>                 1. In the initial prototype I want to avoid modifying the
>                 PCollection API. So I think it's best to create a special
>                 SchemaCoder, and pass the schema into this coder. Later we might
>                 add targeted APIs for this instead of going through a coder.
>                 1.a I don't see what hints have to do with this?
> 
> 
>             Hints are a way to replace the new API and unify the way to pass
>             metadata in beam instead of adding a new custom way each time.
> 
> 
>         I don't think schema is a hint. But I hear what you're saying - hint is a
>         type of PCollection metadata as is schema, and we should have a unified
>         API for setting such metadata.
> 
> 
>     :), Ismael pointed out to me earlier this week that "hint" had an old meaning
>     in beam. My usage is purely the one found in most EE specs (your "metadata" in
>     your previous answer). But I guess we are aligned on the meaning now, just
>     wanted to be sure.
> 
>                 2. BeamSQL already has a generic record type which fits this use
>                 case very well (though we might modify it). However as mentioned
>                 in the doc, the user is never forced to use this generic record
>                 type.
> 
> 
>             Well, yes and no. A type already exists but 1. it is very strictly
>             limited (flat/columns only, which is very little of what big data SQL
>             can do) and 2. it must be aligned with the convergence on generic data
>             that the schema will bring (really read "aligned" as "dropped in favor
>             of" - deprecation being a smooth way to do it).
> 
> 
>         As I said the existing class needs to be modified and extended, and not
>         just for this schema use case. It was meant to represent Calcite SQL rows,
>         but doesn't quite even do that yet (Calcite supports nested rows).
>         However I think it's the right basis to start from.
> 
> 
>     Agree on the state. The issues I hit with the current impl (in addition to
>     nested support, which would by itself require a kind of visitor solution) are
>     that the record owns the schema and that serialization is handled field by
>     field instead of as a whole, which is how it would be handled with a schema
>     IMHO.
> 
>     Concretely, what I don't want is to do a PoC which works (they all work,
>     right?) and integrate it into beam without thinking about a global solution
>     for this generic record issue and its schema standardization. This is where
>     Json(-P) has a lot of value IMHO, but it requires a bit more love than just
>     adding a schema to the model.
> 
>             So long story short, the main work of this schema track is not only
>             about using schemas in runners and elsewhere, but also about starting
>             to make beam consistent with itself, which is probably the most
>             important outcome since it is the user-facing side of this work.
>              
> 
> 
>                 On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau
>                 <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
> 
>                     @Reuven: is the proto only about passing schema or also the
>                     generic type?
> 
>                     There are 2.5 topics to solve this issue:
> 
>                     1. How to pass schema
>                     1.a. hints?
>                     2. What is the generic record type associated with a schema,
>                     and how to express a schema relative to it
> 
>                     I would be happy to help on 1.a and 2 somehow if you need.
> 
>                     On 4 Feb 2018 03:30, "Reuven Lax" <re...@google.com
>                     <mailto:re...@google.com>> wrote:
> 
>                         One more thing. If anyone here has experience with
>                         various OSS metadata stores (Kafka Schema Registry
>                         is one example), would you like to collaborate on
>                         implementation? I want to make sure that source schemas
>                         can be stored in a variety of OSS metadata stores, and
>                         be easily pulled into a Beam pipeline.
> 
>                         Reuven
> 
>                         On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax
>                         <re...@google.com <mailto:re...@google.com>> wrote:
> 
>                             Hi all,
> 
>                             If there are no concerns, I would like to start
>                             working on a prototype. It's just a prototype, so I
>                             don't think it will have the final API (e.g. for the
>                             prototype I'm going to avoid changing the API of
>                             PCollection, and use a "special" Coder instead).
>                             Also even once we go beyond prototype, it will be
>                             @Experimental for some time, so the API will not be
>                             set in stone.
> 
>                             Any more comments on this approach before we start
>                             implementing a prototype?
> 
>                             Reuven
> 
>                             On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau
>                             <rmannibu...@gmail.com
>                             <mailto:rmannibu...@gmail.com>> wrote:
> 
>                                 If you need help on the json part I'm happy to
>                                 help. To give a hint of what is very doable: we
>                                 can, for instance, add an avro module to johnzon
>                                 (asf json{p,b} impl) to back jsonp by avro (I
>                                 guess it will be one of the first to be asked
>                                 for).
> 
> 
>                                 Romain Manni-Bucau
>                                 @rmannibucau <https://twitter.com/rmannibucau> |
>                                 Blog <https://rmannibucau.metawerx.net/> |
>                                 Old Blog <http://rmannibucau.wordpress.com> |
>                                 Github <https://github.com/rmannibucau> |
>                                 LinkedIn <https://www.linkedin.com/in/rmannibucau>
> 
>                                 2018-01-31 22:06 GMT+01:00 Reuven Lax
>                                 <re...@google.com <mailto:re...@google.com>>:
> 
>                                     Agree. The initial implementation will be a
>                                     prototype.
> 
>                                     On Wed, Jan 31, 2018 at 12:21 PM,
>                                     Jean-Baptiste Onofré <j...@nanthrax.net
>                                     <mailto:j...@nanthrax.net>> wrote:
> 
>                                         Hi Reuven,
> 
>                                         I agree that we should be able to describe
>                                         the schema in different formats. The good
>                                         point about json schemas is that they are
>                                         described by a spec. My point is also to
>                                         avoid reinventing the wheel. Just an
>                                         abstraction that lets us use Avro, Json,
>                                         Calcite, or custom schema descriptors
>                                         would be great.
> 
>                                         Using a coder to describe a schema sounds
>                                         like a smart way to implement this quickly.
>                                         However, it has to be clear in terms of
>                                         documentation to avoid "side effects". I
>                                         still think PCollection.setSchema() is
>                                         better: it should be metadata (or hint
>                                         ;))) on the PCollection.
> 
>                                         Regards
>                                         JB
> 
>                                         On 31/01/2018 20:16, Reuven Lax wrote:
> 
>                                             As to the question of how a schema
>                                             should be specified, I want to
>                                             support several common schema
>                                             formats. So if a user has a Json
>                                             schema, or an Avro schema, or a
>                                             Calcite schema, etc. there should be
>                                             adapters that allow setting a schema
>                                             from any of them. I don't think we
>                                             should prefer one over the other.
>                                             While Romain is right that many
>                                             people know Json, I think far fewer
>                                             people know Json schemas.
> 
>                                             Agree, schemas should not be
>                                             enforced (for one thing, that
>                                             wouldn't be backwards compatible!).
>                                             I think for the initial prototype I
>                                             will probably use a special coder to
>                                             represent the schema (with setSchema
>                                             an option on the coder), largely
>                                             because it doesn't require modifying
>                                             PCollection. However I think longer
>                                             term a schema should be an optional
>                                             piece of metadata on the PCollection
>                                             object. Similar to the previous
>                                             discussion about "hints," I think
>                                             this can be set on the producing
>                                             PTransform, and a SetSchema
>                                             PTransform will allow attaching a
>                                             schema to any PCollection (i.e.
>                                             pc.apply(SetSchema.of(schema))).
>                                             This part isn't designed yet, but I
>                                             think schema should be similar to
>                                             hints, it's just another piece of
>                                             metadata on the PCollection (though
>                                             something interpreted by the model,
>                                             where hints are interpreted by the
>                                             runner).
> 
>                                             Reuven
> 
>                                             On Tue, Jan 30, 2018 at 1:37 AM,
>                                             Jean-Baptiste Onofré
>                                             <j...@nanthrax.net
>                                             <mailto:j...@nanthrax.net>> wrote:
> 
>                                                 Hi,
> 
>                                                 I think we should avoid mixing two
>                                                 things in the discussion (and so in
>                                                 the document):
> 
>                                                 1. The element of the collection and
>                                                 the schema itself are two different
>                                                 things. By essence, Beam should not
>                                                 enforce any schema. That's why I
>                                                 think it's a good idea to set the
>                                                 schema optionally on the PCollection
>                                                 (pcollection.setSchema()).
> 
>                                                 2. From point 1 come two questions:
>                                                 how do we represent a schema? How can
>                                                 we leverage the schema to simplify
>                                                 the serialization of the elements in
>                                                 the PCollection and querying? These
>                                                 two questions are not directly
>                                                 related.
> 
>                                                   2.1 How do we represent the schema
>                                                 Json Schema is a very interesting
>                                                 idea. It could be an abstraction, and
>                                                 other providers, like Avro, can be
>                                                 bound to it. It's part of the json
>                                                 processing spec (javax).
> 
>                                                   2.2. How do we leverage the schema
>                                                   for query and serialization
>                                                 Also in the spec, json pointer is
>                                                 interesting for the querying.
>                                                 Regarding the serialization, jackson
>                                                 or another data binder can be used.
> 
>                                                 These are still rough ideas in my
>                                                 mind, but I like Romain's idea about
>                                                 json-p usage.
> 
>                                                 Once the 2.3.0 release is out, I will
>                                                 start to update the document with
>                                                 those ideas, and a PoC.
> 
>                                                 Thanks !
>                                                 Regards
>                                                 JB
> 
>                                                 On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>                                                 >
>                                                 >
>                                                 > On 30 Jan 2018 01:09, "Reuven Lax"
>                                                 > <re...@google.com
>                                                 > <mailto:re...@google.com>> wrote:
>                                                 >
>                                                 >
>                                                 >
>                                                 >     On Mon, Jan 29, 2018 at 12:17 PM,
>                                                 >     Romain Manni-Bucau
>                                                 >     <rmannibu...@gmail.com
>                                                 >     <mailto:rmannibu...@gmail.com>> wrote:
>                                                  >
>                                                  >         Hi
>                                                  >
>                                                  >         I have some questions on this: how
>                                                  >         would hierarchical schemas work? It
>                                                  >         seems they are not really supported
>                                                  >         by the ecosystem (apart from custom
>                                                  >         stuff) :(. How would it integrate
>                                                  >         smoothly with other generic record
>                                                  >         types - N bridges?
>                                                  >
>                                                  >
>                                                  >     Do you mean nested schemas? What do
>                                                  >     you mean here?
>                                                  >
>                                                  >
>                                                  > Yes, sorry - wrote the mail too late ;).
>                                                  > I meant hierarchical data and nested
>                                                  > schemas.
>                                                  >
>                                                  >
>                                                  >         Concretely I wonder if using the json
>                                                  >         API couldn't be beneficial: json-p is
>                                                  >         a nice generic abstraction with a
>                                                  >         built-in querying mechanism
>                                                  >         (jsonpointer) but no actual
>                                                  >         serialization (even if json and binary
>                                                  >         json are very natural). The big
>                                                  >         advantage is to have a well-known
>                                                  >         ecosystem - who doesn't know json
>                                                  >         today? - that beam can reuse for free:
>                                                  >         JsonObject (guess we don't want the
>                                                  >         JsonValue abstraction) for the record
>                                                  >         type, the jsonschema standard for the
>                                                  >         schema, jsonpointer for the
>                                                  >         selection/projection etc... It doesn't
>                                                  >         enforce the actual serialization
>                                                  >         (json, smile, avro, ...) but provides
>                                                  >         an expressive and already known API,
>                                                  >         so I see it as a big win-win for users
>                                                  >         (no need to learn a new API and use N
>                                                  >         bridges in all ways) and beam (impls
>                                                  >         are here and the API design is already
>                                                  >         thought out).
>                                                  >
>                                                  >
>                                                  >     I assume you're talking about the API
>                                                  >     for setting schemas, not using them.
>                                                  >     Json has many downsides and I'm not sure
>                                                  >     it's true that everyone knows it; there
>                                                  >     are also competing schema APIs, such as
>                                                  >     Avro etc. However I think we should give
>                                                  >     Json a fair evaluation before dismissing
>                                                  >     it.
>                                                  >
>                                                  >
>                                                  > It is a wider topic than schemas. Actually
>                                                  > schemas are not the first-class citizen;
>                                                  > a generic data representation is. That is
>                                                  > where json beats almost any other API.
>                                                  > Then, when it comes to schemas, json has a
>                                                  > standard for that so we are all good.
>                                                  >
>                                                  > Also json has a good indexing API compared
>                                                  > to alternatives, which are sometimes a bit
>                                                  > faster - for noop transforms - but are
>                                                  > hardly usable or make the code not that
>                                                  > readable.
>                                                  >
>                                                  > Avro is a nice competitor and it is
>                                                  > compatible (avro is actually json driven
>                                                  > by design), but its API is far from being
>                                                  > that easy due to its schema enforcement,
>                                                  > which is heavvvyyy, and worse, you can't
>                                                  > work with avro without a schema. Json
>                                                  > would allow reconciling the dynamic and
>                                                  > static cases since the job wouldn't change
>                                                  > except for the setschema.
>                                                  >
>                                                  > That is why I think json is a good
>                                                  > compromise, and having a standard API for
>                                                  > it allows fully customizing the impl at
>                                                  > will if needed - even using avro or
>                                                  > protobuf.
>                                                  >
>                                                  > Side note on the beam api: I don't think
>                                                  > it is good to use a main API for runner
>                                                  > optimization. It forces something to be
>                                                  > shared across all runners even though it
>                                                  > is not widely usable. It is also
>                                                  > misleading for users. Would you set a
>                                                  > flink pipeline option with dataflow? My
>                                                  > proposal here is to use hints -
>                                                  > properties - instead of something rigidly
>                                                  > defined in the API, then standardize it if
>                                                  > all runners support it.
>                                                  >
>                                                  >
>                                                  >
>                                                  >         Wdyt?
>                                                  >
>                                                  >         On 29 Jan 2018 06:24,
>                                                  >         "Jean-Baptiste Onofré"
>                                                  >         <j...@nanthrax.net
>                                                  >         <mailto:j...@nanthrax.net>> wrote:
> 
>                                                  >
>                                                  >             Hi Reuven,
>                                                  >
>                                                  >             Thanks for the update! As I'm
>                                                  >             working with you on this, I fully
>                                                  >             agree, and it's a great doc
>                                                  >             gathering the ideas.
>                                                  >
>                                                  >             It's clearly something we have to
>                                                  >             add asap in Beam, because it would
>                                                  >             allow new use cases for our users
>                                                  >             (in a simple way) and open new
>                                                  >             areas for the runners (for
>                                                  >             instance dataframe support in the
>                                                  >             Spark runner).
>                                                  >
>                                                  >             By the way, a while ago, I created
>                                                  >             BEAM-3437 to track the PoC/PR
>                                                  >             around this.
>                                                  >
>                                                  >             Thanks !
>                                                  >
>                                                  >             Regards
>                                                  >             JB
>                                                  >
>                                                  >             On 01/29/2018 02:08 AM, Reuven Lax wrote:
>                                                  >             > Previously I submitted a proposal for adding schemas
>                                                  >             > as a first-class concept on Beam PCollections. The
>                                                  >             > proposal engendered quite a bit of discussion from the
>                                                  >             > community - more discussion than I've seen from almost
>                                                  >             > any of our proposals to date!
>                                                  >             >
>                                                  >             > Based on the feedback and comments, I reworked the
>                                                  >             > proposal document quite a bit. It now talks more
>                                                  >             > explicitly about the difference between dynamic schemas
>                                                  >             > (where the schema is not fully known at graph-creation
>                                                  >             > time), and static schemas (which are fully known at
>                                                  >             > graph-creation time). Proposed APIs are more fleshed
>                                                  >             > out now (again thanks to feedback from community
>                                                  >             > members), and the document talks in more detail about
>                                                  >             > evolving schemas in long-running streaming pipelines.
>                                                  >             >
>                                                  >             > Please take a look. I think this will be very valuable
>                                                  >             > to Beam, and welcome any feedback.
>                                                  >             >
>                                                  >             >
>                                                  >
>                                                  >             > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>                                                  >             >
>                                                  >             > Reuven
>                                                  >
>                                                  >             --
>                                                  >             Jean-Baptiste Onofré
>                                                  >             jbono...@apache.org
>                                                  >             http://blog.nanthrax.net
>                                                  >             Talend - http://www.talend.com
>                                                  >
>                                                  >
>                                                  >
> 
>                                                 --
>                                                 Jean-Baptiste Onofré
>                                                 jbono...@apache.org
>                                                 http://blog.nanthrax.net
>                                                 Talend - http://www.talend.com
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
