Cool, let's chat about this on Slack for a bit (I just realized I've been signed out of it for some time).

Reuven

On Sun, Feb 4, 2018 at 9:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Sorry guys, I was off today. Happy to be part of the party too ;)
>
> Regards
> JB
>
> On 02/04/2018 06:19 PM, Reuven Lax wrote:
> > Romain, since you're interested, maybe the two of us should put together a proposal for how to set these things (hints, schema) on PCollections? I don't think it'll be hard - the previous list thread on hints already agreed on a general approach, and we would just need to flesh it out.
> >
> > BTW in the past when I looked, Json schemas seemed to have some odd limitations inherited from Javascript (e.g. no distinction between integer and floating-point types). Is that still true?
> >
> > Reuven
> >
> > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > > 2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com>:
> > > > On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > > > > 2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com>:
> > > > > > I'm not sure where proto comes from here. Proto is one example of a type that has a schema, but only one example.
> > > > > >
> > > > > > 1. In the initial prototype I want to avoid modifying the PCollection API. So I think it's best to create a special SchemaCoder, and pass the schema into this coder. Later we might add targeted APIs for this instead of going through a coder.
> > > > > > 1.a I don't see what hints have to do with this?
> > > > >
> > > > > Hints are a way to replace the new API and unify the way to pass metadata in Beam instead of adding a new custom way each time.
> > > >
> > > > I don't think schema is a hint. But I hear what you're saying - hint is a type of PCollection metadata, as is schema, and we should have a unified API for setting such metadata.
> > >
> > > :), Ismael pointed out to me earlier this week that "hint" had an old meaning in Beam. My usage is purely the one found in most EE specs (your "metadata" in the previous answer). But I guess we are aligned on the meaning now, just wanted to be sure.
> > > > > >
> > > > > > 2. BeamSQL already has a generic record type which fits this use case very well (though we might modify it). However, as mentioned in the doc, the user is never forced to use this generic record type.
> > > > >
> > > > > Well, yes and no. A type already exists, but 1. it is very strictly limited (flat/columns only, which is very little of what big data SQL can do) and 2. it must be aligned on the convergence of generic data the schema will bring (really read "aligned" as "dropped in favor of" - deprecation being a smooth way to do it).
> > > >
> > > > As I said, the existing class needs to be modified and extended, and not just for this schema use case. It was meant to represent Calcite SQL rows, but doesn't quite even do that yet (Calcite supports nested rows). However I think it's the right basis to start from.
> > >
> > > Agree on the state. The current impl issues I hit (in addition to the nested support, which would by itself require a kind of visitor solution) are that the record owns the schema and that serialization is handled field by field instead of as a whole, which is how it would be handled with a schema IMHO.
> > >
> > > Concretely, what I don't want is to do a PoC which works - they all work, right? - and integrate it into Beam without thinking about a global solution for this generic record issue and its schema standardization. This is where Json(-P) has a lot of value IMHO, but it requires a bit more love than just adding schema to the model.
> > > > >
> > > > > So, long story short, the main work of this schema track is not only on using schemas in runners and elsewhere, but also on starting to make Beam consistent with itself, which is probably the most important outcome since it is the user-facing side of this work.
> > > > > >
> > > > > > On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > > > > > > @Reuven: is the proto only about passing the schema or also the generic type?
> > > > > > >
> > > > > > > There are 2.5 topics to solve in this issue:
> > > > > > >
> > > > > > > 1. How to pass the schema
> > > > > > > 1.a. hints?
> > > > > > > 2. What is the generic record type associated with a schema, and how to express a schema relative to it
> > > > > > >
> > > > > > > I would be happy to help on 1.a and 2 somehow if you need.
> > > > > > >
> > > > > > > On Feb 4, 2018 at 03:30, "Reuven Lax" <re...@google.com> wrote:
> > > > > > > > One more thing. If anyone here has experience with various OSS metadata stores (e.g. Kafka Schema Registry is one example), would you like to collaborate on implementation? I want to make sure that source schemas can be stored in a variety of OSS metadata stores, and be easily pulled into a Beam pipeline.
> > > > > > > >
> > > > > > > > Reuven
> > > > > > > >
> > > > > > > > On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > If there are no concerns, I would like to start working on a prototype. It's just a prototype, so I don't think it will have the final API (e.g. for the prototype I'm going to avoid changing the API of PCollection, and use a "special" Coder instead). Also, even once we go beyond the prototype, it will be @Experimental for some time, so the API will not be fixed in stone.
> > > > > > > > >
> > > > > > > > > Any more comments on this approach before we start implementing a prototype?
> > > > > > > > >
> > > > > > > > > Reuven
> > > > > > > > >
> > > > > > > > > On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > > > > > > > > > If you need help on the json part I'm happy to help. To give a few hints on what is very doable: we can add an avro module to johnzon (asf json{p,b} impl) to back jsonp by avro (guess it will be one of the first to be asked for), for instance.
> > > > > > > > > >
> > > > > > > > > > Romain Manni-Bucau
> > > > > > > > > > @rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau>
> > > > > > > > > >
> > > > > > > > > > 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
> > > > > > > > > > > Agree. The initial implementation will be a prototype.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > > > > > > > > > > > Hi Reuven,
> > > > > > > > > > > >
> > > > > > > > > > > > Agree to be able to describe the schema with different formats. The good point about json schemas is that they are described by a spec. My point is also to avoid reinventing the wheel. Just an abstraction to be able to use Avro, Json, Calcite, custom schema descriptors would be great.
> > > > > > > > > > > >
> > > > > > > > > > > > Using a coder to describe a schema sounds like a smart move to implement quickly. However, it has to be clear in terms of documentation to avoid "side effects". I still think PCollection.setSchema() is better: it should be metadata (or hint ;))) on the PCollection.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards
> > > > > > > > > > > > JB
> > > > > > > > > > > >
> > > > > > > > > > > > On 31/01/2018 20:16, Reuven Lax wrote:
> > > > > > > > > > > > > As to the question of how a schema should be specified, I want to support several common schema formats. So if a user has a Json schema, or an Avro schema, or a Calcite schema, etc., there should be adapters that allow setting a schema from any of them. I don't think we should prefer one over the other. While Romain is right that many people know Json, I think far fewer people know Json schemas.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Agree, schemas should not be enforced (for one thing, that wouldn't be backwards compatible!). I think for the initial prototype I will probably use a special coder to represent the schema (with setSchema an option on the coder), largely because it doesn't require modifying PCollection. However I think longer term a schema should be an optional piece of metadata on the PCollection object. Similar to the previous discussion about "hints," I think this can be set on the producing PTransform, and a SetSchema PTransform will allow attaching a schema to any PCollection (i.e. pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I think schema should be similar to hints: it's just another piece of metadata on the PCollection (though something interpreted by the model, where hints are interpreted by the runner).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Reuven
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think we should avoid mixing two things in the discussion (and so in the document):
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. The element of the collection and the schema itself are two different things. By essence, Beam should not enforce any schema. That's why I think it's a good idea to set the schema optionally on the PCollection (pcollection.setSchema()).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. From point 1 come two questions: how do we represent a schema?
> > > > > > > > > > > > > > How can we leverage the schema to simplify the serialization of the elements in the PCollection, and querying? These two questions are not directly related.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2.1 How do we represent the schema
> > > > > > > > > > > > > > Json Schema is a very interesting idea. It could be an abstraction, and other providers, like Avro, can be bound to it. It's part of the json processing spec (javax).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2.2. How do we leverage the schema for query and serialization
> > > > > > > > > > > > > > Also in the spec, json pointer is interesting for the querying. Regarding the serialization, jackson or another data binder can be used.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > These are still rough ideas in my mind, but I like Romain's idea about json-p usage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Once the 2.3.0 release is out, I will start to update the document with those ideas, and a PoC.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks !
> > > > > > > > > > > > > > Regards
> > > > > > > > > > > > > > JB
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
> > > > > > > > > > > > > > > On Jan 30, 2018 at 01:09, "Reuven Lax" <re...@google.com> wrote:
> > > > > > > > > > > > > > > > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I have some questions on this: how would hierarchic schemas work? It seems they are not really supported by the ecosystem (outside of custom stuff) :(. How would it integrate smoothly with other generic record types - N bridges?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Do you mean nested schemas? What do you mean here?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yes, sorry - wrote the mail too late ;). I meant hierarchic data and nested schemas.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Concretely, I wonder if using the json API couldn't be beneficial: json-p is a nice generic abstraction with a built-in querying mechanism (jsonpointer) but no actual serialization (even if json and binary json are very natural). The big advantage is to have a well-known ecosystem - who doesn't know json today? - that Beam can reuse for free: JsonObject (guess we don't want the JsonValue abstraction) for the record type, the jsonschema standard for the schema, jsonpointer for the selection/projection, etc. It doesn't enforce the actual serialization (json, smile, avro, ...) but provides an expressive and already known API, so I see it as a big win-win for users (no need to learn a new API and use N bridges in all ways) and Beam (impls are here and the API design is already thought out).
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I assume you're talking about the API for setting schemas, not using them. Json has many downsides and I'm not sure it's true that everyone knows it; there are also competing schema APIs, such as Avro etc. However I think we should give Json a fair evaluation before dismissing it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It is a wider topic than schema. Actually, schemas are not the first-class citizen, but a generic data representation is. That is where json beats almost any other API. Then, when it comes to schema, json has a standard for that, so we are all good.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Also, json has a good indexing API compared to alternatives, which are sometimes a bit faster - for noop transforms - but are hardly usable or make the code not that readable.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Avro is a nice competitor, and it is compatible - actually avro is json-driven by design - but its API is far from being that easy, due to its schema enforcement, which is heavy; worse, you can't work with avro without a schema. Json would allow reconciling the dynamic and static cases, since the job wouldn't change except for the setSchema call.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That is why I think json is a good compromise, and having a standard API for it allows fully customizing the impl at will if needed - even using avro or protobuf.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Side note on the Beam API: I don't think it is good to use a main API for runner optimization. It forces something to be shared across all runners that is not widely usable. It is also misleading for users. Would you set a flink pipeline option with dataflow? My proposal here is to use hints - properties - instead of something rigidly defined in the API, then standardize it if all runners support it.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Wdyt?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Jan 29, 2018 at 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
> > > > > > > > > > > > > > > > > > Hi Reuven,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the update ! As I'm working with you on this, I fully agree, and it's a great doc gathering the ideas.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > It's clearly something we have to add asap in Beam, because it would allow new use cases for our users (in a simple way) and open new areas for the runners (for instance dataframe support in the Spark runner).
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > By the way, a while ago, I created BEAM-3437 to track the PoC/PR around this.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks !
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Regards
> > > > > > > > > > > > > > > > > > JB
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On 01/29/2018 02:08 AM, Reuven Lax wrote:
> > > > > > > > > > > > > > > > > > > Previously I submitted a proposal for adding schemas as a first-class concept on Beam PCollections. The proposal engendered quite a bit of discussion from the community - more discussion than I've seen from almost any of our proposals to date!
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Based on the feedback and comments, I reworked the proposal document quite a bit. It now talks more explicitly about the difference between dynamic schemas (where the schema is not fully known at graph-creation time) and static schemas (which are fully known at graph-creation time). Proposed APIs are more fleshed out now (again thanks to feedback from community members), and the document talks in more detail about evolving schemas in long-running streaming pipelines.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Please take a look. I think this will be very valuable to Beam, and I welcome any feedback.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Reuven
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Jean-Baptiste Onofré
> > > > > > > > > > > > > > > > > > jbono...@apache.org
> > > > > > > > > > > > > > > > > > http://blog.nanthrax.net
> > > > > > > > > > > > > > > > > > Talend - http://www.talend.com
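
For anyone picking this thread up later, here is a minimal sketch of the prototype approach discussed above: a special coder that carries the schema, so the PCollection API itself stays untouched. It is plain Python rather than Beam code, and every name in it (`Schema`, `SchemaCoder`, the `int64`/`float64` type tags) is hypothetical and not an existing Beam API. The type tags keep integers and floating-point values distinct, and the record is serialized as a whole from the schema rather than field by field.

```python
import struct
from dataclasses import dataclass
from typing import Any, Dict, Tuple

# Hypothetical illustration only: Schema and SchemaCoder are NOT Beam APIs.


@dataclass(frozen=True)
class Schema:
    # Ordered (field name, type tag) pairs. "int64" and "float64" are
    # separate tags, so integer and floating-point fields stay distinct.
    fields: Tuple[Tuple[str, str], ...]


# Assumed mapping from type tag to struct format character.
_FORMATS = {"int64": "q", "float64": "d"}


class SchemaCoder:
    """A coder that is handed the schema and encodes whole records with it."""

    def __init__(self, schema: Schema) -> None:
        self.schema = schema
        # Build one big-endian struct layout for the entire record.
        layout = ">" + "".join(_FORMATS[tag] for _, tag in schema.fields)
        self._struct = struct.Struct(layout)

    def encode(self, record: Dict[str, Any]) -> bytes:
        # Values are packed in schema order; a float in an int64 field
        # raises struct.error instead of being silently truncated.
        values = (record[name] for name, _ in self.schema.fields)
        return self._struct.pack(*values)

    def decode(self, data: bytes) -> Dict[str, Any]:
        values = self._struct.unpack(data)
        return {name: v for (name, _), v in zip(self.schema.fields, values)}


# Usage: the schema travels with the coder, not with each record.
coder = SchemaCoder(Schema((("user_id", "int64"), ("score", "float64"))))
payload = coder.encode({"user_id": 42, "score": 1.5})
roundtrip = coder.decode(payload)
```

Moving from this to the longer-term idea in the thread (schema as optional metadata on the collection, e.g. via a SetSchema-style transform) would mainly change where the `Schema` object is attached, not how records are encoded.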