Sorry guys, I was off today. Happy to be part of the party too ;)

Regards
JB

On 02/04/2018 06:19 PM, Reuven Lax wrote:
> Romain, since you're interested, maybe the two of us should put together a proposal for how to set these things (hints, schema) on PCollections? I don't think it'll be hard - the previous list thread on hints already agreed on a general approach, and we would just need to flesh it out.
>
> BTW in the past when I looked, Json schemas seemed to have some odd limitations inherited from Javascript (e.g. no distinction between integer and floating-point types). Is that still true?
>
> Reuven
>
> On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > 2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com>:
> > > On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > > > 2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com>:
> > > > > I'm not sure where proto comes from here. Proto is one example of a type that has a schema, but only one example.
> > > > >
> > > > > 1. In the initial prototype I want to avoid modifying the PCollection API. So I think it's best to create a special SchemaCoder, and pass the schema into this coder. Later we might add targeted APIs for this instead of going through a coder.
> > > > > 1.a I don't see what hints have to do with this?
> > > >
> > > > Hints are a way to replace the new API and unify the way to pass metadata in beam instead of adding a new custom way each time.
> > >
> > > I don't think schema is a hint. But I hear what you're saying - hint is a type of PCollection metadata, as is schema, and we should have a unified API for setting such metadata.
> >
> > :), Ismael pointed out to me earlier this week that "hint" had an old meaning in beam. My usage is purely the one done in most EE specs (your "metadata" in the previous answer). But I guess we are aligned on the meaning now, just wanted to be sure.
> >
> > > > > 2. BeamSQL already has a generic record type which fits this use case very well (though we might modify it). However, as mentioned in the doc, the user is never forced to use this generic record type.
> > > >
> > > > Well, yes and no. A type already exists but 1. it is very strictly limited (flat/columns only, which is very little of what big data SQL can do) and 2. it must be aligned on the convergence of generic data the schema will bring (really read "aligned" as "dropped in favor of" - deprecated being a smooth way to do it).
> > >
> > > As I said, the existing class needs to be modified and extended, and not just for this schema use case. It was meant to represent Calcite SQL rows, but doesn't quite even do that yet (Calcite supports nested rows). However I think it's the right basis to start from.
> >
> > Agree on the state. The current impl issues I hit (in addition to the nested support, which would by itself require a kind of visitor solution) are that the record owns the schema and that serialization is handled field by field instead of as a whole, which is how it would be handled with a schema IMHO.
> >
> > Concretely, what I don't want is to do a PoC which works - they all work, right? - and integrate it into beam without thinking about a global solution for this generic record issue and its schema standardization. This is where Json(-P) has a lot of value IMHO, but it requires a bit more love than just adding schema in the model.
> >
> > > > So long story short, the main work of this schema track is not only on using schema in runners and other ways but also starting to make beam consistent with itself, which is probably the most important outcome since it is the user-facing side of this work.
> > > >
> > > > > On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > > > > > @Reuven: is the proto only about passing schema or also the generic type?
> > > > > >
> > > > > > There are 2.5 topics to solve this issue:
> > > > > >
> > > > > > 1. How to pass schema
> > > > > > 1.a. hints?
> > > > > > 2. What is the generic record type associated to a schema and how to express a schema relative to it
> > > > > >
> > > > > > I would be happy to help on 1.a and 2 somehow if you need.
> > > > > >
> > > > > > On Feb 4, 2018 03:30, "Reuven Lax" <re...@google.com> wrote:
> > > > > > > One more thing. If anyone here has experience with various OSS metadata stores (e.g. Kafka Schema Registry is one example), would you like to collaborate on implementation? I want to make sure that source schemas can be stored in a variety of OSS metadata stores, and be easily pulled into a Beam pipeline.
> > > > > > >
> > > > > > > Reuven
> > > > > > >
> > > > > > > On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > If there are no concerns, I would like to start working on a prototype. It's just a prototype, so I don't think it will have the final API (e.g. for the prototype I'm going to avoid changing the API of PCollection, and use a "special" Coder instead). Also, even once we go beyond prototype, it will be @Experimental for some time, so the API will not be fixed in stone.
> > > > > > > >
> > > > > > > > Any more comments on this approach before we start implementing a prototype?
> > > > > > > >
> > > > > > > > Reuven
> > > > > > > >
> > > > > > > > On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > > > > > > > > If you need help on the json part I'm happy to help. To give a few hints on what is very doable: we can add an avro module to johnzon (asf json{p,b} impl) to back jsonp by avro (guess it will be one of the first to be asked for), for instance.
> > > > > > > > >
> > > > > > > > > Romain Manni-Bucau
> > > > > > > > > @rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau>
> > > > > > > > >
> > > > > > > > > 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
> > > > > > > > > > Agree. The initial implementation will be a prototype.
> > > > > > > > > >
> > > > > > > > > > On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > > > > > > > > > > Hi Reuven,
> > > > > > > > > > >
> > > > > > > > > > > Agree to be able to describe the schema with different formats. The good point about json schemas is that they are described by a spec. My point is also to avoid reinventing the wheel. Just an abstraction to be able to use Avro, Json, Calcite, or custom schema descriptors would be great.
> > > > > > > > > > >
> > > > > > > > > > > Using a coder to describe a schema sounds like a smart move to implement quickly. However, it has to be clear in terms of documentation to avoid "side effects". I still think PCollection.setSchema() is better: it should be metadata (or hint ;))) on the PCollection.
> > > > > > > > > > >
> > > > > > > > > > > Regards
> > > > > > > > > > > JB
> > > > > > > > > > >
> > > > > > > > > > > On 31/01/2018 20:16, Reuven Lax wrote:
> > > > > > > > > > > > As to the question of how a schema should be specified, I want to support several common schema formats. So if a user has a Json schema, or an Avro schema, or a Calcite schema, etc. there should be adapters that allow setting a schema from any of them. I don't think we should prefer one over the other. While Romain is right that many people know Json, I think far fewer people know Json schemas.
> > > > > > > > > > > >
> > > > > > > > > > > > Agree, schemas should not be enforced (for one thing, that wouldn't be backwards compatible!). I think for the initial prototype I will probably use a special coder to represent the schema (with setSchema an option on the coder), largely because it doesn't require modifying PCollection. However I think longer term a schema should be an optional piece of metadata on the PCollection object. Similar to the previous discussion about "hints," I think this can be set on the producing PTransform, and a SetSchema PTransform will allow attaching a schema to any PCollection (i.e. pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I think schema should be similar to hints, it's just another piece of metadata on the PCollection (though something interpreted by the model, where hints are interpreted by the runner).
> > > > > > > > > > > >
> > > > > > > > > > > > Reuven
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think we should avoid mixing two things in the discussion (and so the document):
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. The element of the collection and the schema itself are two different things. By essence, Beam should not enforce any schema. That's why I think it's a good idea to set the schema optionally on the PCollection (pcollection.setSchema()).
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. From point 1 come two questions: how do we represent a schema? How can we leverage the schema to simplify the serialization of the element in the PCollection and query? These two questions are not directly related.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2.1 How do we represent the schema
> > > > > > > > > > > > > Json Schema is a very interesting idea. It could be an abstraction, and other providers, like Avro, can be bound to it. It's part of the json processing spec (javax).
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2.2. How do we leverage the schema for query and serialization
> > > > > > > > > > > > > Also in the spec, json pointer is interesting for the querying. Regarding the serialization, jackson or another data binder can be used.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It's still rough ideas in my mind, but I like Romain's idea about json-p usage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Once the 2.3.0 release is out, I will start to update the document with those ideas, and PoC.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > Regards
> > > > > > > > > > > > > JB
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
> > > > > > > > > > > > > > On Jan 30, 2018 01:09, "Reuven Lax" <re...@google.com> wrote:
> > > > > > > > > > > > > > > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
> > > > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I have some questions on this: how would hierarchic schemas work? Seems it is not really supported by the ecosystem (outside of custom stuff) :(. How would it integrate smoothly with other generic record types - N bridges?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Do you mean nested schemas? What do you mean here?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, sorry - wrote the mail too late ;). Was hierarchic data and nested schemas.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Concretely I wonder if using the json API couldn't be beneficial: json-p is a nice generic abstraction with a built-in querying mechanism (jsonpointer) but no actual serialization (even if json and binary json are very natural). The big advantage is to have a well known ecosystem - who doesn't know json today? - that beam can reuse for free: JsonObject (guess we don't want the JsonValue abstraction) for the record type, jsonschema standard for the schema, jsonpointer for the selection/projection etc... It doesn't enforce the actual serialization (json, smile, avro, ...) but provides an expressive and already known API, so I see it as a big win-win for users (no need to learn a new API and use N bridges in all ways) and beam (impls are here and the API design is already thought out).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I assume you're talking about the API for setting schemas, not using them. Json has many downsides and I'm not sure it's true that everyone knows it; there are also competing schema APIs, such as Avro etc. However I think we should give Json a fair evaluation before dismissing it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It is a wider topic than schema. Actually schemas are not the first citizen, but a generic data representation is. That is where json hits almost any other API. Then, when it comes to schema, json has a standard for that so we are all good.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Also json has a good indexing API compared to alternatives, which are sometimes a bit faster - for noop transforms - but are hardly usable or make the code not that readable.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Avro is a nice competitor, and it is compatible - actually avro is json driven by design - but its API is far from being that easy due to its schema enforcement, which is heavvvyyy, and worse is that you can't work with avro without a schema. Json would allow reconciling the dynamic and static cases since the job wouldn't change except the setschema.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > That is why I think json is a good compromise, and having a standard API for it allows fully customizing the impl at will if needed - even using avro or protobuf.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Side note on the beam api: I don't think it is good to use a main API for runner optimization. It enforces something to be shared on all runners but not widely usable. It is also misleading for users. Would you set a flink pipeline option with dataflow?
> > > > > > > > > > > > > > My proposal here is to use hints - properties - instead of something hardly defined in the API, then standardize it if all runners support it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Wdyt?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Jan 29, 2018 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
> > > > > > > > > > > > > > > > > Hi Reuven,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for the update! As I'm working with you on this, I fully agree and great doc gathering the ideas.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > It's clearly something we have to add asap in Beam, because it would allow new use cases for our users (in a simple way) and open new areas for the runners (for instance dataframe support in the Spark runner).
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > By the way, a while ago, I created BEAM-3437 to track the PoC/PR around this.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Regards
> > > > > > > > > > > > > > > > > JB
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On 01/29/2018 02:08 AM, Reuven Lax wrote:
> > > > > > > > > > > > > > > > > > Previously I submitted a proposal for adding schemas as a first-class concept on Beam PCollections. The proposal engendered quite a bit of discussion from the community - more discussion than I've seen from almost any of our proposals to date!
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Based on the feedback and comments, I reworked the proposal document quite a bit. It now talks more explicitly about the difference between dynamic schemas (where the schema is not fully known at graph-creation time), and static schemas (which are fully known at graph-creation time). Proposed APIs are more fleshed out now (again thanks to feedback from community members), and the document talks in more detail about evolving schemas in long-running streaming pipelines.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Please take a look. I think this will be very valuable to Beam, and welcome any feedback.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Reuven
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Jean-Baptiste Onofré
> > > > > > > > > > > > > > > > > jbono...@apache.org
> > > > > > > > > > > > > > > > > http://blog.nanthrax.net
> > > > > > > > > > > > > > > > > Talend - http://www.talend.com
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Jean-Baptiste Onofré
> > > > > > > > > > > > > jbono...@apache.org
> > > > > > > > > > > > > http://blog.nanthrax.net
> > > > > > > > > > > > > Talend - http://www.talend.com

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com