Romain, since you're interested, maybe the two of us should put together a proposal for how to set these things (hints, schema) on PCollections? I don't think it'll be hard - the previous list thread on hints already agreed on a general approach, and we would just need to flesh it out.
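For concreteness, the two shapes discussed further down in the thread could look roughly like the sketch below. This is a hypothetical sketch only: setSchema() and SetSchema did not exist in Beam at this point, and the names (including the SchemaAdapters helper) are taken from or invented around the proposals in the quoted messages, not from an actual API.

    // Hypothetical sketch -- setSchema()/SetSchema are proposal names from this
    // thread, not an existing Beam API; SchemaAdapters is made up for illustration.
    Schema schema = SchemaAdapters.fromAvro(avroSchema);

    PCollection<MyRecord> records = input.apply(ParDo.of(new ParseFn()));

    // Option discussed by JB: schema as optional metadata on the PCollection itself.
    records.setSchema(schema);

    // Option discussed by Reuven: a SetSchema transform that attaches a schema
    // to any PCollection, i.e. pc.apply(SetSchema.of(schema)).
    PCollection<MyRecord> withSchema = records.apply(SetSchema.of(schema));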
BTW, in the past when I looked, Json schemas seemed to have some odd limitations inherited from JavaScript (e.g. no distinction between integer and floating-point types). Is that still true?

Reuven

On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
> 2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com>:
>
>> On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>
>>> 2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com>:
>>>
>>>> I'm not sure where proto comes from here. Proto is one example of a type that has a schema, but only one example.
>>>>
>>>> 1. In the initial prototype I want to avoid modifying the PCollection API. So I think it's best to create a special SchemaCoder, and pass the schema into this coder. Later we might add targeted APIs for this instead of going through a coder.
>>>> 1.a. I don't see what hints have to do with this?
>>>
>>> Hints are a way to replace the new API and unify the way to pass metadata in Beam instead of adding a new custom way each time.
>>
>> I don't think schema is a hint. But I hear what you're saying - a hint is a type of PCollection metadata, as is a schema, and we should have a unified API for setting such metadata.
>
> :) Ismael pointed out to me earlier this week that "hint" had an old meaning in Beam. My usage is purely the one found in most EE specs (your "metadata" in the previous answer). But I guess we are aligned on the meaning now, I just wanted to be sure.
>
>>>> 2. BeamSQL already has a generic record type which fits this use case very well (though we might modify it). However, as mentioned in the doc, the user is never forced to use this generic record type.
>>>
>>> Well, yes and no. A type already exists, but 1. it is very strictly limited (flat/columns only, which is very little of what big data SQL can do) and 2. it must be aligned with the convergence of generic data that the schema will bring (really read "aligned" as "dropped in favor of" - deprecation being a smooth way to do it).
>>
>> As I said, the existing class needs to be modified and extended, and not just for this schema use case. It was meant to represent Calcite SQL rows, but doesn't quite even do that yet (Calcite supports nested rows). However, I think it's the right basis to start from.
>
> Agree on the state. The current impl issues I hit (in addition to the nested support, which would by itself require a kind of visitor solution) are that the record owns its schema and handles serialization field by field instead of as a whole, which is how it would be handled with a schema IMHO.
>
> Concretely, what I don't want is to do a PoC which works - they all work, right? - and integrate it into Beam without thinking of a global solution for this generic record issue and its schema standardization. This is where Json(-P) has a lot of value IMHO, but it requires a bit more love than just adding a schema to the model.
>
>>> So, long story short, the main work of this schema track is not only about using schemas in runners and other ways, but also about starting to make Beam consistent with itself, which is probably the most important outcome since it is the user-facing side of this work.
>>>
>>>> On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>
>>>>> @Reuven: is the proto only about passing schema or also the generic type?
>>>>>
>>>>> There are 2.5 topics to solve for this issue:
>>>>>
>>>>> 1. How to pass the schema
>>>>> 1.a. hints?
>>>>> 2. What is the generic record type associated with a schema, and how do we express a schema relative to it?
>>>>>
>>>>> I would be happy to help on 1.a and 2 somehow if you need.
>>>>>
>>>>> On 4 Feb 2018 at 03:30, "Reuven Lax" <re...@google.com> wrote:
>>>>>
>>>>>> One more thing. If anyone here has experience with various OSS metadata stores (e.g. Kafka Schema Registry is one example), would you like to collaborate on implementation? I want to make sure that source schemas can be stored in a variety of OSS metadata stores, and be easily pulled into a Beam pipeline.
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> If there are no concerns, I would like to start working on a prototype. It's just a prototype, so I don't think it will have the final API (e.g. for the prototype I'm going to avoid changing the API of PCollection, and use a "special" Coder instead). Also, even once we go beyond the prototype, it will be @Experimental for some time, so the API will not be set in stone.
>>>>>>>
>>>>>>> Any more comments on this approach before we start implementing a prototype?
>>>>>>>
>>>>>>> Reuven
>>>>>>>
>>>>>>> On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> If you need help on the JSON part, I'm happy to help. To give a few hints on what is very doable: we can add an Avro module to Johnzon (the ASF JSON-P/JSON-B impl) to back JSON-P with Avro, for instance (I guess it will be one of the first to be requested).
>>>>>>>>
>>>>>>>> Romain Manni-Bucau
>>>>>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau>
>>>>>>>>
>>>>>>>> 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>>>>>
>>>>>>>>> Agree. The initial implementation will be a prototype.
>>>>>>>>>
>>>>>>>>> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Reuven,
>>>>>>>>>>
>>>>>>>>>> Agreed on being able to describe the schema with different formats. The good point about Json schemas is that they are described by a spec. My point is also to avoid reinventing the wheel. Just an abstraction able to use Avro, Json, Calcite, or custom schema descriptors would be great.
>>>>>>>>>>
>>>>>>>>>> Using a coder to describe a schema sounds like a smart move to implement quickly. However, it has to be clear in terms of documentation to avoid "side effects". I still think PCollection.setSchema() is better: it should be metadata (or a hint ;)) on the PCollection.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> JB
>>>>>>>>>>
>>>>>>>>>> On 31/01/2018 20:16, Reuven Lax wrote:
>>>>>>>>>>
>>>>>>>>>>> As to the question of how a schema should be specified, I want to support several common schema formats. So if a user has a Json schema, or an Avro schema, or a Calcite schema, etc.,
>>>>>>>>>>> there should be adapters that allow setting a schema from any of them. I don't think we should prefer one over the other. While Romain is right that many people know Json, I think far fewer people know Json schemas.
>>>>>>>>>>>
>>>>>>>>>>> Agreed, schemas should not be enforced (for one thing, that wouldn't be backwards compatible!). I think for the initial prototype I will probably use a special coder to represent the schema (with setSchema an option on the coder), largely because it doesn't require modifying PCollection. However, I think longer term a schema should be an optional piece of metadata on the PCollection object. Similar to the previous discussion about "hints," I think this can be set on the producing PTransform, and a SetSchema PTransform will allow attaching a schema to any PCollection (i.e. pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I think a schema should be similar to hints: it's just another piece of metadata on the PCollection (though something interpreted by the model, whereas hints are interpreted by the runner).
>>>>>>>>>>>
>>>>>>>>>>> Reuven
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I think we should avoid mixing two things in the discussion (and so the document):
>>>>>>>>>>>
>>>>>>>>>>> 1. The element of the collection and the schema itself are two different things. By essence, Beam should not enforce any schema. That's why I think it's a good idea to set the schema optionally on the PCollection (pcollection.setSchema()).
>>>>>>>>>>>
>>>>>>>>>>> 2. From point 1 come two questions: how do we represent a schema? How can we leverage the schema to simplify the serialization of the elements in the PCollection and querying? These two questions are not directly related.
>>>>>>>>>>>
>>>>>>>>>>> 2.1. How do we represent the schema?
>>>>>>>>>>> Json Schema is a very interesting idea. It could be an abstraction, and other providers, like Avro, could be bound to it. It's part of the JSON Processing spec (javax).
>>>>>>>>>>>
>>>>>>>>>>> 2.2. How do we leverage the schema for querying and serialization?
>>>>>>>>>>> Also in the spec, Json Pointer is interesting for the querying. Regarding the serialization, Jackson or other data binders can be used.
>>>>>>>>>>>
>>>>>>>>>>> These are still rough ideas in my mind, but I like Romain's idea about json-p usage.
>>>>>>>>>>>
>>>>>>>>>>> Once the 2.3.0 release is out, I will start to update the document with those ideas, and a PoC.
>>>>>>>>>>>
>>>>>>>>>>> Thanks !
>>>>>>>>>>> Regards
>>>>>>>>>>> JB
>>>>>>>>>>>
>>>>>>>>>>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
2018 01:09, "Reuven Lax" <re...@google.com >>>>>>>>>>> <mailto:re...@google.com> >>>>>>>>>>> > <mailto:re...@google.com <mailto:re...@google.com>>> a >>>>>>>>>>> écrit : >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau < >>>>>>>>>>> rmannibu...@gmail.com <mailto:rmannibu...@gmail.com> >>>>>>>>>>> > <mailto:rmannibu...@gmail.com >>>>>>>>>>> >>>>>>>>>>> <mailto:rmannibu...@gmail.com>>> wrote: >>>>>>>>>>> > >>>>>>>>>>> > Hi >>>>>>>>>>> > >>>>>>>>>>> > I have some questions on this: how hierarchic >>>>>>>>>>> schemas >>>>>>>>>>> would work? Seems >>>>>>>>>>> > it is not really supported by the ecosystem (out >>>>>>>>>>> of >>>>>>>>>>> custom stuff) :(. >>>>>>>>>>> > How would it integrate smoothly with other >>>>>>>>>>> generic record >>>>>>>>>>> types - N bridges? >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > Do you mean nested schemas? What do you mean here? >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > Yes, sorry - wrote the mail too late ;). Was hierarchic >>>>>>>>>>> data and >>>>>>>>>>> nested schemas. >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > Concretely I wonder if using json API couldnt be >>>>>>>>>>> beneficial: json-p is a >>>>>>>>>>> > nice generic abstraction with a built in querying >>>>>>>>>>> mecanism (jsonpointer) >>>>>>>>>>> > but no actual serialization (even if json and >>>>>>>>>>> binary json >>>>>>>>>>> are very >>>>>>>>>>> > natural). The big advantage is to have a well >>>>>>>>>>> known >>>>>>>>>>> ecosystem - who >>>>>>>>>>> > doesnt know json today? - that beam can reuse for >>>>>>>>>>> free: >>>>>>>>>>> JsonObject >>>>>>>>>>> > (guess we dont want JsonValue abstraction) for >>>>>>>>>>> the record >>>>>>>>>>> type, >>>>>>>>>>> > jsonschema standard for the schema, jsonpointer >>>>>>>>>>> for the >>>>>>>>>>> > delection/projection etc... It doesnt enforce the >>>>>>>>>>> actual >>>>>>>>>>> serialization >>>>>>>>>>> > (json, smile, avro, ...) but provide an >>>>>>>>>>> expressive and >>>>>>>>>>> alread known API >>>>>>>>>>> > so i see it as a big win-win for users (no need >>>>>>>>>>> to learn >>>>>>>>>>> a new API and >>>>>>>>>>> > use N bridges in all ways) and beam (impls are >>>>>>>>>>> here and >>>>>>>>>>> API design >>>>>>>>>>> > already thought). >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > I assume you're talking about the API for setting >>>>>>>>>>> schemas, >>>>>>>>>>> not using them. >>>>>>>>>>> > Json has many downsides and I'm not sure it's true >>>>>>>>>>> that >>>>>>>>>>> everyone knows it; >>>>>>>>>>> > there are also competing schema APIs, such as Avro >>>>>>>>>>> etc.. >>>>>>>>>>> However I think we >>>>>>>>>>> > should give Json a fair evaluation before dismissing >>>>>>>>>>> it. >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > It is a wider topic than schema. Actually schema are not >>>>>>>>>>> the >>>>>>>>>>> first citizen but a >>>>>>>>>>> > generic data representation is. That is where json hits >>>>>>>>>>> almost >>>>>>>>>>> any other API. >>>>>>>>>>> > Then, when it comes to schema, json has a standard for >>>>>>>>>>> that so we >>>>>>>>>>> are all good. >>>>>>>>>>> > >>>>>>>>>>> > Also json has a good indexing API compared to >>>>>>>>>>> alternatives which >>>>>>>>>>> are sometimes a >>>>>>>>>>> > bit faster - for noop transforms - but are hardly usable >>>>>>>>>>> or make >>>>>>>>>>> the code not >>>>>>>>>>> > that readable. 
>>>>>>>>>>> >
>>>>>>>>>>> > Avro is a nice competitor, and it is compatible - actually Avro is JSON-driven by design - but its API is far from being that easy due to its schema enforcement, which is heavy, and worse, you can't work with Avro without a schema. JSON would allow reconciling the dynamic and static cases, since the job wouldn't change except for the setSchema.
>>>>>>>>>>> >
>>>>>>>>>>> > That is why I think JSON is a good compromise: having a standard API for it allows fully customizing the impl at will if needed - even using Avro or protobuf.
>>>>>>>>>>> >
>>>>>>>>>>> > Side note on the Beam API: I don't think it is good to use a main API for runner optimization. It enforces something to be shared by all runners but not widely usable. It is also misleading for users. Would you set a Flink pipeline option with Dataflow? My proposal here is to use hints - properties - instead of something hard-wired in the API, then standardize it if all runners support it.
>>>>>>>>>>> >
>>>>>>>>>>> > Wdyt?
>>>>>>>>>>> >
>>>>>>>>>>> > On 29 Jan 2018 at 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hi Reuven,
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks for the update! As I'm working with you on this, I fully agree - great doc gathering the ideas.
>>>>>>>>>>> >
>>>>>>>>>>> > It's clearly something we have to add asap in Beam, because it would allow new use cases for our users (in a simple way) and open new areas for the runners (for instance dataframe support in the Spark runner).
>>>>>>>>>>> >
>>>>>>>>>>> > By the way, a while ago I created BEAM-3437 to track the PoC/PR around this.
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks !
>>>>>>>>>>> >
>>>>>>>>>>> > Regards
>>>>>>>>>>> > JB
>>>>>>>>>>> >
>>>>>>>>>>> > On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>>>>>>>>> > > Previously I submitted a proposal for adding schemas as a first-class concept on Beam PCollections. The proposal engendered quite a bit of discussion from the community - more discussion than I've seen from almost any of our proposals to date!
>>>>>>>>>>> > >
>>>>>>>>>>> > > Based on the feedback and comments, I reworked the proposal document quite a bit. It now talks more explicitly about the difference between dynamic schemas (where the schema is not fully known at graph-creation time) and static schemas (which are fully known at graph-creation time).
>>>>>>>>>>> > > Proposed APIs are more fleshed out now (again thanks to feedback from community members), and the document talks in more detail about evolving schemas in long-running streaming pipelines.
>>>>>>>>>>> > >
>>>>>>>>>>> > > Please take a look. I think this will be very valuable to Beam, and welcome any feedback.
>>>>>>>>>>> > >
>>>>>>>>>>> > > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>>>>>>>>> > >
>>>>>>>>>>> > > Reuven
>>>>>>>>>>> >
>>>>>>>>>>> > --
>>>>>>>>>>> > Jean-Baptiste Onofré
>>>>>>>>>>> > jbono...@apache.org
>>>>>>>>>>> > http://blog.nanthrax.net
>>>>>>>>>>> > Talend - http://www.talend.com
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jean-Baptiste Onofré
>>>>>>>>>>> jbono...@apache.org
>>>>>>>>>>> http://blog.nanthrax.net
>>>>>>>>>>> Talend - http://www.talend.com
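To make the JSON-P idea in the quoted thread concrete, here is a minimal sketch of what Romain and JB describe - JsonObject as the schema-less generic record and JsonPointer for querying/projection - using only the standard javax.json API (JSON-P 1.1, as implemented by Johnzon among others). How this would plug into Beam (coder, setSchema, and so on) is deliberately left out, since that part was still an open proposal.

    import javax.json.Json;
    import javax.json.JsonObject;
    import javax.json.JsonPointer;
    import javax.json.JsonValue;

    public class JsonpSketch {
        public static void main(String[] args) {
            // A generic record built with the standard JSON-P builder API,
            // including nested structure (the "hierarchical data" case).
            JsonObject record = Json.createObjectBuilder()
                .add("user", Json.createObjectBuilder()
                    .add("name", "alice")
                    .add("age", 33))
                .build();

            // JsonPointer (RFC 6901) provides the built-in querying/projection
            // mechanism mentioned in the thread; requires the JSON-P 1.1 API.
            JsonPointer agePointer = Json.createPointer("/user/age");
            JsonValue age = agePointer.getValue(record);

            System.out.println(age); // prints 33
        }
    }

The actual wire format (JSON, Smile, Avro, ...) stays a separate concern, which is the point Romain makes above: the API describes and queries the generic record without dictating its serialization.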