One more thing: if anyone here has experience with various OSS metadata stores (e.g. Kafka Schema Registry), would you like to collaborate on the implementation? I want to make sure that source schemas can be stored in a variety of OSS metadata stores, and be easily pulled into a Beam pipeline.
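One possible shape for such an adapter layer, sketched in plain Java with invented names (this is not an actual registry client API, just an illustration of the kind of interface different metadata stores could plug into):

```java
import java.util.Map;
import java.util.Optional;

public class SchemaStoreSketch {
    // Hypothetical adapter interface: any OSS metadata store (Kafka Schema
    // Registry, Hive Metastore, ...) could implement this so a pipeline can
    // pull a schema by subject name. All names here are invented.
    interface SchemaStore {
        Optional<String> lookup(String subject); // e.g. an Avro or JSON schema string
    }

    // An in-memory stand-in, useful for tests or as the trivial provider.
    static SchemaStore inMemory(Map<String, String> schemas) {
        return subject -> Optional.ofNullable(schemas.get(subject));
    }

    public static void main(String[] args) {
        SchemaStore store = inMemory(Map.of("events-value", "{\"type\":\"record\"}"));
        System.out.println(store.lookup("events-value").isPresent()); // true
    }
}
```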
Reuven

On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:

> Hi all,
>
> If there are no concerns, I would like to start working on a prototype.
> It's just a prototype, so I don't think it will have the final API (e.g.
> for the prototype I'm going to avoid changing the API of PCollection, and
> use a "special" Coder instead). Also, even once we go beyond the
> prototype, it will be @Experimental for some time, so the API will not be
> fixed in stone.
>
> Any more comments on this approach before we start implementing a
> prototype?
>
> Reuven
>
> On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
>> If you need help on the JSON part, I'm happy to help. To give a few hints
>> on what is very doable: we can add an Avro module to Johnzon (the ASF
>> JSON{P,B} implementation) to back JSON-P with Avro, for instance (I guess
>> it will be one of the first to be asked for).
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> | Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau>
>>
>> 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>>> Agreed. The initial implementation will be a prototype.
>>>
>>> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>
>>>> Hi Reuven,
>>>>
>>>> Agreed on being able to describe the schema with different formats. The
>>>> good point about JSON schemas is that they are described by a spec. My
>>>> point is also to avoid reinventing the wheel. An abstraction able to
>>>> use Avro, JSON, Calcite, or custom schema descriptors would be great.
>>>>
>>>> Using a coder to describe a schema sounds like a smart move to
>>>> implement quickly. However, it has to be clear in terms of
>>>> documentation to avoid "side effects".
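The "special Coder" idea discussed above can be sketched in plain Java. All names below are invented stand-ins for illustration, not the actual Beam API: the point is only that a coder can carry a schema alongside its delegate without changing PCollection.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class SchemaCoderSketch {
    // Minimal stand-in for a schema: an ordered list of field names.
    record Schema(List<String> fieldNames) {}

    // Minimal stand-in for a Beam Coder: only the part relevant to the idea.
    interface Coder<T> {
        byte[] encode(T value);
    }

    // The "special" coder: delegates encoding, but also exposes a schema
    // that the model could inspect, without any change to PCollection.
    static final class SchemaCoder<T> implements Coder<T> {
        private final Coder<T> delegate;
        private final Schema schema;

        SchemaCoder(Coder<T> delegate, Schema schema) {
            this.delegate = delegate;
            this.schema = schema;
        }

        @Override public byte[] encode(T value) { return delegate.encode(value); }
        Schema getSchema() { return schema; }
    }

    public static void main(String[] args) {
        Coder<String> utf8 = s -> s.getBytes(StandardCharsets.UTF_8);
        SchemaCoder<String> coder =
            new SchemaCoder<>(utf8, new Schema(List.of("user", "country")));
        System.out.println(coder.getSchema().fieldNames()); // [user, country]
    }
}
```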
>>>> I still think PCollection.setSchema() is better: it should be metadata
>>>> (or a hint ;)) on the PCollection.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 31/01/2018 20:16, Reuven Lax wrote:
>>>>
>>>>> As to the question of how a schema should be specified, I want to
>>>>> support several common schema formats. So if a user has a JSON schema,
>>>>> an Avro schema, a Calcite schema, etc., there should be adapters that
>>>>> allow setting a schema from any of them. I don't think we should
>>>>> prefer one over the others. While Romain is right that many people
>>>>> know JSON, I think far fewer people know JSON schemas.
>>>>>
>>>>> Agreed, schemas should not be enforced (for one thing, that wouldn't
>>>>> be backwards compatible!). For the initial prototype I will probably
>>>>> use a special coder to represent the schema (with setSchema an option
>>>>> on the coder), largely because it doesn't require modifying
>>>>> PCollection. However, I think that longer term a schema should be an
>>>>> optional piece of metadata on the PCollection object. Similar to the
>>>>> previous discussion about "hints," I think this can be set on the
>>>>> producing PTransform, and a SetSchema PTransform will allow attaching
>>>>> a schema to any PCollection (i.e. pc.apply(SetSchema.of(schema))).
>>>>> This part isn't designed yet, but I think a schema should be similar
>>>>> to hints: it's just another piece of metadata on the PCollection
>>>>> (though something interpreted by the model, where hints are
>>>>> interpreted by the runner).
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I think we should avoid mixing two things in the discussion (and so in
>>>>> the document):
>>>>>
>>>>> 1. The elements of the collection and the schema itself are two
>>>>> different things. By essence, Beam should not enforce any schema.
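The pc.apply(SetSchema.of(schema)) pattern proposed above can be sketched with minimal stand-ins for the Beam types. Everything here is hypothetical scaffolding for illustration, not the real Beam API: an identity transform whose only effect is attaching schema metadata to the collection.

```java
import java.util.List;

public class SetSchemaSketch {
    record Schema(List<String> fieldNames) {}

    // Minimal stand-in for PCollection: no elements, just optional schema
    // metadata, as proposed in the thread.
    static final class PCollection<T> {
        Schema schema;
        <OutT> PCollection<OutT> apply(PTransform<T, OutT> t) { return t.expand(this); }
    }

    interface PTransform<InT, OutT> {
        PCollection<OutT> expand(PCollection<InT> input);
    }

    // The proposed transform: identity on elements, sets the schema.
    static <T> PTransform<T, T> setSchema(Schema schema) {
        return input -> {
            PCollection<T> out = new PCollection<>();
            out.schema = schema;
            return out;
        };
    }

    public static void main(String[] args) {
        PCollection<String> pc = new PCollection<>();
        PCollection<String> withSchema =
            pc.apply(setSchema(new Schema(List.of("user"))));
        System.out.println(withSchema.schema.fieldNames()); // [user]
    }
}
```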
>>>>> That's why I think it's a good idea to set the schema optionally on
>>>>> the PCollection (pcollection.setSchema()).
>>>>>
>>>>> 2. From point 1 come two questions: how do we represent a schema? And
>>>>> how can we leverage the schema to simplify the serialization of the
>>>>> elements in the PCollection, and querying? These two questions are not
>>>>> directly related.
>>>>>
>>>>> 2.1. How do we represent the schema?
>>>>> JSON Schema is a very interesting idea. It could be the abstraction,
>>>>> and other providers, like Avro, could be bound to it. It's part of the
>>>>> JSON Processing spec (javax).
>>>>>
>>>>> 2.2. How do we leverage the schema for querying and serialization?
>>>>> Also in the spec, JSON Pointer is interesting for querying. Regarding
>>>>> serialization, Jackson or another data binder can be used.
>>>>>
>>>>> These are still rough ideas in my mind, but I like Romain's idea about
>>>>> JSON-P usage.
>>>>>
>>>>> Once the 2.3.0 release is out, I will start updating the document with
>>>>> those ideas, and a PoC.
>>>>>
>>>>> Thanks!
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>>> >
>>>>> > On Jan 30, 2018 at 01:09, "Reuven Lax" <re...@google.com> wrote:
>>>>> >
>>>>> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
>>>>> > <rmannibu...@gmail.com> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I have some questions on this: how would hierarchical schemas work?
>>>>> > It seems they are not really supported by the ecosystem (outside of
>>>>> > custom stuff) :(. How would they integrate smoothly with other
>>>>> > generic record types - N bridges?
>>>>> >
>>>>> > Do you mean nested schemas? What do you mean here?
>>>>> >
>>>>> > Yes, sorry - I wrote the mail too late ;). I meant hierarchical data
>>>>> > and nested schemas.
>>>>> >
>>>>> > Concretely, I wonder if using a JSON API couldn't be beneficial:
>>>>> > JSON-P is a nice generic abstraction with a built-in querying
>>>>> > mechanism (JSON Pointer) but no mandated serialization (even if JSON
>>>>> > and binary JSON are very natural). The big advantage is to have a
>>>>> > well-known ecosystem - who doesn't know JSON today? - that Beam can
>>>>> > reuse for free: JsonObject (I guess we don't want the JsonValue
>>>>> > abstraction) for the record type, the JSON Schema standard for the
>>>>> > schema, JSON Pointer for selection/projection, etc. It doesn't
>>>>> > enforce the actual serialization (JSON, Smile, Avro, ...) but
>>>>> > provides an expressive and already-known API, so I see it as a big
>>>>> > win-win for users (no need to learn a new API and use N bridges in
>>>>> > all ways) and for Beam (implementations already exist and the API
>>>>> > design is already thought out).
>>>>> >
>>>>> > I assume you're talking about the API for setting schemas, not using
>>>>> > them. JSON has many downsides, and I'm not sure it's true that
>>>>> > everyone knows it; there are also competing schema APIs, such as
>>>>> > Avro etc. However, I think we should give JSON a fair evaluation
>>>>> > before dismissing it.
>>>>> >
>>>>> > It is a wider topic than schemas. Actually, schemas are not the
>>>>> > first-class citizen here; a generic data representation is. That is
>>>>> > where JSON beats almost any other API. Then, when it comes to
>>>>> > schemas, JSON has a standard for that, so we are all good.
>>>>> >
>>>>> > Also, JSON has a good indexing API compared to alternatives which
>>>>> > are sometimes a bit faster - for no-op transforms - but are hardly
>>>>> > usable or make the code less readable.
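The JSON Pointer (RFC 6901) selection/projection mechanism mentioned above can be illustrated without a JSON-P implementation. This is a toy resolver over plain Maps and Lists, handling only the happy path; a real JSON-P 1.1 implementation such as Johnzon provides this via its JsonPointer API.

```java
import java.util.List;
import java.util.Map;

public class JsonPointerSketch {
    // Toy RFC 6901 resolver: "/user/name" descends into maps by key and
    // into lists by index. Escapes ~1 -> "/" and ~0 -> "~" per the spec.
    static Object resolve(Object doc, String pointer) {
        if (pointer.isEmpty()) return doc; // "" points at the whole document
        Object current = doc;
        for (String token : pointer.substring(1).split("/", -1)) {
            token = token.replace("~1", "/").replace("~0", "~");
            if (current instanceof Map<?, ?> m) {
                current = m.get(token);
            } else if (current instanceof List<?> l) {
                current = l.get(Integer.parseInt(token));
            } else {
                throw new IllegalArgumentException("Cannot descend into " + current);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> record =
            Map.of("user", Map.of("name", "JB", "langs", List.of("java", "scala")));
        System.out.println(resolve(record, "/user/name"));    // JB
        System.out.println(resolve(record, "/user/langs/1")); // scala
    }
}
```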
>>>>> >
>>>>> > Avro is a nice competitor, and it is compatible - actually, Avro is
>>>>> > JSON-driven by design - but its API is far from easy due to its
>>>>> > schema enforcement, which is heavy; worse, you can't work with Avro
>>>>> > without a schema. JSON would allow reconciling the dynamic and
>>>>> > static cases, since the job wouldn't change except for the setSchema
>>>>> > call.
>>>>> >
>>>>> > That is why I think JSON is a good compromise, and having a standard
>>>>> > API for it allows fully customizing the implementation at will if
>>>>> > needed - even using Avro or Protobuf.
>>>>> >
>>>>> > Side note on the Beam API: I don't think it is good to use the main
>>>>> > API for runner optimization. It enforces something that has to be
>>>>> > shared by all runners but is not widely usable. It is also
>>>>> > misleading for users. Would you set a Flink pipeline option with
>>>>> > Dataflow? My proposal here is to use hints - properties - instead of
>>>>> > something hard-wired in the API, then standardize it if all runners
>>>>> > support it.
>>>>> >
>>>>> > Wdyt?
>>>>> >
>>>>> > On Jan 29, 2018 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net>
>>>>> > wrote:
>>>>> >
>>>>> > Hi Reuven,
>>>>> >
>>>>> > Thanks for the update! As I'm working with you on this, I fully
>>>>> > agree - great doc gathering the ideas.
>>>>> >
>>>>> > It's clearly something we have to add asap in Beam, because it would
>>>>> > allow new use cases for our users (in a simple way) and open new
>>>>> > areas for the runners (for instance, DataFrame support in the Spark
>>>>> > runner).
>>>>> >
>>>>> > By the way, a while ago I created BEAM-3437 to track the PoC/PR
>>>>> > around this.
>>>>> >
>>>>> > Thanks!
>>>>> >
>>>>> > Regards
>>>>> > JB
>>>>> >
>>>>> > On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>>> > > Previously I submitted a proposal for adding schemas as a
>>>>> > > first-class concept on Beam PCollections. The proposal engendered
>>>>> > > quite a bit of discussion from the community - more discussion
>>>>> > > than I've seen on almost any of our proposals to date!
>>>>> > >
>>>>> > > Based on the feedback and comments, I reworked the proposal
>>>>> > > document quite a bit. It now talks more explicitly about the
>>>>> > > difference between dynamic schemas (where the schema is not fully
>>>>> > > known at graph-creation time) and static schemas (which are fully
>>>>> > > known at graph-creation time). Proposed APIs are more fleshed out
>>>>> > > now (again, thanks to feedback from community members), and the
>>>>> > > document talks in more detail about evolving schemas in
>>>>> > > long-running streaming pipelines.
>>>>> > >
>>>>> > > Please take a look. I think this will be very valuable to Beam,
>>>>> > > and I welcome any feedback.
>>>>> > >
>>>>> > > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>>> > >
>>>>> > > Reuven
>>>>> >
>>>>> > --
>>>>> > Jean-Baptiste Onofré
>>>>> > jbono...@apache.org
>>>>> > http://blog.nanthrax.net
>>>>> > Talend - http://www.talend.com