Agree. The initial implementation will be a prototype.

On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi Reuven,
>
> Agreed on being able to describe the schema with different formats. The good
> point about JSON schemas is that they are described by a spec. My point is
> also to avoid reinventing the wheel. An abstraction able to use Avro, JSON,
> Calcite, or custom schema descriptors would be great.
>
> Using a coder to describe a schema sounds like a smart move to implement
> quickly. However, it has to be clearly documented to avoid "side effects". I
> still think PCollection.setSchema() is better: the schema should be metadata
> (or a hint ;)) on the PCollection.
>
> Regards
> JB
>
> On 31/01/2018 20:16, Reuven Lax wrote:
>> As to the question of how a schema should be specified, I want to support
>> several common schema formats. So if a user has a JSON schema, or an Avro
>> schema, or a Calcite schema, etc., there should be adapters that allow
>> setting a schema from any of them. I don't think we should prefer one over
>> the other. While Romain is right that many people know JSON, I think far
>> fewer people know JSON schemas.
>>
>> Agreed, schemas should not be enforced (for one thing, that wouldn't be
>> backwards compatible!). I think for the initial prototype I will probably
>> use a special coder to represent the schema (with setSchema an option on
>> the coder), largely because it doesn't require modifying PCollection.
>> However, I think longer term a schema should be an optional piece of
>> metadata on the PCollection object. Similar to the previous discussion
>> about "hints," I think this can be set on the producing PTransform, and a
>> SetSchema PTransform will allow attaching a schema to any PCollection
>> (i.e. pc.apply(SetSchema.of(schema))).
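[Editor's note: a minimal, self-contained sketch of the schema-as-optional-metadata idea discussed above. All names here (Schema, PCollectionLike, setSchema) are hypothetical stand-ins, not the Beam API, which was not settled at this point in the thread.]

```java
import java.util.List;
import java.util.Optional;

// Toy illustration of "schema as optional metadata on the PCollection",
// as opposed to baking it into a coder. Nothing here is real Beam API.
public class SchemaMetadataSketch {

    // A schema is reduced to an ordered list of field names for this sketch.
    public record Schema(List<String> fieldNames) {}

    // Stand-in for PCollection: elements plus optional schema metadata.
    public static class PCollectionLike<T> {
        public final List<T> elements;
        public Optional<Schema> schema = Optional.empty();

        public PCollectionLike(List<T> elements) {
            this.elements = elements;
        }

        // Rough equivalent of pc.apply(SetSchema.of(schema)): returns a new
        // collection with the schema attached; the elements are untouched.
        public PCollectionLike<T> setSchema(Schema s) {
            PCollectionLike<T> out = new PCollectionLike<>(elements);
            out.schema = Optional.of(s);
            return out;
        }
    }

    public static void main(String[] args) {
        PCollectionLike<String> pc =
            new PCollectionLike<>(List.of("beam,1", "schema,2"));
        PCollectionLike<String> withSchema =
            pc.setSchema(new Schema(List.of("name", "count")));
        // The schema is pure metadata: absent unless explicitly attached.
        System.out.println(pc.schema.isPresent());
        System.out.println(withSchema.schema.orElseThrow().fieldNames());
    }
}
```

The point of the sketch is the backward-compatibility property both participants agree on: a collection without a schema behaves exactly as before, and attaching one never touches the elements themselves.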
>> This part isn't designed yet, but I think schema should be similar to
>> hints: it's just another piece of metadata on the PCollection (though
>> something interpreted by the model, where hints are interpreted by the
>> runner).
>>
>> Reuven
>>
>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>>> Hi,
>>>
>>> I think we should avoid mixing two things in the discussion (and so in
>>> the document):
>>>
>>> 1. The element of the collection and the schema itself are two different
>>> things. By essence, Beam should not enforce any schema. That's why I
>>> think it's a good idea to set the schema optionally on the PCollection
>>> (pcollection.setSchema()).
>>>
>>> 2. From point 1 come two questions: how do we represent a schema? How
>>> can we leverage the schema to simplify the serialization of the element
>>> in the PCollection, and querying? These two questions are not directly
>>> related.
>>>
>>> 2.1 How do we represent the schema?
>>> JSON Schema is a very interesting idea. It could be an abstraction, and
>>> other providers, like Avro, could be bound to it. It's part of the JSON
>>> Processing spec (javax).
>>>
>>> 2.2 How do we leverage the schema for query and serialization?
>>> Also in the spec, JSON Pointer is interesting for querying. Regarding
>>> serialization, Jackson or another data binder can be used.
>>>
>>> These are still rough ideas in my mind, but I like Romain's idea about
>>> json-p usage.
>>>
>>> Once the 2.3.0 release is out, I will start to update the document with
>>> those ideas, and a PoC.
>>>
>>> Thanks!
>>> Regards
>>> JB
>>>
>>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>
>>>> On 30 Jan 2018 01:09, "Reuven Lax" <re...@google.com> wrote:
>>>>
>>>>> On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
>>>>> <rmannibu...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have some questions on this: how would hierarchical schemas work?
>>>>>> It seems they are not really supported by the ecosystem (outside of
>>>>>> custom stuff) :(. How would they integrate smoothly with other
>>>>>> generic record types - N bridges?
>>>>>
>>>>> Do you mean nested schemas? What do you mean here?
>>>>
>>>> Yes, sorry - wrote the mail too late ;). I meant hierarchical data and
>>>> nested schemas.
>>>>
>>>>>> Concretely, I wonder if using a JSON API couldn't be beneficial:
>>>>>> json-p is a nice generic abstraction with a built-in querying
>>>>>> mechanism (JSON Pointer) but no actual serialization (even if JSON
>>>>>> and binary JSON are very natural). The big advantage is having a
>>>>>> well-known ecosystem - who doesn't know JSON today? - that Beam can
>>>>>> reuse for free: JsonObject (I guess we don't want the JsonValue
>>>>>> abstraction) for the record type, the JSON Schema standard for the
>>>>>> schema, JSON Pointer for selection/projection, etc. It doesn't
>>>>>> enforce the actual serialization (JSON, Smile, Avro, ...) but
>>>>>> provides an expressive and already known API, so I see it as a big
>>>>>> win-win for users (no need to learn a new API and use N bridges in
>>>>>> all ways) and for Beam (the impls are here and the API design has
>>>>>> already been thought out).
>>>>>
>>>>> I assume you're talking about the API for setting schemas, not using
>>>>> them. JSON has many downsides and I'm not sure it's true that
>>>>> everyone knows it; there are also competing schema APIs, such as Avro
>>>>> etc. However, I think we should give JSON a fair evaluation before
>>>>> dismissing it.
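[Editor's note: the json-p proposal above leans on JSON Pointer for selection/projection. As a self-contained illustration of that mechanism only (not the javax.json API, which is not shown here), a toy pointer lookup over nested maps might look like this; the class and method names are invented.]

```java
import java.util.Map;

// Toy version of the JSON Pointer idea (RFC 6901): a path string such as
// "/user/name" selects a nested value. Real json-p exposes this as
// JsonPointer over JsonObject; here plain nested Maps stand in for JSON,
// and the ~0/~1 escaping and array-index rules of the spec are omitted.
public class PointerSketch {

    // Resolve a pointer like "/user/name" against nested maps.
    // The empty pointer selects the whole document, as in the spec.
    public static Object resolve(Object doc, String pointer) {
        if (pointer.isEmpty()) {
            return doc;
        }
        Object current = doc;
        // Skip the leading "/" and walk one reference token at a time.
        for (String token : pointer.substring(1).split("/")) {
            if (!(current instanceof Map<?, ?> m) || !m.containsKey(token)) {
                throw new IllegalArgumentException("No value at " + pointer);
            }
            current = m.get(token);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> doc =
            Map.of("user", Map.of("name", "Ada", "id", 42));
        System.out.println(resolve(doc, "/user/name")); // Ada
    }
}
```

This is the "built-in querying mechanism" Romain refers to: projection comes for free from the data representation, with no per-record schema required to evaluate a pointer.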
>>>> It is a wider topic than schema. Actually, schemas are not the
>>>> first-class citizen here; a generic data representation is. That is
>>>> where JSON beats almost any other API. Then, when it comes to schemas,
>>>> JSON has a standard for that, so we are all good.
>>>>
>>>> Also, JSON has a good indexing API compared to alternatives, which are
>>>> sometimes a bit faster - for no-op transforms - but are hardly usable
>>>> or make the code less readable.
>>>>
>>>> Avro is a nice competitor, and it is compatible - actually, Avro is
>>>> JSON-driven by design - but its API is far from easy due to its schema
>>>> enforcement, which is heavy; worse, you can't work with Avro without a
>>>> schema. JSON would allow reconciling the dynamic and static cases,
>>>> since the job wouldn't change except for the setSchema.
>>>>
>>>> That is why I think JSON is a good compromise: having a standard API
>>>> for it allows fully customizing the impl at will if needed - even
>>>> using Avro or Protobuf.
>>>>
>>>> Side note on the Beam API: I don't think it is good to use the main
>>>> API for runner optimization. It enforces something to be shared by all
>>>> runners but not widely usable, and it is also misleading for users.
>>>> Would you set a Flink pipeline option with Dataflow? My proposal here
>>>> is to use hints - properties - instead of something hardly defined in
>>>> the API, then standardize it if all runners support it.
>>>>
>>>>>> Wdyt?
>>>>>>
>>>>>> On 29 Jan 2018 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>>>>>>
>>>>>>> Hi Reuven,
>>>>>>>
>>>>>>> Thanks for the update! As I'm working with you on this, I fully
>>>>>>> agree - great doc gathering the ideas.
>>>>>>> It's clearly something we have to add ASAP in Beam, because it
>>>>>>> would allow new use cases for our users (in a simple way) and open
>>>>>>> new areas for the runners (for instance, dataframe support in the
>>>>>>> Spark runner).
>>>>>>>
>>>>>>> By the way, a while ago I created BEAM-3437 to track the PoC/PR
>>>>>>> around this.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>>>>>
>>>>>>>> Previously I submitted a proposal for adding schemas as a
>>>>>>>> first-class concept on Beam PCollections. The proposal engendered
>>>>>>>> quite a bit of discussion from the community - more discussion
>>>>>>>> than I've seen from almost any of our proposals to date!
>>>>>>>>
>>>>>>>> Based on the feedback and comments, I reworked the proposal
>>>>>>>> document quite a bit. It now talks more explicitly about the
>>>>>>>> difference between dynamic schemas (where the schema is not fully
>>>>>>>> known at graph-creation time) and static schemas (which are fully
>>>>>>>> known at graph-creation time). Proposed APIs are more fleshed out
>>>>>>>> now (again, thanks to feedback from community members), and the
>>>>>>>> document talks in more detail about evolving schemas in
>>>>>>>> long-running streaming pipelines.
>>>>>>>>
>>>>>>>> Please take a look. I think this will be very valuable to Beam,
>>>>>>>> and I welcome any feedback.
>>>>>>>> https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>>>>>>
>>>>>>>> Reuven
>>>>>>>
>>>>>>> --
>>>>>>> Jean-Baptiste Onofré
>>>>>>> jbono...@apache.org
>>>>>>> http://blog.nanthrax.net
>>>>>>> Talend - http://www.talend.com
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
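[Editor's note: the other half of the thread's debate - the "special coder" prototype Reuven mentions - can be sketched in miniature as well. A coder that already carries the schema can serialize rows as bare values, because field names and order travel with the schema rather than with every element. The names and the newline framing below are invented for illustration; this is not Beam's eventual SchemaCoder.]

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy schema-aware coder: the schema (ordered field names) is part of the
// coder, so encoded elements contain only the values, in schema order.
public class SchemaCoderSketch {

    public static byte[] encode(List<String> schema, Map<String, String> row) {
        // Values are written in schema order, separated by '\n' (toy framing;
        // a real coder would use length-prefixed binary encoding).
        StringBuilder sb = new StringBuilder();
        for (String field : schema) {
            sb.append(row.get(field)).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static Map<String, String> decode(List<String> schema, byte[] bytes) {
        String[] values = new String(bytes, StandardCharsets.UTF_8).split("\n", -1);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < schema.size(); i++) {
            // The schema restores the field names the encoding dropped.
            row.put(schema.get(i), values[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> schema = List.of("name", "count");
        Map<String, String> row = Map.of("name", "beam", "count", "7");
        System.out.println(decode(schema, encode(schema, row)));
    }
}
```

This also makes the trade-off in the thread concrete: the coder approach needs no change to PCollection, but the schema ends up entangled with serialization, which is exactly why JB argues for keeping it as separate metadata (a "hint") on the PCollection instead.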