I don't think "hint" is the right API, as a schema is not a hint (it has semantic meaning). However, I think the API for setting a schema should look similar to any "hint" API.
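To make the shape of that API concrete, here is a rough sketch of what a hint-style schema transform could look like. This is purely hypothetical - nothing here is designed or implemented; SetSchema, Schema, and SchemaCoder are placeholder names, and the coder trick is just the prototype approach Reuven mentions below:

    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PCollection;

    // Hypothetical: attach a schema to any PCollection, hint-style,
    // e.g. pc.apply(SetSchema.of(schema)).
    public class SetSchema<T> extends PTransform<PCollection<T>, PCollection<T>> {
      private final Schema schema; // placeholder schema type

      private SetSchema(Schema schema) {
        this.schema = schema;
      }

      public static <T> SetSchema<T> of(Schema schema) {
        return new SetSchema<>(schema);
      }

      @Override
      public PCollection<T> expand(PCollection<T> input) {
        // Prototype approach: carry the schema on a special coder so that
        // PCollection itself doesn't need to change; longer term it would
        // be optional metadata on the PCollection.
        return input.setCoder(SchemaCoder.of(schema, input.getCoder()));
      }
    }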
On Wed, Jan 31, 2018 at 11:40 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> Le 31 janv. 2018 20:16, "Reuven Lax" <re...@google.com> a écrit :
>
> As to the question of how a schema should be specified, I want to support several common schema formats. So if a user has a JSON schema, or an Avro schema, or a Calcite schema, etc., there should be adapters that allow setting a schema from any of them. I don't think we should prefer one over the other. While Romain is right that many people know JSON, I think far fewer people know JSON schemas.
>
> Agreed, but schemas would get an API for Beam usage - I don't think there is a standard we can use, and we can't use any vendor-specific API in Beam - so not a big deal IMO / not a blocker.
>
> Agreed, schemas should not be enforced (for one thing, that wouldn't be backwards compatible!). I think for the initial prototype I will probably use a special coder to represent the schema (with setSchema an option on the coder), largely because it doesn't require modifying PCollection. However, I think longer term a schema should be an optional piece of metadata on the PCollection object. Similar to the previous discussion about "hints," I think this can be set on the producing PTransform, and a SetSchema PTransform will allow attaching a schema to any PCollection (i.e. pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I think schemas should be similar to hints: just another piece of metadata on the PCollection (though something interpreted by the model, where hints are interpreted by the runner).
>
> Schemas should probably be contributable from the transform when mandatory - thinking of Avro IO here - or a hint as a fallback when optional. This sounds good to me and doesn't require another public API than hints.
>
> Reuven
>
> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> Hi,
>>
>> I think we should avoid mixing two things in the discussion (and so in the document):
>>
>> 1. The elements of the collection and the schema itself are two different things. In essence, Beam should not enforce any schema. That's why I think it's a good idea to set the schema optionally on the PCollection (pcollection.setSchema()).
>>
>> 2. From point 1 come two questions: how do we represent a schema? How can we leverage the schema to simplify the serialization of the elements in the PCollection, and querying? These two questions are not directly related.
>>
>> 2.1. How do we represent the schema?
>> JSON Schema is a very interesting idea. It could be an abstraction, and other providers, like Avro, could be bound to it. It's part of the JSON processing spec (javax).
>>
>> 2.2. How do we leverage the schema for querying and serialization?
>> Also in the spec, JSON Pointer is interesting for querying. Regarding serialization, Jackson or another data binder can be used.
>>
>> These are still rough ideas in my mind, but I like Romain's idea about json-p usage.
>>
>> Once the 2.3.0 release is out, I will start updating the document with those ideas, and a PoC.
>>
>> Thanks !
>> Regards
>> JB
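To make JB's json-p point concrete: a minimal sketch of JsonObject plus JsonPointer, using only the javax.json API (JSON-P 1.1); the record shape is made up:

    import javax.json.Json;
    import javax.json.JsonObject;
    import javax.json.JsonPointer;

    public class JsonPointerSketch {
      public static void main(String[] args) {
        // A generic record built with JSON-P - no schema required.
        JsonObject user = Json.createObjectBuilder()
            .add("name", "alice")
            .add("address", Json.createObjectBuilder()
                .add("city", "Paris"))
            .build();

        // JSON Pointer (RFC 6901) is the standard querying/projection
        // mechanism mentioned above.
        JsonPointer city = Json.createPointer("/address/city");
        System.out.println(city.getValue(user)); // prints "Paris"
      }
    }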
>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>> >
>> > Le 30 janv. 2018 01:09, "Reuven Lax" <re...@google.com> a écrit :
>> >
>> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>> >
>> > Hi
>> >
>> > I have some questions on this: how would hierarchical schemas work? It seems that is not really supported by the ecosystem (outside of custom stuff) :(. How would it integrate smoothly with other generic record types - N bridges?
>> >
>> > Do you mean nested schemas? What do you mean here?
>> >
>> > Yes, sorry - wrote the mail too late ;). I meant hierarchical data and nested schemas.
>> >
>> > Concretely, I wonder if using a JSON API couldn't be beneficial: json-p is a nice generic abstraction with a built-in querying mechanism (JsonPointer) but no mandated serialization (even if JSON and binary JSON are very natural). The big advantage is to have a well-known ecosystem - who doesn't know JSON today? - that Beam can reuse for free: JsonObject (I guess we don't want the JsonValue abstraction) for the record type, the JSON Schema standard for the schema, JsonPointer for selection/projection, etc. It doesn't enforce the actual serialization (JSON, Smile, Avro, ...) but provides an expressive and already known API, so I see it as a big win-win for users (no need to learn a new API and use N bridges in all ways) and for Beam (implementations are there and the API design is already thought out).
>> >
>> > I assume you're talking about the API for setting schemas, not using them. JSON has many downsides, and I'm not sure it's true that everyone knows it; there are also competing schema APIs, such as Avro, etc. However, I think we should give JSON a fair evaluation before dismissing it.
>> >
>> > It is a wider topic than schemas. Actually, schemas are not the first-class citizen here, a generic data representation is. That is where JSON beats almost any other API. Then, when it comes to schemas, JSON has a standard for that, so we are all good.
>> >
>> > Also, JSON has a good indexing API compared to alternatives, which are sometimes a bit faster - for no-op transforms - but are hardly usable or make the code not that readable.
>> >
>> > Avro is a nice competitor, and it is compatible - actually Avro is JSON-driven by design - but its API is far from being that easy, due to its schema enforcement, which is heavy, and worse, you can't work with Avro without a schema. JSON would allow reconciling the dynamic and static cases, since the job wouldn't change except for the setSchema call.
>> >
>> > That is why I think JSON is a good compromise: having a standard API for it allows fully customizing the implementation at will if needed - even using Avro or Protobuf.
>> >
>> > Side note on the Beam API: I don't think it is good to use a main API for runner optimization. It enforces something to be shared by all runners but not widely usable. It is also misleading for users. Would you set a Flink pipeline option with Dataflow? My proposal here is to use hints - properties - instead of something rigidly defined in the API, then standardize it if all runners support it.
>> >
>> > Wdyt?
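Romain's point about Avro's schema enforcement is easy to see in code. A rough contrast, assuming the stock Avro GenericRecord API; the record shape is made up:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class AvroVsJsonP {
      public static void main(String[] args) {
        // Avro: a Schema must exist before a record can even be constructed.
        Schema schema = SchemaBuilder.record("User").fields()
            .requiredString("name")
            .endRecord();
        GenericRecord avroUser = new GenericData.Record(schema);
        avroUser.put("name", "alice");

        // JSON-P: the same record with no schema at all - the dynamic case
        // works unchanged, and a schema can still be attached later.
        javax.json.JsonObject jsonUser = javax.json.Json.createObjectBuilder()
            .add("name", "alice")
            .build();

        System.out.println(avroUser);
        System.out.println(jsonUser);
      }
    }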
>> > Le 29 janv. 2018 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> a écrit :
>> >
>> > Hi Reuven,
>> >
>> > Thanks for the update ! As I'm working with you on this, I fully agree - great doc gathering the ideas.
>> >
>> > It's clearly something we have to add asap in Beam, because it would allow new use cases for our users (in a simple way) and open new areas for the runners (for instance, DataFrame support in the Spark runner).
>> >
>> > By the way, a while ago I created BEAM-3437 to track the PoC/PR around this.
>> >
>> > Thanks !
>> >
>> > Regards
>> > JB
>> >
>> > On 01/29/2018 02:08 AM, Reuven Lax wrote:
>> > > Previously I submitted a proposal for adding schemas as a first-class concept on Beam PCollections. The proposal engendered quite a bit of discussion from the community - more discussion than I've seen from almost any of our proposals to date!
>> > >
>> > > Based on the feedback and comments, I reworked the proposal document quite a bit. It now talks more explicitly about the difference between dynamic schemas (where the schema is not fully known at graph-creation time) and static schemas (which are fully known at graph-creation time). Proposed APIs are more fleshed out now (again, thanks to feedback from community members), and the document talks in more detail about evolving schemas in long-running streaming pipelines.
>> > >
>> > > Please take a look. I think this will be very valuable to Beam, and I welcome any feedback.
>> > >
>> > > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>> > >
>> > > Reuven
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com