If you need help with the JSON part, I'm happy to help. To give a hint of what is very doable: we could, for instance, add an Avro module to Johnzon (the ASF JSON-P/JSON-B implementation) so that JSON-P is backed by Avro (I guess it will be one of the first things to be asked for).
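
To make the jsonpointer idea from the thread below a bit more concrete, here is a rough sketch of pointer-based selection/projection over a generic record. It is only an illustration under assumptions: it uses plain java.util Map/List values as a stand-in for JsonObject/JsonArray so that it runs without any dependency, and the names PointerSketch, resolve and project are made up for this mail. With JSON-P 1.1 itself, the equivalent lookup would go through javax.json.JsonPointer (Json.createPointer(...).getValue(...)).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: RFC 6901-style pointer lookup over nested Map/List records. */
final class PointerSketch {

    private PointerSketch() {}

    /** Resolves a pointer such as "/user/langs/0" against nested Map/List values. */
    static Object resolve(Object root, String pointer) {
        if (pointer.isEmpty()) {
            return root; // the empty pointer designates the whole record
        }
        if (pointer.charAt(0) != '/') {
            throw new IllegalArgumentException("pointer must be empty or start with '/': " + pointer);
        }
        Object current = root;
        // limit -1 keeps empty tokens, so "/" addresses the member named ""
        for (String raw : pointer.substring(1).split("/", -1)) {
            // RFC 6901 unescaping: "~1" becomes "/" first, then "~0" becomes "~"
            String token = raw.replace("~1", "/").replace("~0", "~");
            if (current instanceof Map) {
                Map<?, ?> map = (Map<?, ?>) current;
                if (!map.containsKey(token)) {
                    throw new IllegalArgumentException("no member '" + token + "'");
                }
                current = map.get(token);
            } else if (current instanceof List) {
                List<?> list = (List<?>) current;
                int index = Integer.parseInt(token); // non-numeric tokens are rejected here
                if (index < 0 || index >= list.size()) {
                    throw new IllegalArgumentException("index out of range: " + index);
                }
                current = list.get(index);
            } else {
                throw new IllegalArgumentException("cannot descend into a scalar at '" + token + "'");
            }
        }
        return current;
    }

    /** Projects a record onto the given pointers; the result is keyed by pointer. */
    static Map<String, Object> project(Object root, String... pointers) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String pointer : pointers) {
            out.put(pointer, resolve(root, pointer));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("name", "romain");
        user.put("langs", Arrays.asList("fr", "en"));
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("user", user);

        // Select two fields of the record without any schema being declared.
        System.out.println(project(record, "/user/name", "/user/langs/0"));
    }
}
```

The point of the sketch is that the selection code does not care whether a schema was set or how the record is serialized, which is what would let the same job cover both the dynamic and the static case.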

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:

> Agree. The initial implementation will be a prototype.
>
> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Hi Reuven,
>>
>> I agree that we should be able to describe the schema in different
>> formats. The good point about json schemas is that they are described by
>> a spec. My point is also to avoid reinventing the wheel. Just an
>> abstraction that lets us use Avro, Json, Calcite, or custom schema
>> descriptors would be great.
>>
>> Using a coder to describe a schema sounds like a smart move to implement
>> quickly. However, it has to be clear in terms of documentation to avoid
>> "side effects". I still think PCollection.setSchema() is better: it
>> should be metadata (or hint ;))) on the PCollection.
>>
>> Regards
>> JB
>>
>> On 31/01/2018 20:16, Reuven Lax wrote:
>>
>>> As to the question of how a schema should be specified, I want to
>>> support several common schema formats. So if a user has a Json schema,
>>> or an Avro schema, or a Calcite schema, etc. there should be adapters
>>> that allow setting a schema from any of them. I don't think we should
>>> prefer one over the other. While Romain is right that many people know
>>> Json, I think far fewer people know Json schemas.
>>>
>>> Agree, schemas should not be enforced (for one thing, that wouldn't be
>>> backwards compatible!). I think for the initial prototype I will
>>> probably use a special coder to represent the schema (with setSchema an
>>> option on the coder), largely because it doesn't require modifying
>>> PCollection. However I think longer term a schema should be an optional
>>> piece of metadata on the PCollection object.
>>> Similar to the previous discussion about "hints," I think this can be
>>> set on the producing PTransform, and a SetSchema PTransform will allow
>>> attaching a schema to any PCollection (i.e.
>>> pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I
>>> think schema should be similar to hints, it's just another piece of
>>> metadata on the PCollection (though something interpreted by the model,
>>> where hints are interpreted by the runner)
>>>
>>> Reuven
>>>
>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>>     Hi,
>>>
>>>     I think we should avoid mixing two things in the discussion (and so
>>>     in the document):
>>>
>>>     1. The element of the collection and the schema itself are two
>>>     different things. By essence, Beam should not enforce any schema.
>>>     That's why I think it's a good idea to set the schema optionally on
>>>     the PCollection (pcollection.setSchema()).
>>>
>>>     2. From point 1 come two questions: how do we represent a schema?
>>>     How can we leverage the schema to simplify the serialization of the
>>>     element in the PCollection and query? These two questions are not
>>>     directly related.
>>>
>>>     2.1. How do we represent the schema
>>>     Json Schema is a very interesting idea. It could be an abstraction,
>>>     and other providers, like Avro, can be bound to it. It's part of the
>>>     json processing spec (javax).
>>>
>>>     2.2. How do we leverage the schema for query and serialization
>>>     Also in the spec, json pointer is interesting for the querying.
>>>     Regarding the serialization, jackson or another data binder can be
>>>     used.
>>>
>>>     These are still rough ideas in my mind, but I like Romain's idea
>>>     about json-p usage.
>>>
>>>     Once the 2.3.0 release is out, I will start updating the document
>>>     with those ideas, and a PoC.
>>>
>>>     Thanks !
>>>     Regards
>>>     JB
>>>
>>>     On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>     >
>>>     > On 30 Jan 2018 01:09, "Reuven Lax" <re...@google.com> wrote:
>>>     >
>>>     >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
>>>     >     <rmannibu...@gmail.com> wrote:
>>>     >
>>>     >         Hi
>>>     >
>>>     >         I have some questions on this: how would hierarchical
>>>     >         schemas work? It seems they are not really supported by
>>>     >         the ecosystem (apart from custom stuff) :(. How would it
>>>     >         integrate smoothly with other generic record types - N
>>>     >         bridges?
>>>     >
>>>     >     Do you mean nested schemas? What do you mean here?
>>>     >
>>>     > Yes, sorry - I wrote the mail too late ;). I meant hierarchical
>>>     > data and nested schemas.
>>>     >
>>>     >         Concretely I wonder if using the json API couldn't be
>>>     >         beneficial: json-p is a nice generic abstraction with a
>>>     >         built-in querying mechanism (jsonpointer) but no actual
>>>     >         serialization (even if json and binary json are very
>>>     >         natural). The big advantage is to have a well known
>>>     >         ecosystem - who doesn't know json today? - that beam can
>>>     >         reuse for free: JsonObject (guess we don't want the
>>>     >         JsonValue abstraction) for the record type, the
>>>     >         jsonschema standard for the schema, jsonpointer for the
>>>     >         selection/projection etc... It doesn't enforce the actual
>>>     >         serialization (json, smile, avro, ...) but provides an
>>>     >         expressive and already known API, so I see it as a big
>>>     >         win-win for users (no need to learn a new API and use N
>>>     >         bridges in all ways) and beam (impls are here and the API
>>>     >         design is already thought out).
>>>     >
>>>     >     I assume you're talking about the API for setting schemas,
>>>     >     not using them. Json has many downsides and I'm not sure it's
>>>     >     true that everyone knows it; there are also competing schema
>>>     >     APIs, such as Avro etc.
>>>     >     However, I think we should give Json a fair evaluation before
>>>     >     dismissing it.
>>>     >
>>>     > It is a wider topic than schemas. Actually, schemas are not the
>>>     > first-class citizen; a generic data representation is. That is
>>>     > where json beats almost any other API. Then, when it comes to
>>>     > schemas, json has a standard for that, so we are all good.
>>>     >
>>>     > Also json has a good indexing API compared to alternatives, which
>>>     > are sometimes a bit faster - for noop transforms - but are hardly
>>>     > usable or make the code not that readable.
>>>     >
>>>     > Avro is a nice competitor and it is compatible - actually avro is
>>>     > json driven by design - but its API is far from being that easy,
>>>     > due to its schema enforcement, which is heavy; worse, you can't
>>>     > work with avro without a schema. Json would allow reconciling the
>>>     > dynamic and static cases, since the job wouldn't change except
>>>     > for the setSchema.
>>>     >
>>>     > That is why I think json is a good compromise, and having a
>>>     > standard API for it allows fully customizing the impl at will if
>>>     > needed - even using avro or protobuf.
>>>     >
>>>     > Side note on the beam api: I don't think it is good to use a main
>>>     > API for runner optimization. It enforces something to be shared
>>>     > on all runners but not widely usable. It is also misleading for
>>>     > users. Would you set a flink pipeline option with dataflow? My
>>>     > proposal here is to use hints - properties - instead of something
>>>     > rigidly defined in the API, then standardize it if all runners
>>>     > support it.
>>>     >
>>>     >         Wdyt?
>>>     >
>>>     >         On 29 Jan 2018 06:24, "Jean-Baptiste Onofré"
>>>     >         <j...@nanthrax.net> wrote:
>>>     >
>>>     >             Hi Reuven,
>>>     >
>>>     >             Thanks for the update ! As I'm working with you on
>>>     >             this, I fully agree and great doc gathering the ideas.
>>>     >
>>>     >             It's clearly something we have to add asap in Beam,
>>>     >             because it would allow new use cases for our users (in
>>>     >             a simple way) and open new areas for the runners (for
>>>     >             instance dataframe support in the Spark runner).
>>>     >
>>>     >             By the way, a while ago, I created BEAM-3437 to track
>>>     >             the PoC/PR around this.
>>>     >
>>>     >             Thanks !
>>>     >
>>>     >             Regards
>>>     >             JB
>>>     >
>>>     >             On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>     >             > Previously I submitted a proposal for adding schemas
>>>     >             > as a first-class concept on Beam PCollections. The
>>>     >             > proposal engendered quite a bit of discussion from
>>>     >             > the community - more discussion than I've seen from
>>>     >             > almost any of our proposals to date!
>>>     >             >
>>>     >             > Based on the feedback and comments, I reworked the
>>>     >             > proposal document quite a bit. It now talks more
>>>     >             > explicitly about the difference between dynamic
>>>     >             > schemas (where the schema is not fully known at
>>>     >             > graph-creation time) and static schemas (which are
>>>     >             > fully known at graph-creation time). Proposed APIs
>>>     >             > are more fleshed out now (again thanks to feedback
>>>     >             > from community members), and the document talks in
>>>     >             > more detail about evolving schemas in long-running
>>>     >             > streaming pipelines.
>>>     >             >
>>>     >             > Please take a look. I think this will be very
>>>     >             > valuable to Beam, and welcome any feedback.
>>>     >             >
>>>     >             > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>     >             >
>>>     >             > Reuven
>>>     >
>>>     >             --
>>>     >             Jean-Baptiste Onofré
>>>     >             jbono...@apache.org
>>>     >             http://blog.nanthrax.net
>>>     >             Talend - http://www.talend.com
>>>
>>>     --
>>>     Jean-Baptiste Onofré
>>>     jbono...@apache.org
>>>     http://blog.nanthrax.net
>>>     Talend - http://www.talend.com