Hi,

I think we should avoid mixing two things in the discussion (and so in the
document):

1. The elements of the collection and the schema itself are two different things.
In essence, Beam should not enforce any schema. That's why I think it's a good
idea to set the schema optionally on the PCollection (pcollection.setSchema()).
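
For illustration, here is a minimal sketch of what optional schema attachment can look like. The class and method names (Schema.builder(), Row, withRowSchema()) follow Beam's Schema/Row work and are illustrative of the direction, not of this exact proposal:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;

    public class OptionalSchemaSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

        // The schema is metadata describing the element shape; it lives next
        // to the elements instead of being enforced by them.
        Schema userSchema = Schema.builder()
            .addStringField("name")
            .addInt32Field("age")
            .build();

        Row user = Row.withSchema(userSchema).addValues("alice", 42).build();

        // Attaching the schema is optional: a pipeline that never declares
        // one still runs, it just cannot use schema-aware transforms.
        PCollection<Row> users = p.apply(Create.of(user).withRowSchema(userSchema));

        // Pipeline construction only; run() omitted.
        System.out.println(users.hasSchema() ? users.getSchema() : "no schema");
      }
    }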

2. From point 1 come two questions: how do we represent a schema? And how can we
leverage the schema to simplify the serialization of the elements in the
PCollection, and querying? These two questions are not directly related.

 2.1 How do we represent the schema
JSON Schema is a very interesting idea. It could be the abstraction that other
providers, like Avro, can be bound to, and the generic JSON model to carry it
comes from the javax JSON processing spec (JSON-P).
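
As a quick illustration (a sketch, assuming plain JSON-P; the schema text is just an example): a JSON Schema document is itself JSON, so the standard javax.json types can hold it, and a provider like Avro could translate its own schema format to and from it.

    import java.io.StringReader;
    import javax.json.Json;
    import javax.json.JsonObject;

    public class JsonSchemaSketch {
      public static void main(String[] args) {
        // A JSON Schema document is plain JSON: JSON-P can represent it
        // without any new type.
        String schema = "{"
            + " \"type\": \"object\","
            + " \"properties\": {"
            + "   \"name\": { \"type\": \"string\" },"
            + "   \"age\":  { \"type\": \"integer\" }"
            + " }"
            + "}";

        JsonObject schemaObject =
            Json.createReader(new StringReader(schema)).readObject();
        System.out.println(schemaObject.getJsonObject("properties").keySet());
        // prints: [name, age]
      }
    }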

 2.2. How do we leverage the schema for query and serialization
Also in the spec, JSON Pointer is interesting for querying. Regarding
serialization, Jackson or another data binder can be used.
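
A small sketch of both pieces together (assuming JSON-P 1.1 for the pointer and Jackson for the binding; the record content is just an example):

    import java.io.StringReader;
    import javax.json.Json;
    import javax.json.JsonObject;
    import javax.json.JsonValue;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class PointerAndBindingSketch {
      public static void main(String[] args) throws Exception {
        JsonObject record = Json.createReader(new StringReader(
            "{\"user\": {\"name\": \"alice\", \"age\": 42}}")).readObject();

        // Querying: JSON Pointer is part of the javax.json spec (JSON-P 1.1).
        JsonValue name = Json.createPointer("/user/name").getValue(record);

        // Serialization: left to a data binder such as Jackson.
        ObjectMapper mapper = new ObjectMapper();
        byte[] bytes = mapper.writeValueAsBytes(mapper.readTree(record.toString()));

        System.out.println(name + " -> " + bytes.length + " bytes");
      }
    }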

These are still rough ideas in my mind, but I like Romain's idea about JSON-P usage.

Once the 2.3.0 release is out, I will start to update the document with those
ideas, and a PoC.

Thanks!
Regards
JB

On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
> 
> 
> On 30 Jan 2018 at 01:09, "Reuven Lax" <re...@google.com> wrote:
> 
> 
> 
>     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
>     <rmannibu...@gmail.com> wrote:
> 
>         Hi
> 
>         I have some questions on this: how would hierarchic schemas work? It
>         seems that is not really supported by the ecosystem (outside of custom
>         stuff) :(. How would it integrate smoothly with other generic record
>         types - N bridges?
> 
> 
>     Do you mean nested schemas? What do you mean here? 
> 
> 
> Yes, sorry - I wrote the mail too late ;). I meant hierarchic data and nested
> schemas.
> 
> 
>         Concretely, I wonder if using a JSON API couldn't be beneficial:
>         JSON-P is a nice generic abstraction with a built-in querying
>         mechanism (JsonPointer) but no actual serialization (even if JSON and
>         binary JSON are very natural). The big advantage is having a
>         well-known ecosystem - who doesn't know JSON today? - that Beam can
>         reuse for free: JsonObject (I guess we don't want the JsonValue
>         abstraction) for the record type, the JSON Schema standard for the
>         schema, JsonPointer for the selection/projection, etc. It doesn't
>         enforce the actual serialization (JSON, Smile, Avro, ...) but
>         provides an expressive and already known API, so I see it as a big
>         win-win for users (no need to learn a new API and use N bridges in
>         all ways) and for Beam (the implementations are already here and the
>         API design has already been thought through).
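>
>         A minimal sketch of the idea, assuming JSON-P 1.1 (javax.json); the
>         record content is only an example:
>
>         import javax.json.Json;
>         import javax.json.JsonObject;
>         import javax.json.JsonString;
>
>         public class JsonRecordSketch {
>           public static void main(String[] args) {
>             // JsonObject as the generic record type: built and consumed
>             // without any schema being required up front.
>             JsonObject user = Json.createObjectBuilder()
>                 .add("name", "alice")
>                 .add("address", Json.createObjectBuilder().add("city", "Paris"))
>                 .build();
>
>             // Selection/projection through the standard JSON Pointer
>             // mechanism, instead of per-backend bridge code.
>             JsonString city =
>                 (JsonString) Json.createPointer("/address/city").getValue(user);
>             System.out.println(city.getString()); // Paris
>           }
>         }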
> 
> 
>     I assume you're talking about the API for setting schemas, not using them.
>     JSON has many downsides and I'm not sure it's true that everyone knows it;
>     there are also competing schema APIs, such as Avro, etc. However, I think
>     we should give JSON a fair evaluation before dismissing it.
> 
> 
> It is a wider topic than schemas. Actually, schemas are not the first-class
> citizen here - a generic data representation is. That is where JSON beats
> almost any other API. Then, when it comes to schemas, JSON has a standard for
> that, so we are all good.
> 
> Also, JSON has a good indexing API compared to alternatives which are
> sometimes a bit faster - for no-op transforms - but are hardly usable or make
> the code not that readable.
> 
> Avro is a nice competitor, and it is compatible - actually, Avro is
> JSON-driven by design - but its API is far from being that easy due to its
> schema enforcement, which is heavy; worse, you can't work with Avro without a
> schema. JSON would allow us to reconcile the dynamic and static cases, since
> the job wouldn't change except for the setSchema call.
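>
> A tiny comparison sketch of what I mean (Avro GenericRecord vs JSON-P; only
> illustrative):
>
>     import javax.json.Json;
>     import javax.json.JsonObject;
>     import org.apache.avro.Schema;
>     import org.apache.avro.SchemaBuilder;
>     import org.apache.avro.generic.GenericData;
>     import org.apache.avro.generic.GenericRecord;
>
>     public class AvroVsJsonSketch {
>       public static void main(String[] args) {
>         // Avro: the schema must exist before any record can be created.
>         Schema avroSchema = SchemaBuilder.record("User").fields()
>             .requiredString("name")
>             .endRecord();
>         GenericRecord avroUser = new GenericData.Record(avroSchema);
>         avroUser.put("name", "alice");
>
>         // JSON-P: the record comes first; a schema can be attached later,
>         // only if and when it is needed.
>         JsonObject jsonUser = Json.createObjectBuilder().add("name", "alice").build();
>
>         System.out.println(avroUser + " vs " + jsonUser);
>       }
>     }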
> 
> That is why I think JSON is a good compromise: having a standard API for it
> allows you to fully customize the impl at will if needed - even using Avro or
> protobuf.
> 
> Side note on the Beam API: I don't think it is good to use the main API for
> runner optimization. It forces something to be shared across all runners
> while not being widely usable. It is also misleading for users. Would you set
> a Flink pipeline option with Dataflow? My proposal here is to use hints -
> properties - instead of something hard-wired in the API, and then standardize
> it once all runners support it.
> 
> 
> 
>         Wdyt?
> 
>         On 29 Jan 2018 at 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
> 
>             Hi Reuven,
> 
>             Thanks for the update! As I'm working with you on this, I fully
>             agree - it's a great doc gathering the ideas.
> 
>             It's clearly something we have to add ASAP in Beam, because it
>             would allow new use cases for our users (in a simple way) and
>             open new areas for the runners (for instance, dataframe support
>             in the Spark runner).
> 
>             By the way, a while ago I created BEAM-3437 to track the PoC/PR
>             around this.
> 
>             Thanks!
> 
>             Regards
>             JB
> 
>             On 01/29/2018 02:08 AM, Reuven Lax wrote:
>             > Previously I submitted a proposal for adding schemas as a
>             > first-class concept on Beam PCollections. The proposal
>             > engendered quite a bit of discussion from the community - more
>             > discussion than I've seen from almost any of our proposals to
>             > date!
>             >
>             > Based on the feedback and comments, I reworked the proposal
>             > document quite a bit. It now talks more explicitly about the
>             > difference between dynamic schemas (where the schema is not
>             > fully known at graph-creation time) and static schemas (which
>             > are fully known at graph-creation time). Proposed APIs are more
>             > fleshed out now (again, thanks to feedback from community
>             > members), and the document talks in more detail about evolving
>             > schemas in long-running streaming pipelines.
>             >
>             > Please take a look. I think this will be very valuable to Beam,
>             > and welcome any feedback.
>             >
>             > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>             >
>             > Reuven
> 
>             --
>             Jean-Baptiste Onofré
>             jbono...@apache.org
>             http://blog.nanthrax.net
>             Talend - http://www.talend.com
> 
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
