Hi Reuven,
I agree that we should be able to describe the schema with different
formats. The good point about JSON schemas is that they are described by
a spec. My point is also to avoid reinventing the wheel. An abstraction
that lets us use Avro, JSON, Calcite, or custom schema descriptors would
be great.
Using a coder to describe a schema sounds like a smart move to implement
quickly. However, it has to be clearly documented to avoid "side
effects". I still think PCollection.setSchema() is better: it should be
metadata (or a hint ;)) on the PCollection.
Regards
JB
On 31/01/2018 20:16, Reuven Lax wrote:
As to the question of how a schema should be specified, I want to
support several common schema formats. So if a user has a JSON schema,
an Avro schema, a Calcite schema, etc., there should be adapters that
allow setting a schema from any of them. I don't think we should prefer
one over the other. While Romain is right that many people know JSON, I
think far fewer people know JSON schemas.
Agree, schemas should not be enforced (for one thing, that wouldn't be
backwards compatible!). I think for the initial prototype I will
probably use a special coder to represent the schema (with setSchema an
option on the coder), largely because it doesn't require modifying
PCollection. However I think longer term a schema should be an optional
piece of metadata on the PCollection object. Similar to the previous
discussion about "hints," I think this can be set on the producing
PTransform, and a SetSchema PTransform will allow attaching a schema to
any PCollection (i.e. pc.apply(SetSchema.of(schema))). This part isn't
designed yet, but I think a schema should be similar to hints: just
another piece of metadata on the PCollection (though interpreted by the
model, whereas hints are interpreted by the runner).
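As a rough pseudocode sketch of the two stages described above
(SchemaCoder and SetSchema are hypothetical names from this discussion,
not existing Beam API):

```
// Initial prototype: carry the schema on a special coder,
// so PCollection itself does not need to change.
PCollection<Row> rows = input.apply(ParDo.of(new ToRowFn()));
rows.setCoder(SchemaCoder.of(schema));

// Longer term: schema as optional metadata on the PCollection,
// attached with a dedicated transform, analogous to hints.
PCollection<Row> withSchema = rows.apply(SetSchema.of(schema));
```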
Reuven
On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré
<j...@nanthrax.net> wrote:
Hi,
I think we should avoid mixing two things in the discussion (and so in
the document):
1. The elements of the collection and the schema itself are two
different things.
By essence, Beam should not enforce any schema. That's why I think it's
a good idea to set the schema optionally on the PCollection
(pcollection.setSchema()).
2. From point 1 come two questions: how do we represent a schema? How
can we leverage the schema to simplify the serialization of the elements
in the PCollection, and the querying? These two questions are not
directly related.
2.1. How do we represent the schema
JSON Schema is a very interesting idea. It could serve as an abstraction
onto which other providers, like Avro, can be bound, with the JSON
Processing spec (javax.json) providing the supporting API.
2.2. How do we leverage the schema for querying and serialization
Also in the spec, JSON Pointer is interesting for the querying.
Regarding the serialization, Jackson or another data binder can be
used.
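To make the JSON Pointer idea concrete, here is a minimal sketch of the
kind of lookup it standardizes (RFC 6901), written against plain JDK
maps and lists rather than the javax.json API, so it runs without any
extra dependency:

```java
import java.util.List;
import java.util.Map;

public class PointerDemo {

    // Resolve an RFC 6901-style pointer such as "/user/emails/0"
    // against nested Maps (JSON objects) and Lists (JSON arrays).
    static Object resolve(Object doc, String pointer) {
        if (pointer.isEmpty()) {
            return doc; // "" points at the whole document
        }
        Object current = doc;
        for (String token : pointer.substring(1).split("/", -1)) {
            // Unescape per RFC 6901: "~1" -> "/", then "~0" -> "~"
            token = token.replace("~1", "/").replace("~0", "~");
            if (current instanceof Map) {
                current = ((Map<?, ?>) current).get(token);
            } else if (current instanceof List) {
                current = ((List<?>) current).get(Integer.parseInt(token));
            } else {
                throw new IllegalArgumentException("Cannot navigate into " + current);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of(
            "user", Map.of(
                "name", "jb",
                "emails", List.of("a@example.org", "b@example.org")));
        System.out.println(resolve(doc, "/user/name"));      // jb
        System.out.println(resolve(doc, "/user/emails/1"));  // b@example.org
    }
}
```

In JSON-P itself this is simply Json.createPointer("/user/name")
applied to a JsonObject; the point is that the query syntax is
specified, so it works the same regardless of the underlying
serialization.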
These are still rough ideas in my mind, but I like Romain's idea about
using JSON-P.
Once the 2.3.0 release is out, I will start to update the document with
those ideas, and a PoC.
Thanks !
Regards
JB
On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>
>
> On 30 Jan 2018 at 01:09, "Reuven Lax" <re...@google.com> wrote:
>
>
>
> On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
> <rmannibu...@gmail.com> wrote:
>
> Hi
>
> I have some questions on this: how would hierarchical schemas work?
> It seems they are not really supported by the ecosystem (outside of
> custom stuff) :(. How would they integrate smoothly with other generic
> record types - N bridges?
>
>
> Do you mean nested schemas? What do you mean here?
>
>
> Yes, sorry - I wrote the mail too late ;). I meant hierarchical data
> and nested schemas.
>
>
> Concretely, I wonder if using a JSON API couldn't be beneficial:
> JSON-P is a nice generic abstraction with a built-in querying
> mechanism (JSON Pointer) but no actual serialization (even if JSON and
> binary JSON are very natural). The big advantage is to have a
> well-known ecosystem - who doesn't know JSON today? - that Beam can
> reuse for free: JsonObject (I guess we don't want the JsonValue
> abstraction) for the record type, the JSON Schema standard for the
> schema, JSON Pointer for the selection/projection, etc. It doesn't
> enforce the actual serialization (JSON, Smile, Avro, ...) but provides
> an expressive and already known API, so I see it as a big win-win for
> users (no need to learn a new API or use N bridges in all directions)
> and for Beam (the implementations are already here and the API design
> has already been thought out).
>
>
> I assume you're talking about the API for setting schemas, not using
> them. JSON has many downsides and I'm not sure it's true that everyone
> knows it; there are also competing schema APIs, such as Avro, etc.
> However, I think we should give JSON a fair evaluation before
> dismissing it.
>
>
> It is a wider topic than schemas. Actually, schemas are not the
> first-class citizen here; a generic data representation is. That is
> where JSON beats almost any other API. Then, when it comes to schemas,
> JSON has a standard for that, so we are all good.
>
> Also, JSON has a good indexing API compared to alternatives, which are
> sometimes a bit faster - for no-op transforms - but are hardly usable
> or make the code less readable.
>
> Avro is a nice competitor, and it is compatible - actually Avro is
> JSON-driven by design - but its API is far from being that easy, due
> to its schema enforcement, which is heavy; worse, you can't work with
> Avro without a schema. JSON would allow us to reconcile the dynamic
> and static cases, since the job wouldn't change except for the
> setSchema call.
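Romain's dynamic/static point can be sketched in pseudocode (ParseJson,
Project, and setSchema are hypothetical names used for illustration,
not existing API):

```
// Dynamic case: no schema, elements are generic JSON-like records
PCollection<JsonObject> records = input.apply(ParseJson.of());
records.apply(Project.of("/user/name"));

// Static case: identical job, with the schema attached on top;
// setting the schema is the only change.
records.setSchema(schema);
records.apply(Project.of("/user/name"));
```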
>
> That is why I think JSON is a good compromise, and having a standard
> API for it allows the implementation to be fully customized at will if
> needed - even using Avro or Protobuf.
>
> Side note on the Beam API: I don't think it is good to use the main
> API for runner optimization. It forces something to be shared across
> all runners while not being widely usable. It is also misleading for
> users. Would you set a Flink pipeline option with Dataflow? My
> proposal here is to use hints - properties - instead of something
> hard-coded in the API, and then standardize it if all runners support
> it.
>
>
>
> Wdyt?
>
> On 29 Jan 2018 at 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net>
> wrote:
>
> Hi Reuven,
>
> Thanks for the update! As I'm working with you on this, I fully
> agree - great doc gathering the ideas.
>
> It's clearly something we have to add ASAP in Beam, because it would
> allow new use cases for our users (in a simple way) and open new areas
> for the runners (for instance, DataFrame support in the Spark runner).
>
> By the way, a while ago I created BEAM-3437 to track the PoC/PR
> around this.
>
> Thanks !
>
> Regards
> JB
>
> On 01/29/2018 02:08 AM, Reuven Lax wrote:
> > Previously I submitted a proposal for adding schemas as a
> > first-class concept on Beam PCollections. The proposal engendered
> > quite a bit of discussion from the community - more discussion than
> > I've seen from almost any of our proposals to date!
> >
> > Based on the feedback and comments, I reworked the proposal
> > document quite a bit. It now talks more explicitly about the
> > difference between dynamic schemas (where the schema is not fully
> > known at graph-creation time) and static schemas (which are fully
> > known at graph-creation time). The proposed APIs are more fleshed
> > out now (again thanks to feedback from community members), and the
> > document talks in more detail about evolving schemas in long-running
> > streaming pipelines.
> >
> > Please take a look. I think this will be very valuable to Beam,
> > and I welcome any feedback.
> >
> >
> > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
> >
> > Reuven
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com