I don't think "hint" is the right API, as a schema is not a hint (it has semantic meaning). However, I think the API for setting a schema should look similar to any "hint" API.
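To make the shape of that API concrete, here is a rough sketch of what a hint-style schema transform could look like. This is purely hypothetical - nothing here is designed or implemented; SetSchema, Schema, and SchemaCoder are placeholder names, and the coder trick is just the prototype approach Reuven mentions below:

    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PCollection;

    // Hypothetical: attach a schema to any PCollection, hint-style,
    // e.g. pc.apply(SetSchema.of(schema)).
    public class SetSchema<T> extends PTransform<PCollection<T>, PCollection<T>> {
      private final Schema schema; // placeholder schema type

      private SetSchema(Schema schema) {
        this.schema = schema;
      }

      public static <T> SetSchema<T> of(Schema schema) {
        return new SetSchema<>(schema);
      }

      @Override
      public PCollection<T> expand(PCollection<T> input) {
        // Prototype approach: carry the schema on a special coder so that
        // PCollection itself doesn't need to change; longer term it would
        // be optional metadata on the PCollection.
        return input.setCoder(SchemaCoder.of(schema, input.getCoder()));
      }
    }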
On Wed, Jan 31, 2018 at 11:40 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> Le 31 janv. 2018 20:16, "Reuven Lax" <re...@google.com> a écrit :
>
> As to the question of how a schema should be specified, I want to support several common schema formats. So if a user has a JSON schema, or an Avro schema, or a Calcite schema, etc., there should be adapters that allow setting a schema from any of them. I don't think we should prefer one over the other. While Romain is right that many people know JSON, I think far fewer people know JSON schemas.
>
> Agreed, but schemas would get an API for Beam usage - I don't think there is a standard we can use, and we can't use any vendor-specific API in Beam - so not a big deal IMO / not a blocker.
>
> Agreed, schemas should not be enforced (for one thing, that wouldn't be backwards compatible!). I think for the initial prototype I will probably use a special coder to represent the schema (with setSchema an option on the coder), largely because it doesn't require modifying PCollection. However, I think longer term a schema should be an optional piece of metadata on the PCollection object. Similar to the previous discussion about "hints," I think this can be set on the producing PTransform, and a SetSchema PTransform will allow attaching a schema to any PCollection (i.e. pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I think schemas should be similar to hints: just another piece of metadata on the PCollection (though something interpreted by the model, where hints are interpreted by the runner).
>
> Schemas should probably be contributable from the transform when mandatory - thinking of Avro IO here - or a hint as a fallback when optional. This sounds good to me and doesn't require another public API than hints.
>
> Reuven
>
> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> Hi,
>>
>> I think we should avoid mixing two things in the discussion (and so in the document):
>>
>> 1. The elements of the collection and the schema itself are two different things. In essence, Beam should not enforce any schema. That's why I think it's a good idea to set the schema optionally on the PCollection (pcollection.setSchema()).
>>
>> 2. From point 1 come two questions: how do we represent a schema? How can we leverage the schema to simplify the serialization of the elements in the PCollection, and querying? These two questions are not directly related.
>>
>> 2.1. How do we represent the schema?
>> JSON Schema is a very interesting idea. It could be an abstraction, and other providers, like Avro, could be bound to it. It's part of the JSON processing spec (javax).
>>
>> 2.2. How do we leverage the schema for querying and serialization?
>> Also in the spec, JSON Pointer is interesting for querying. Regarding serialization, Jackson or another data binder can be used.
>>
>> These are still rough ideas in my mind, but I like Romain's idea about json-p usage.
>>
>> Once the 2.3.0 release is out, I will start updating the document with those ideas, and a PoC.
>>
>> Thanks !
>> Regards
>> JB
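To make JB's json-p point concrete: a minimal sketch of JsonObject plus JsonPointer, using only the javax.json API (JSON-P 1.1); the record shape is made up:

    import javax.json.Json;
    import javax.json.JsonObject;
    import javax.json.JsonPointer;

    public class JsonPointerSketch {
      public static void main(String[] args) {
        // A generic record built with JSON-P - no schema required.
        JsonObject user = Json.createObjectBuilder()
            .add("name", "alice")
            .add("address", Json.createObjectBuilder()
                .add("city", "Paris"))
            .build();

        // JSON Pointer (RFC 6901) is the standard querying/projection
        // mechanism mentioned above.
        JsonPointer city = Json.createPointer("/address/city");
        System.out.println(city.getValue(user)); // prints "Paris"
      }
    }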
>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>> >
>> > Le 30 janv. 2018 01:09, "Reuven Lax" <re...@google.com> a écrit :
>> >
>> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>> >
>> > Hi
>> >
>> > I have some questions on this: how would hierarchical schemas work? It seems that is not really supported by the ecosystem (outside of custom stuff) :(. How would it integrate smoothly with other generic record types - N bridges?
>> >
>> > Do you mean nested schemas? What do you mean here?
>> >
>> > Yes, sorry - wrote the mail too late ;). I meant hierarchical data and nested schemas.
>> >
>> > Concretely, I wonder if using a JSON API couldn't be beneficial: json-p is a nice generic abstraction with a built-in querying mechanism (JsonPointer) but no mandated serialization (even if JSON and binary JSON are very natural). The big advantage is to have a well-known ecosystem - who doesn't know JSON today? - that Beam can reuse for free: JsonObject (I guess we don't want the JsonValue abstraction) for the record type, the JSON Schema standard for the schema, JsonPointer for selection/projection, etc. It doesn't enforce the actual serialization (JSON, Smile, Avro, ...) but provides an expressive and already known API, so I see it as a big win-win for users (no need to learn a new API and use N bridges in all ways) and for Beam (implementations are there and the API design is already thought out).
>> >
>> > I assume you're talking about the API for setting schemas, not using them. JSON has many downsides, and I'm not sure it's true that everyone knows it; there are also competing schema APIs, such as Avro, etc. However, I think we should give JSON a fair evaluation before dismissing it.
>> >
>> > It is a wider topic than schemas. Actually, schemas are not the first-class citizen here, a generic data representation is. That is where JSON beats almost any other API. Then, when it comes to schemas, JSON has a standard for that, so we are all good.
>> >
>> > Also, JSON has a good indexing API compared to alternatives, which are sometimes a bit faster - for no-op transforms - but are hardly usable or make the code not that readable.
>> >
>> > Avro is a nice competitor, and it is compatible - actually Avro is JSON-driven by design - but its API is far from being that easy, due to its schema enforcement, which is heavy, and worse, you can't work with Avro without a schema. JSON would allow reconciling the dynamic and static cases, since the job wouldn't change except for the setSchema call.
>> >
>> > That is why I think JSON is a good compromise: having a standard API for it allows fully customizing the implementation at will if needed - even using Avro or Protobuf.
>> >
>> > Side note on the Beam API: I don't think it is good to use a main API for runner optimization. It enforces something to be shared by all runners but not widely usable. It is also misleading for users. Would you set a Flink pipeline option with Dataflow? My proposal here is to use hints - properties - instead of something rigidly defined in the API, then standardize it if all runners support it.
>> >
>> > Wdyt?
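Romain's point about Avro's schema enforcement is easy to see in code. A rough contrast, assuming the stock Avro GenericRecord API; the record shape is made up:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class AvroVsJsonP {
      public static void main(String[] args) {
        // Avro: a Schema must exist before a record can even be constructed.
        Schema schema = SchemaBuilder.record("User").fields()
            .requiredString("name")
            .endRecord();
        GenericRecord avroUser = new GenericData.Record(schema);
        avroUser.put("name", "alice");

        // JSON-P: the same record with no schema at all - the dynamic case
        // works unchanged, and a schema can still be attached later.
        javax.json.JsonObject jsonUser = javax.json.Json.createObjectBuilder()
            .add("name", "alice")
            .build();

        System.out.println(avroUser);
        System.out.println(jsonUser);
      }
    }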
>> > Le 29 janv. 2018 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> a écrit :
>> >
>> > Hi Reuven,
>> >
>> > Thanks for the update ! As I'm working with you on this, I fully agree - great doc gathering the ideas.
>> >
>> > It's clearly something we have to add asap in Beam, because it would allow new use cases for our users (in a simple way) and open new areas for the runners (for instance, DataFrame support in the Spark runner).
>> >
>> > By the way, a while ago I created BEAM-3437 to track the PoC/PR around this.
>> >
>> > Thanks !
>> >
>> > Regards
>> > JB
>> >
>> > On 01/29/2018 02:08 AM, Reuven Lax wrote:
>> > > Previously I submitted a proposal for adding schemas as a first-class concept on Beam PCollections. The proposal engendered quite a bit of discussion from the community - more discussion than I've seen from almost any of our proposals to date!
>> > >
>> > > Based on the feedback and comments, I reworked the proposal document quite a bit. It now talks more explicitly about the difference between dynamic schemas (where the schema is not fully known at graph-creation time) and static schemas (which are fully known at graph-creation time). Proposed APIs are more fleshed out now (again, thanks to feedback from community members), and the document talks in more detail about evolving schemas in long-running streaming pipelines.
>> > >
>> > > Please take a look. I think this will be very valuable to Beam, and I welcome any feedback.
>> > >
>> > > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>> > >
>> > > Reuven
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com