Agree. The initial implementation will be a prototype.

On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi Reuven,
>
> Agreed on being able to describe the schema with different formats. The good
> point about JSON schemas is that they are described by a spec. My point is
> also to avoid reinventing the wheel. An abstraction able to use Avro, JSON,
> Calcite, or custom schema descriptors would be great.
>
> Using a coder to describe a schema sounds like a smart move to implement
> quickly. However, it has to be clearly documented to avoid "side effects". I
> still think PCollection.setSchema() is better: the schema should be metadata
> (or a hint ;)) on the PCollection.
>
> Regards
> JB
>
> On 31/01/2018 20:16, Reuven Lax wrote:
>> As to the question of how a schema should be specified, I want to support
>> several common schema formats. So if a user has a JSON schema, or an Avro
>> schema, or a Calcite schema, etc., there should be adapters that allow
>> setting a schema from any of them. I don't think we should prefer one over
>> the other. While Romain is right that many people know JSON, I think far
>> fewer people know JSON schemas.
>>
>> Agreed, schemas should not be enforced (for one thing, that wouldn't be
>> backwards compatible!). I think for the initial prototype I will probably
>> use a special coder to represent the schema (with setSchema an option on
>> the coder), largely because it doesn't require modifying PCollection.
>> However, I think longer term a schema should be an optional piece of
>> metadata on the PCollection object. Similar to the previous discussion
>> about "hints," I think this can be set on the producing PTransform, and a
>> SetSchema PTransform will allow attaching a schema to any PCollection
>> (i.e. pc.apply(SetSchema.of(schema))).
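[Editor's note: a minimal, self-contained sketch of the schema-as-optional-metadata idea discussed above. All names here (Schema, PCollectionLike, setSchema) are hypothetical stand-ins, not the Beam API, which was not settled at this point in the thread.]

```java
import java.util.List;
import java.util.Optional;

// Toy illustration of "schema as optional metadata on the PCollection",
// as opposed to baking it into a coder. Nothing here is real Beam API.
public class SchemaMetadataSketch {

    // A schema is reduced to an ordered list of field names for this sketch.
    public record Schema(List<String> fieldNames) {}

    // Stand-in for PCollection: elements plus optional schema metadata.
    public static class PCollectionLike<T> {
        public final List<T> elements;
        public Optional<Schema> schema = Optional.empty();

        public PCollectionLike(List<T> elements) {
            this.elements = elements;
        }

        // Rough equivalent of pc.apply(SetSchema.of(schema)): returns a new
        // collection with the schema attached; the elements are untouched.
        public PCollectionLike<T> setSchema(Schema s) {
            PCollectionLike<T> out = new PCollectionLike<>(elements);
            out.schema = Optional.of(s);
            return out;
        }
    }

    public static void main(String[] args) {
        PCollectionLike<String> pc =
            new PCollectionLike<>(List.of("beam,1", "schema,2"));
        PCollectionLike<String> withSchema =
            pc.setSchema(new Schema(List.of("name", "count")));
        // The schema is pure metadata: absent unless explicitly attached.
        System.out.println(pc.schema.isPresent());
        System.out.println(withSchema.schema.orElseThrow().fieldNames());
    }
}
```

The point of the sketch is the backward-compatibility property both participants agree on: a collection without a schema behaves exactly as before, and attaching one never touches the elements themselves.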
>> This part isn't designed yet, but I think schema should be similar to
>> hints: it's just another piece of metadata on the PCollection (though
>> something interpreted by the model, where hints are interpreted by the
>> runner).
>>
>> Reuven
>>
>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>>> Hi,
>>>
>>> I think we should avoid mixing two things in the discussion (and so in
>>> the document):
>>>
>>> 1. The element of the collection and the schema itself are two different
>>> things. By essence, Beam should not enforce any schema. That's why I
>>> think it's a good idea to set the schema optionally on the PCollection
>>> (pcollection.setSchema()).
>>>
>>> 2. From point 1 come two questions: how do we represent a schema? How
>>> can we leverage the schema to simplify the serialization of the element
>>> in the PCollection, and querying? These two questions are not directly
>>> related.
>>>
>>> 2.1 How do we represent the schema?
>>> JSON Schema is a very interesting idea. It could be an abstraction, and
>>> other providers, like Avro, could be bound to it. It's part of the JSON
>>> Processing spec (javax).
>>>
>>> 2.2 How do we leverage the schema for query and serialization?
>>> Also in the spec, JSON Pointer is interesting for querying. Regarding
>>> serialization, Jackson or another data binder can be used.
>>>
>>> These are still rough ideas in my mind, but I like Romain's idea about
>>> json-p usage.
>>>
>>> Once the 2.3.0 release is out, I will start to update the document with
>>> those ideas, and a PoC.
>>>
>>> Thanks!
>>> Regards
>>> JB
>>>
>>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>
>>>> On 30 Jan 2018 01:09, "Reuven Lax" <re...@google.com> wrote:
>>>>
>>>>> On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
>>>>> <rmannibu...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have some questions on this: how would hierarchical schemas work?
>>>>>> It seems they are not really supported by the ecosystem (outside of
>>>>>> custom stuff) :(. How would they integrate smoothly with other
>>>>>> generic record types - N bridges?
>>>>>
>>>>> Do you mean nested schemas? What do you mean here?
>>>>
>>>> Yes, sorry - wrote the mail too late ;). I meant hierarchical data and
>>>> nested schemas.
>>>>
>>>>>> Concretely, I wonder if using a JSON API couldn't be beneficial:
>>>>>> json-p is a nice generic abstraction with a built-in querying
>>>>>> mechanism (JSON Pointer) but no actual serialization (even if JSON
>>>>>> and binary JSON are very natural). The big advantage is having a
>>>>>> well-known ecosystem - who doesn't know JSON today? - that Beam can
>>>>>> reuse for free: JsonObject (I guess we don't want the JsonValue
>>>>>> abstraction) for the record type, the JSON Schema standard for the
>>>>>> schema, JSON Pointer for selection/projection, etc. It doesn't
>>>>>> enforce the actual serialization (JSON, Smile, Avro, ...) but
>>>>>> provides an expressive and already known API, so I see it as a big
>>>>>> win-win for users (no need to learn a new API and use N bridges in
>>>>>> all ways) and for Beam (the impls are here and the API design has
>>>>>> already been thought out).
>>>>>
>>>>> I assume you're talking about the API for setting schemas, not using
>>>>> them. JSON has many downsides and I'm not sure it's true that
>>>>> everyone knows it; there are also competing schema APIs, such as Avro
>>>>> etc. However, I think we should give JSON a fair evaluation before
>>>>> dismissing it.
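[Editor's note: the json-p proposal above leans on JSON Pointer for selection/projection. As a self-contained illustration of that mechanism only (not the javax.json API, which is not shown here), a toy pointer lookup over nested maps might look like this; the class and method names are invented.]

```java
import java.util.Map;

// Toy version of the JSON Pointer idea (RFC 6901): a path string such as
// "/user/name" selects a nested value. Real json-p exposes this as
// JsonPointer over JsonObject; here plain nested Maps stand in for JSON,
// and the ~0/~1 escaping and array-index rules of the spec are omitted.
public class PointerSketch {

    // Resolve a pointer like "/user/name" against nested maps.
    // The empty pointer selects the whole document, as in the spec.
    public static Object resolve(Object doc, String pointer) {
        if (pointer.isEmpty()) {
            return doc;
        }
        Object current = doc;
        // Skip the leading "/" and walk one reference token at a time.
        for (String token : pointer.substring(1).split("/")) {
            if (!(current instanceof Map<?, ?> m) || !m.containsKey(token)) {
                throw new IllegalArgumentException("No value at " + pointer);
            }
            current = m.get(token);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> doc =
            Map.of("user", Map.of("name", "Ada", "id", 42));
        System.out.println(resolve(doc, "/user/name")); // Ada
    }
}
```

This is the "built-in querying mechanism" Romain refers to: projection comes for free from the data representation, with no per-record schema required to evaluate a pointer.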
>>>> It is a wider topic than schema. Actually, schemas are not the
>>>> first-class citizen here; a generic data representation is. That is
>>>> where JSON beats almost any other API. Then, when it comes to schemas,
>>>> JSON has a standard for that, so we are all good.
>>>>
>>>> Also, JSON has a good indexing API compared to alternatives, which are
>>>> sometimes a bit faster - for no-op transforms - but are hardly usable
>>>> or make the code less readable.
>>>>
>>>> Avro is a nice competitor, and it is compatible - actually, Avro is
>>>> JSON-driven by design - but its API is far from easy due to its schema
>>>> enforcement, which is heavy; worse, you can't work with Avro without a
>>>> schema. JSON would allow reconciling the dynamic and static cases,
>>>> since the job wouldn't change except for the setSchema.
>>>>
>>>> That is why I think JSON is a good compromise: having a standard API
>>>> for it allows fully customizing the impl at will if needed - even
>>>> using Avro or Protobuf.
>>>>
>>>> Side note on the Beam API: I don't think it is good to use the main
>>>> API for runner optimization. It enforces something to be shared by all
>>>> runners but not widely usable, and it is also misleading for users.
>>>> Would you set a Flink pipeline option with Dataflow? My proposal here
>>>> is to use hints - properties - instead of something hardly defined in
>>>> the API, then standardize it if all runners support it.
>>>>
>>>>>> Wdyt?
>>>>>>
>>>>>> On 29 Jan 2018 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>>>>>>
>>>>>>> Hi Reuven,
>>>>>>>
>>>>>>> Thanks for the update! As I'm working with you on this, I fully
>>>>>>> agree - great doc gathering the ideas.
>>>>>>> It's clearly something we have to add ASAP in Beam, because it
>>>>>>> would allow new use cases for our users (in a simple way) and open
>>>>>>> new areas for the runners (for instance, dataframe support in the
>>>>>>> Spark runner).
>>>>>>>
>>>>>>> By the way, a while ago I created BEAM-3437 to track the PoC/PR
>>>>>>> around this.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>>>>>
>>>>>>>> Previously I submitted a proposal for adding schemas as a
>>>>>>>> first-class concept on Beam PCollections. The proposal engendered
>>>>>>>> quite a bit of discussion from the community - more discussion
>>>>>>>> than I've seen from almost any of our proposals to date!
>>>>>>>>
>>>>>>>> Based on the feedback and comments, I reworked the proposal
>>>>>>>> document quite a bit. It now talks more explicitly about the
>>>>>>>> difference between dynamic schemas (where the schema is not fully
>>>>>>>> known at graph-creation time) and static schemas (which are fully
>>>>>>>> known at graph-creation time). Proposed APIs are more fleshed out
>>>>>>>> now (again, thanks to feedback from community members), and the
>>>>>>>> document talks in more detail about evolving schemas in
>>>>>>>> long-running streaming pipelines.
>>>>>>>>
>>>>>>>> Please take a look. I think this will be very valuable to Beam,
>>>>>>>> and I welcome any feedback.
>>>>>>>> https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>>>>>>
>>>>>>>> Reuven
>>>>>>>
>>>>>>> --
>>>>>>> Jean-Baptiste Onofré
>>>>>>> jbono...@apache.org
>>>>>>> http://blog.nanthrax.net
>>>>>>> Talend - http://www.talend.com
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
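[Editor's note: the other half of the thread's debate - the "special coder" prototype Reuven mentions - can be sketched in miniature as well. A coder that already carries the schema can serialize rows as bare values, because field names and order travel with the schema rather than with every element. The names and the newline framing below are invented for illustration; this is not Beam's eventual SchemaCoder.]

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy schema-aware coder: the schema (ordered field names) is part of the
// coder, so encoded elements contain only the values, in schema order.
public class SchemaCoderSketch {

    public static byte[] encode(List<String> schema, Map<String, String> row) {
        // Values are written in schema order, separated by '\n' (toy framing;
        // a real coder would use length-prefixed binary encoding).
        StringBuilder sb = new StringBuilder();
        for (String field : schema) {
            sb.append(row.get(field)).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static Map<String, String> decode(List<String> schema, byte[] bytes) {
        String[] values = new String(bytes, StandardCharsets.UTF_8).split("\n", -1);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < schema.size(); i++) {
            // The schema restores the field names the encoding dropped.
            row.put(schema.get(i), values[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> schema = List.of("name", "count");
        Map<String, String> row = Map.of("name", "beam", "count", "7");
        System.out.println(decode(schema, encode(schema, row)));
    }
}
```

This also makes the trade-off in the thread concrete: the coder approach needs no change to PCollection, but the schema ends up entangled with serialization, which is exactly why JB argues for keeping it as separate metadata (a "hint") on the PCollection instead.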