One more thing: if anyone here has experience with various OSS metadata stores (e.g. Kafka Schema Registry), would you like to collaborate on the implementation? I want to make sure that source schemas can be stored in a variety of OSS metadata stores, and be easily pulled into a Beam pipeline.
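One possible shape for such an adapter layer, sketched in plain Java with invented names (this is not an actual registry client API, just an illustration of the kind of interface different metadata stores could plug into):

```java
import java.util.Map;
import java.util.Optional;

public class SchemaStoreSketch {
    // Hypothetical adapter interface: any OSS metadata store (Kafka Schema
    // Registry, Hive Metastore, ...) could implement this so a pipeline can
    // pull a schema by subject name. All names here are invented.
    interface SchemaStore {
        Optional<String> lookup(String subject); // e.g. an Avro or JSON schema string
    }

    // An in-memory stand-in, useful for tests or as the trivial provider.
    static SchemaStore inMemory(Map<String, String> schemas) {
        return subject -> Optional.ofNullable(schemas.get(subject));
    }

    public static void main(String[] args) {
        SchemaStore store = inMemory(Map.of("events-value", "{\"type\":\"record\"}"));
        System.out.println(store.lookup("events-value").isPresent()); // true
    }
}
```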
Reuven

On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:

> Hi all,
>
> If there are no concerns, I would like to start working on a prototype.
> It's just a prototype, so I don't think it will have the final API (e.g.
> for the prototype I'm going to avoid changing the API of PCollection, and
> use a "special" Coder instead). Also, even once we go beyond the
> prototype, it will be @Experimental for some time, so the API will not be
> fixed in stone.
>
> Any more comments on this approach before we start implementing a
> prototype?
>
> Reuven
>
> On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
>> If you need help on the JSON part, I'm happy to help. To give a few hints
>> on what is very doable: we can add an Avro module to Johnzon (the ASF
>> JSON{P,B} implementation) to back JSON-P with Avro, for instance (I guess
>> it will be one of the first to be asked for).
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> | Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau>
>>
>> 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>>
>>> Agreed. The initial implementation will be a prototype.
>>>
>>> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>
>>>> Hi Reuven,
>>>>
>>>> Agreed on being able to describe the schema with different formats. The
>>>> good point about JSON schemas is that they are described by a spec. My
>>>> point is also to avoid reinventing the wheel. An abstraction able to
>>>> use Avro, JSON, Calcite, or custom schema descriptors would be great.
>>>>
>>>> Using a coder to describe a schema sounds like a smart move to
>>>> implement quickly. However, it has to be clear in terms of
>>>> documentation to avoid "side effects".
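The "special Coder" idea discussed above can be sketched in plain Java. All names below are invented stand-ins for illustration, not the actual Beam API: the point is only that a coder can carry a schema alongside its delegate without changing PCollection.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class SchemaCoderSketch {
    // Minimal stand-in for a schema: an ordered list of field names.
    record Schema(List<String> fieldNames) {}

    // Minimal stand-in for a Beam Coder: only the part relevant to the idea.
    interface Coder<T> {
        byte[] encode(T value);
    }

    // The "special" coder: delegates encoding, but also exposes a schema
    // that the model could inspect, without any change to PCollection.
    static final class SchemaCoder<T> implements Coder<T> {
        private final Coder<T> delegate;
        private final Schema schema;

        SchemaCoder(Coder<T> delegate, Schema schema) {
            this.delegate = delegate;
            this.schema = schema;
        }

        @Override public byte[] encode(T value) { return delegate.encode(value); }
        Schema getSchema() { return schema; }
    }

    public static void main(String[] args) {
        Coder<String> utf8 = s -> s.getBytes(StandardCharsets.UTF_8);
        SchemaCoder<String> coder =
            new SchemaCoder<>(utf8, new Schema(List.of("user", "country")));
        System.out.println(coder.getSchema().fieldNames()); // [user, country]
    }
}
```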
>>>> I still think PCollection.setSchema() is better: it should be metadata
>>>> (or a hint ;)) on the PCollection.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 31/01/2018 20:16, Reuven Lax wrote:
>>>>
>>>>> As to the question of how a schema should be specified, I want to
>>>>> support several common schema formats. So if a user has a JSON schema,
>>>>> an Avro schema, a Calcite schema, etc., there should be adapters that
>>>>> allow setting a schema from any of them. I don't think we should
>>>>> prefer one over the others. While Romain is right that many people
>>>>> know JSON, I think far fewer people know JSON schemas.
>>>>>
>>>>> Agreed, schemas should not be enforced (for one thing, that wouldn't
>>>>> be backwards compatible!). For the initial prototype I will probably
>>>>> use a special coder to represent the schema (with setSchema an option
>>>>> on the coder), largely because it doesn't require modifying
>>>>> PCollection. However, I think that longer term a schema should be an
>>>>> optional piece of metadata on the PCollection object. Similar to the
>>>>> previous discussion about "hints," I think this can be set on the
>>>>> producing PTransform, and a SetSchema PTransform will allow attaching
>>>>> a schema to any PCollection (i.e. pc.apply(SetSchema.of(schema))).
>>>>> This part isn't designed yet, but I think a schema should be similar
>>>>> to hints: it's just another piece of metadata on the PCollection
>>>>> (though something interpreted by the model, where hints are
>>>>> interpreted by the runner).
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I think we should avoid mixing two things in the discussion (and so in
>>>>> the document):
>>>>>
>>>>> 1. The elements of the collection and the schema itself are two
>>>>> different things. By essence, Beam should not enforce any schema.
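The pc.apply(SetSchema.of(schema)) pattern proposed above can be sketched with minimal stand-ins for the Beam types. Everything here is hypothetical scaffolding for illustration, not the real Beam API: an identity transform whose only effect is attaching schema metadata to the collection.

```java
import java.util.List;

public class SetSchemaSketch {
    record Schema(List<String> fieldNames) {}

    // Minimal stand-in for PCollection: no elements, just optional schema
    // metadata, as proposed in the thread.
    static final class PCollection<T> {
        Schema schema;
        <OutT> PCollection<OutT> apply(PTransform<T, OutT> t) { return t.expand(this); }
    }

    interface PTransform<InT, OutT> {
        PCollection<OutT> expand(PCollection<InT> input);
    }

    // The proposed transform: identity on elements, sets the schema.
    static <T> PTransform<T, T> setSchema(Schema schema) {
        return input -> {
            PCollection<T> out = new PCollection<>();
            out.schema = schema;
            return out;
        };
    }

    public static void main(String[] args) {
        PCollection<String> pc = new PCollection<>();
        PCollection<String> withSchema =
            pc.apply(setSchema(new Schema(List.of("user"))));
        System.out.println(withSchema.schema.fieldNames()); // [user]
    }
}
```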
>>>>> That's why I think it's a good idea to set the schema optionally on
>>>>> the PCollection (pcollection.setSchema()).
>>>>>
>>>>> 2. From point 1 come two questions: how do we represent a schema? And
>>>>> how can we leverage the schema to simplify the serialization of the
>>>>> elements in the PCollection, and querying? These two questions are not
>>>>> directly related.
>>>>>
>>>>> 2.1. How do we represent the schema?
>>>>> JSON Schema is a very interesting idea. It could be the abstraction,
>>>>> and other providers, like Avro, could be bound to it. It's part of the
>>>>> JSON Processing spec (javax).
>>>>>
>>>>> 2.2. How do we leverage the schema for querying and serialization?
>>>>> Also in the spec, JSON Pointer is interesting for querying. Regarding
>>>>> serialization, Jackson or another data binder can be used.
>>>>>
>>>>> These are still rough ideas in my mind, but I like Romain's idea about
>>>>> JSON-P usage.
>>>>>
>>>>> Once the 2.3.0 release is out, I will start updating the document with
>>>>> those ideas, and a PoC.
>>>>>
>>>>> Thanks!
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>>> >
>>>>> > On Jan 30, 2018 at 01:09, "Reuven Lax" <re...@google.com> wrote:
>>>>> >
>>>>> > On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
>>>>> > <rmannibu...@gmail.com> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I have some questions on this: how would hierarchical schemas work?
>>>>> > It seems they are not really supported by the ecosystem (outside of
>>>>> > custom stuff) :(. How would they integrate smoothly with other
>>>>> > generic record types - N bridges?
>>>>> >
>>>>> > Do you mean nested schemas? What do you mean here?
>>>>> >
>>>>> > Yes, sorry - I wrote the mail too late ;). I meant hierarchical data
>>>>> > and nested schemas.
>>>>> >
>>>>> > Concretely, I wonder if using a JSON API couldn't be beneficial:
>>>>> > JSON-P is a nice generic abstraction with a built-in querying
>>>>> > mechanism (JSON Pointer) but no mandated serialization (even if JSON
>>>>> > and binary JSON are very natural). The big advantage is to have a
>>>>> > well-known ecosystem - who doesn't know JSON today? - that Beam can
>>>>> > reuse for free: JsonObject (I guess we don't want the JsonValue
>>>>> > abstraction) for the record type, the JSON Schema standard for the
>>>>> > schema, JSON Pointer for selection/projection, etc. It doesn't
>>>>> > enforce the actual serialization (JSON, Smile, Avro, ...) but
>>>>> > provides an expressive and already-known API, so I see it as a big
>>>>> > win-win for users (no need to learn a new API and use N bridges in
>>>>> > all ways) and for Beam (implementations already exist and the API
>>>>> > design is already thought out).
>>>>> >
>>>>> > I assume you're talking about the API for setting schemas, not using
>>>>> > them. JSON has many downsides, and I'm not sure it's true that
>>>>> > everyone knows it; there are also competing schema APIs, such as
>>>>> > Avro etc. However, I think we should give JSON a fair evaluation
>>>>> > before dismissing it.
>>>>> >
>>>>> > It is a wider topic than schemas. Actually, schemas are not the
>>>>> > first-class citizen here; a generic data representation is. That is
>>>>> > where JSON beats almost any other API. Then, when it comes to
>>>>> > schemas, JSON has a standard for that, so we are all good.
>>>>> >
>>>>> > Also, JSON has a good indexing API compared to alternatives which
>>>>> > are sometimes a bit faster - for no-op transforms - but are hardly
>>>>> > usable or make the code less readable.
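The JSON Pointer (RFC 6901) selection/projection mechanism mentioned above can be illustrated without a JSON-P implementation. This is a toy resolver over plain Maps and Lists, handling only the happy path; a real JSON-P 1.1 implementation such as Johnzon provides this via its JsonPointer API.

```java
import java.util.List;
import java.util.Map;

public class JsonPointerSketch {
    // Toy RFC 6901 resolver: "/user/name" descends into maps by key and
    // into lists by index. Escapes ~1 -> "/" and ~0 -> "~" per the spec.
    static Object resolve(Object doc, String pointer) {
        if (pointer.isEmpty()) return doc; // "" points at the whole document
        Object current = doc;
        for (String token : pointer.substring(1).split("/", -1)) {
            token = token.replace("~1", "/").replace("~0", "~");
            if (current instanceof Map<?, ?> m) {
                current = m.get(token);
            } else if (current instanceof List<?> l) {
                current = l.get(Integer.parseInt(token));
            } else {
                throw new IllegalArgumentException("Cannot descend into " + current);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> record =
            Map.of("user", Map.of("name", "JB", "langs", List.of("java", "scala")));
        System.out.println(resolve(record, "/user/name"));    // JB
        System.out.println(resolve(record, "/user/langs/1")); // scala
    }
}
```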
>>>>> >
>>>>> > Avro is a nice competitor, and it is compatible - actually, Avro is
>>>>> > JSON-driven by design - but its API is far from easy due to its
>>>>> > schema enforcement, which is heavy; worse, you can't work with Avro
>>>>> > without a schema. JSON would allow reconciling the dynamic and
>>>>> > static cases, since the job wouldn't change except for the setSchema
>>>>> > call.
>>>>> >
>>>>> > That is why I think JSON is a good compromise, and having a standard
>>>>> > API for it allows fully customizing the implementation at will if
>>>>> > needed - even using Avro or Protobuf.
>>>>> >
>>>>> > Side note on the Beam API: I don't think it is good to use the main
>>>>> > API for runner optimization. It enforces something that has to be
>>>>> > shared by all runners but is not widely usable. It is also
>>>>> > misleading for users. Would you set a Flink pipeline option with
>>>>> > Dataflow? My proposal here is to use hints - properties - instead of
>>>>> > something hard-wired in the API, then standardize it if all runners
>>>>> > support it.
>>>>> >
>>>>> > Wdyt?
>>>>> >
>>>>> > On Jan 29, 2018 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net>
>>>>> > wrote:
>>>>> >
>>>>> > Hi Reuven,
>>>>> >
>>>>> > Thanks for the update! As I'm working with you on this, I fully
>>>>> > agree - great doc gathering the ideas.
>>>>> >
>>>>> > It's clearly something we have to add asap in Beam, because it would
>>>>> > allow new use cases for our users (in a simple way) and open new
>>>>> > areas for the runners (for instance, DataFrame support in the Spark
>>>>> > runner).
>>>>> >
>>>>> > By the way, a while ago I created BEAM-3437 to track the PoC/PR
>>>>> > around this.
>>>>> >
>>>>> > Thanks!
>>>>> >
>>>>> > Regards
>>>>> > JB
>>>>> >
>>>>> > On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>>> > > Previously I submitted a proposal for adding schemas as a
>>>>> > > first-class concept on Beam PCollections. The proposal engendered
>>>>> > > quite a bit of discussion from the community - more discussion
>>>>> > > than I've seen on almost any of our proposals to date!
>>>>> > >
>>>>> > > Based on the feedback and comments, I reworked the proposal
>>>>> > > document quite a bit. It now talks more explicitly about the
>>>>> > > difference between dynamic schemas (where the schema is not fully
>>>>> > > known at graph-creation time) and static schemas (which are fully
>>>>> > > known at graph-creation time). Proposed APIs are more fleshed out
>>>>> > > now (again, thanks to feedback from community members), and the
>>>>> > > document talks in more detail about evolving schemas in
>>>>> > > long-running streaming pipelines.
>>>>> > >
>>>>> > > Please take a look. I think this will be very valuable to Beam,
>>>>> > > and I welcome any feedback.
>>>>> > >
>>>>> > > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>>> > >
>>>>> > > Reuven
>>>>> >
>>>>> > --
>>>>> > Jean-Baptiste Onofré
>>>>> > jbono...@apache.org
>>>>> > http://blog.nanthrax.net
>>>>> > Talend - http://www.talend.com