If you need help on the JSON part, I'm happy to contribute. To give a few
hints on what is very doable: we can, for instance, add an Avro module to
Johnzon (the ASF JSON-P/JSON-B implementation) to back JSON-P with Avro (I
guess it will be one of the first things to be asked for).
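
To make this concrete, here is a minimal JSON-P sketch (javax.json, as
implemented by Johnzon); a hypothetical Avro-backed provider would be picked
up through the standard javax.json.spi.JsonProvider SPI, so user code like
this would not change:

    import javax.json.Json;
    import javax.json.JsonObject;

    public class JsonpSketch {
        public static void main(String[] args) {
            // Build a generic record with the portable JSON-P API; the
            // provider behind it (Johnzon today, hypothetically an
            // Avro-backed one tomorrow) is resolved via the SPI.
            JsonObject record = Json.createObjectBuilder()
                .add("name", "beam")
                .add("version", "2.3.0")
                .build();
            System.out.println(record);
        }
    }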


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:

> Agree. The initial implementation will be a prototype.
>
> On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Hi Reuven,
>>
>> Agree on being able to describe the schema with different formats. The good
>> point about JSON schemas is that they are described by a spec. My point is
>> also to avoid reinventing the wheel. An abstraction that lets us plug in
>> Avro, JSON, Calcite, or custom schema descriptors would be great.
>>
>> Using a coder to describe a schema sounds like a smart move to implement
>> quickly. However, it has to be clearly documented to avoid "side effects".
>> I still think PCollection.setSchema() is better: the schema should be
>> metadata (or a hint ;)) on the PCollection.
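>>
>> For illustration, a rough sketch of the two approaches (SchemaCoder and
>> setSchema() are both hypothetical here, neither exists in Beam yet):
>>
>>     // Coder-based (quick to prototype, but easy to misread as a
>>     // serialization concern): hypothetical schema-carrying coder.
>>     pcollection.setCoder(SchemaCoder.of(schema, elementCoder));
>>
>>     // Metadata-based (what I'd prefer): the schema as an optional
>>     // hint on the PCollection itself.
>>     pcollection.setSchema(schema);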
>>
>> Regards
>> JB
>>
>> On 31/01/2018 20:16, Reuven Lax wrote:
>>
>>> As to the question of how a schema should be specified, I want to
>>> support several common schema formats. So if a user has a Json schema, an
>>> Avro schema, a Calcite schema, etc., there should be adapters that
>>> allow setting a schema from any of them. I don't think we should prefer one
>>> over the other. While Romain is right that many people know Json, I think
>>> far fewer people know Json schemas.
>>>
>>> Agree, schemas should not be enforced (for one thing, that wouldn't be
>>> backwards compatible!). I think for the initial prototype I will probably
>>> use a special coder to represent the schema (with setSchema as an option on
>>> the coder), largely because it doesn't require modifying PCollection.
>>> However, I think that longer term a schema should be an optional piece of
>>> metadata on the PCollection object. Similar to the previous discussion
>>> about "hints," I think this can be set on the producing PTransform, and a
>>> SetSchema PTransform will allow attaching a schema to any PCollection (i.e.
>>> pc.apply(SetSchema.of(schema))). This part isn't designed yet, but I
>>> think a schema should be similar to hints: it's just another piece of
>>> metadata on the PCollection (though something interpreted by the model,
>>> where hints are interpreted by the runner).
>>>
>>> Reuven
>>>
>>> On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>
>>>     Hi,
>>>
>>>     I think we should avoid mixing two things in the discussion (and so
>>>     in the document):
>>>
>>>     1. The element of the collection and the schema itself are two
>>>     different things. In essence, Beam should not enforce any schema.
>>>     That's why I think it's a good idea to set the schema optionally on
>>>     the PCollection (pcollection.setSchema()).
>>>
>>>     2. From point 1 come two questions: how do we represent a schema?
>>>     How can we leverage the schema to simplify the serialization of the
>>>     elements in the PCollection and to query them? These two questions
>>>     are not directly related.
>>>
>>>       2.1 How do we represent the schema?
>>>     JSON Schema is a very interesting idea. It could be the abstraction
>>>     onto which other providers, like Avro, can be bound. It complements
>>>     the JSON Processing spec (javax).
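>>>
>>>     For instance, a minimal JSON Schema document describing a two-field
>>>     record could look like this (purely illustrative):
>>>
>>>       {
>>>         "type": "object",
>>>         "properties": {
>>>           "name": { "type": "string" },
>>>           "age":  { "type": "integer" }
>>>         },
>>>         "required": ["name"]
>>>       }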
>>>
>>>       2.2 How do we leverage the schema for query and serialization?
>>>     Also in the spec, JSON Pointer is interesting for querying. Regarding
>>>     serialization, Jackson or another data binder can be used.
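>>>
>>>     As a small sketch of the querying side with plain JSON-P 1.1 (real
>>>     javax.json API, nothing Beam-specific):
>>>
>>>         import javax.json.Json;
>>>         import javax.json.JsonObject;
>>>         import javax.json.JsonValue;
>>>
>>>         public class PointerSketch {
>>>             public static void main(String[] args) {
>>>                 JsonObject order = Json.createObjectBuilder()
>>>                     .add("customer", Json.createObjectBuilder()
>>>                         .add("name", "beam"))
>>>                     .build();
>>>                 // JSON Pointer (RFC 6901) addresses nested fields,
>>>                 // i.e. the projection/selection mechanism above.
>>>                 JsonValue name = Json.createPointer("/customer/name")
>>>                     .getValue(order);
>>>                 System.out.println(name);
>>>             }
>>>         }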
>>>
>>>     These are still rough ideas in my mind, but I like Romain's idea
>>>     about JSON-P usage.
>>>
>>>     Once the 2.3.0 release is out, I will start updating the document
>>>     with those ideas, and work on a PoC.
>>>
>>>     Thanks !
>>>     Regards
>>>     JB
>>>
>>>     On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>     >
>>>     >
>>>     > On 30 Jan 2018 at 01:09, "Reuven Lax" <re...@google.com> wrote:
>>>     >
>>>     >
>>>     >
>>>      >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
>>>      >     <rmannibu...@gmail.com> wrote:
>>>      >
>>>      >         Hi
>>>      >
>>>      >         I have some questions on this: how would hierarchical
>>>      >         schemas work? It seems they are not really supported by
>>>      >         the ecosystem (outside of custom stuff) :(. How would it
>>>      >         integrate smoothly with other generic record types - N
>>>      >         bridges?
>>>      >
>>>      >
>>>      >     Do you mean nested schemas? What do you mean here?
>>>      >
>>>      >
>>>      > Yes, sorry - I wrote the mail too late ;). I meant hierarchical
>>>      > data and nested schemas.
>>>      >
>>>      >
>>>      >         Concretely, I wonder if using the JSON API couldn't be
>>>      >         beneficial: JSON-P is a nice generic abstraction with a
>>>      >         built-in querying mechanism (JSON Pointer) but no actual
>>>      >         serialization (even if JSON and binary JSON are very
>>>      >         natural fits). The big advantage is to have a well-known
>>>      >         ecosystem - who doesn't know JSON today? - that Beam can
>>>      >         reuse for free: JsonObject (I guess we don't want the
>>>      >         JsonValue abstraction) for the record type, the JSON
>>>      >         Schema standard for the schema, JSON Pointer for the
>>>      >         selection/projection, etc. It doesn't enforce the actual
>>>      >         serialization (JSON, Smile, Avro, ...) but provides an
>>>      >         expressive and already-known API, so I see it as a big
>>>      >         win-win for users (no need to learn a new API and use N
>>>      >         bridges in all ways) and for Beam (implementations are
>>>      >         already there and the API design has already been
>>>      >         thought through).
>>>      >
>>>      >
>>>      >     I assume you're talking about the API for setting schemas,
>>>      >     not using them. Json has many downsides, and I'm not sure it's
>>>      >     true that everyone knows it; there are also competing schema
>>>      >     APIs, such as Avro, etc. However, I think we should give Json
>>>      >     a fair evaluation before dismissing it.
>>>      >
>>>      >
>>>      > It is a wider topic than schemas. Actually, schemas are not the
>>>      > first-class citizen here; a generic data representation is. That
>>>      > is where JSON beats almost any other API. Then, when it comes to
>>>      > schemas, JSON has a standard for that, so we are all good.
>>>      >
>>>      > Also, JSON has a good indexing API compared to alternatives,
>>>      > which are sometimes a bit faster - for no-op transforms - but are
>>>      > hardly usable or make the code less readable.
>>>      >
>>>      > Avro is a nice competitor, and it is compatible - Avro is
>>>      > actually JSON-driven by design - but its API is far from being
>>>      > that easy, due to its schema enforcement, which is heavy; worse,
>>>      > you can't work with Avro without a schema. JSON would allow
>>>      > reconciling the dynamic and static cases, since the job wouldn't
>>>      > change except for the setSchema call.
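>>>      >
>>>      > To illustrate the schema-first constraint (a minimal sketch using
>>>      > the real Avro generic API, unrelated to any Beam code):
>>>      >
>>>      >     import org.apache.avro.Schema;
>>>      >     import org.apache.avro.generic.GenericData;
>>>      >     import org.apache.avro.generic.GenericRecord;
>>>      >
>>>      >     public class AvroSketch {
>>>      >         public static void main(String[] args) {
>>>      >             // Even a single generic record needs a schema up
>>>      >             // front; there is no schemaless mode.
>>>      >             Schema schema = new Schema.Parser().parse(
>>>      >                 "{\"type\":\"record\",\"name\":\"User\","
>>>      >                 + "\"fields\":[{\"name\":\"name\","
>>>      >                 + "\"type\":\"string\"}]}");
>>>      >             GenericRecord user = new GenericData.Record(schema);
>>>      >             user.put("name", "beam");
>>>      >             System.out.println(user);
>>>      >         }
>>>      >     }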
>>>      >
>>>      > That is why I think JSON is a good compromise: having a standard
>>>      > API for it allows fully customizing the implementation at will if
>>>      > needed - even using Avro or Protobuf.
>>>      >
>>>      > Side note on the Beam API: I don't think it is good to use a core
>>>      > API for runner optimizations. It forces something to be shared by
>>>      > all runners while not being widely usable, and it is also
>>>      > misleading for users. Would you set a Flink pipeline option with
>>>      > Dataflow? My proposal here is to use hints - properties - instead
>>>      > of something hard-wired in the API, and then standardize it once
>>>      > all runners support it.
>>>      >
>>>      >
>>>      >
>>>      >         Wdyt?
>>>      >
>>>      >         On 29 Jan 2018 at 06:24, "Jean-Baptiste Onofré"
>>>      >         <j...@nanthrax.net> wrote:
>>>
>>>      >
>>>      >             Hi Reuven,
>>>      >
>>>      >             Thanks for the update! As I'm working with you on
>>>      >             this, I fully agree - it's a great doc gathering the
>>>      >             ideas.
>>>      >
>>>      >             It's clearly something we have to add ASAP in Beam,
>>>      >             because it would enable new use cases for our users
>>>      >             (in a simple way) and open new areas for the runners
>>>      >             (for instance, dataframe support in the Spark runner).
>>>      >
>>>      >             By the way, a while ago I created BEAM-3437 to track
>>>      >             the PoC/PR around this.
>>>      >
>>>      >             Thanks !
>>>      >
>>>      >             Regards
>>>      >             JB
>>>      >
>>>      >             On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>      >             > Previously I submitted a proposal for adding
>>>      >             > schemas as a first-class concept on Beam
>>>      >             > PCollections. The proposal engendered quite a bit
>>>      >             > of discussion from the community - more discussion
>>>      >             > than I've seen from almost any of our proposals to
>>>      >             > date!
>>>      >             >
>>>      >             > Based on the feedback and comments, I reworked the
>>>      >             > proposal document quite a bit. It now talks more
>>>      >             > explicitly about the difference between dynamic
>>>      >             > schemas (where the schema is not fully known at
>>>      >             > graph-creation time) and static schemas (which are
>>>      >             > fully known at graph-creation time). Proposed APIs
>>>      >             > are more fleshed out now (again thanks to feedback
>>>      >             > from community members), and the document talks in
>>>      >             > more detail about evolving schemas in long-running
>>>      >             > streaming pipelines.
>>>      >             >
>>>      >             > Please take a look. I think this will be very
>>>      >             > valuable to Beam, and welcome any feedback.
>>>      >             >
>>>      >             >
>>>      >
>>>      >             > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>      >             >
>>>      >             > Reuven
>>>      >
>>>      >             --
>>>      >             Jean-Baptiste Onofré
>>>      >             jbono...@apache.org
>>>      > http://blog.nanthrax.net
>>>      >             Talend - http://www.talend.com
>>>      >
>>>      >
>>>      >
>>>
>>>     --
>>>     Jean-Baptiste Onofré
>>>     jbono...@apache.org
>>>     http://blog.nanthrax.net
>>>     Talend - http://www.talend.com
>>>
>>>
>>>
>
