Re: Schema-Aware PCollections revisited

Reuven Lax Mon, 05 Feb 2018 23:25:17 -0800

On Mon, Feb 5, 2018 at 9:06 PM, Kenneth Knowles <k...@google.com> wrote:


> Joining late, but very interested. Commented on the doc. Since there's a
> forked discussion between doc and thread, I want to say this on the thread:
>
> 1. I have used JSON schema in production for describing the structure of
> analytics events and it is OK but not great. If you are sure your data is
> only JSON, use it. For Beam the hierarchical structure is meaningful while
> the atomic pieces should be existing coders. When we integrate with SQL
> that can get more specific.
>

Even if your input data is JSON, you probably don't want Beam's internal
representation to be JSON. Experience shows that this can increase the cost
of a pipeline by an order of magnitude, and in fact it is one of the reasons
we removed source coders (users would accidentally set a JSON coder
throughout their pipeline, causing major problems).
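
A rough way to see where part of that overhead comes from: JSON repeats
every field name in every element, while a schema-aware encoding keeps the
names once, in the schema, and ships only the values. The sketch below is
plain Python, not Beam's coder; the event fields and the fixed struct
layout are made up for illustration.

```python
import json
import struct

# Hypothetical event record; the field names are illustrative only.
event = {"user_id": 1234567890123, "latency_ms": 42, "score": 0.75}

# JSON encoding: field names travel with every single element.
json_bytes = json.dumps(event).encode("utf-8")

# Schema-aware encoding: the schema (names, order, types) is stored once,
# outside the data, and each element is just the packed values.
SCHEMA = (("user_id", "q"), ("latency_ms", "i"), ("score", "d"))
FORMAT = "<" + "".join(code for _, code in SCHEMA)


def encode(record):
    """Pack the record's values in schema order."""
    return struct.pack(FORMAT, *(record[name] for name, _ in SCHEMA))


def decode(data):
    """Rebuild the record by zipping schema names with unpacked values."""
    values = struct.unpack(FORMAT, data)
    return {name: value for (name, _), value in zip(SCHEMA, values)}


packed = encode(event)
print(len(json_bytes), len(packed))
```

Size is only one part of the cost; parsing text on every decode adds CPU
time on top of it.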


>
> 2. Overall, I found the discussion and doc a bit short on use cases. I can
> propose a few:
>

Good call - I'll add a use-cases section.


>
>  - incoming topic of events from clients (at various levels of upgrade /
> schema adherence)
>  - async update of client and pipeline in the above
>  - archive of files that parse to a POJO of known schema, or archive of
> all of the above
>  - SQL integration / columnar operation with all of the above
>  - autogenerated UI integration with all of the above
>
> My impression is that the design will nail SQL integration and
> autogenerated UI but will leave compatibility/evolution concerns for later.
> IMO this is smart as they are much harder.
>

If we care about streaming pipelines, we need some degree of evolution
support (at least "unknown-field" support).
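
As an illustration of what "unknown-field" support could mean: a stage
built against an older schema keeps the fields it does not recognize and
re-emits them untouched, so a newer producer does not break a running
pipeline. This is a minimal Python sketch; KNOWN_FIELDS and both helper
functions are hypothetical and not part of any Beam API.

```python
import json

# Fields the (older) pipeline stage was built against; hypothetical.
KNOWN_FIELDS = {"user_id": int, "action": str}


def decode(raw):
    """Split a JSON payload into typed known fields and opaque unknown ones."""
    payload = json.loads(raw)
    known = {
        name: typ(payload[name])
        for name, typ in KNOWN_FIELDS.items()
        if name in payload
    }
    unknown = {k: v for k, v in payload.items() if k not in KNOWN_FIELDS}
    return known, unknown


def encode(known, unknown):
    """Re-emit the record, carrying unknown fields through unchanged."""
    return json.dumps({**known, **unknown}, sort_keys=True)


# A newer client added "device"; the old stage still processes the record.
raw = '{"user_id": 7, "action": "click", "device": "tablet"}'
known, unknown = decode(raw)
known["action"] = known["action"].upper()
out = encode(known, unknown)
```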


>
> Kenn
>
> On Mon, Feb 5, 2018 at 1:55 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> None in particular: JSON-P (the spec, so no specific implementation is
>> required) as the record API, plus a light custom wrapper for the schema, like
>> https://github.com/Talend/component-runtime/blob/master/component-form/component-form-model/src/main/java/org/talend/sdk/component/form/model/jsonschema/JsonSchema.java
>> (note this code is used for something else), or a plain JsonObject, which
>> should be sufficient.
>>
>> side note: Apache Johnzon would probably be happy to host an enriched
>> schema module based on jsonp if you feel it better this way.
>>
>>
>> On Feb 5, 2018 at 21:43, "Reuven Lax" <re...@google.com> wrote:
>>
>> Which json library are you thinking of? At least in Java, there's always
>> been a problem of no good standard Json library.
>>
>>
>>
>> On Mon, Feb 5, 2018 at 12:03 PM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>>
>>>
>>> On Feb 5, 2018 at 19:54, "Reuven Lax" <re...@google.com> wrote:
>>>
>>> multiplying by 1.0 doesn't really solve the right problems. The number
>>> type used by Javascript (and by extension, the standard for json) only has
>>> 53 bits of precision. I've seen many, many bugs caused by this -
>>> the input data may easily contain numbers too large for 53 bits.
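
To make the 53-bit limit concrete, here is a small Python sketch. It
assumes a consumer that stores numbers as IEEE 754 doubles, as JavaScript
does, and it reduces a "multipleOf: 1.0"-style constraint to a plain
modulo test, which simplifies what a real JSON Schema validator does.

```python
import json

big = 2**53 + 1  # 9007199254740993, still fits in a signed 64-bit integer

# A JavaScript-style number is an IEEE 754 double, so the value silently
# rounds to the nearest representable neighbour.
as_double = float(big)
lost = int(as_double) != big

# The integrality check still passes on the already-rounded value,
# so it cannot detect the corruption.
passes_integer_check = as_double % 1.0 == 0.0

# Python's json module keeps integers exact unless told to parse as float.
exact = json.loads("9007199254740993")
rounded = json.loads("9007199254740993", parse_int=float)
```

Both `lost` and `passes_integer_check` come out true here: the value is
wrong, yet it still looks like an integer.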
>>>
>>>
>>> You have no alternative to a string in the end, whatever schema you use,
>>> so I'm not sure it is an issue. At least if the runtime is Java or another
>>> mainstream language.
>>>
>>>
>>>
>>> In addition, Beam's schema representation must be no less general than
>>> other common representations. For the case of an ETL pipeline, if input
>>> fields are integers the output fields should also be numbers. We shouldn't
>>> turn them into floats because the schema class we used couldn't distinguish
>>> between ints and floats. If anything, Avro schemas are a better fit here as
>>> they are more general.
>>>
>>>
>>> This is what the previous definition does. Avro is not better, for 2 reasons:
>>>
>>> 1. Its dependency stack is a clear blocker, and please don't even mention
>>> yet another uncontrolled shade in the API. Until Avro becomes API-only and
>>> not an implementation, it is a bad fit for Beam.
>>> 2. It must be JSON-friendly, so you are back to JSON + metadata, so a
>>> jsonschema + extension entry is strictly equivalent and just as typed.
>>>
>>>
>>>
>>> Reuven
>>>
>>> On Sun, Feb 4, 2018 at 9:31 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> You can handle integers using multipleOf: 1.0, IIRC.
>>>> Yes, limitations remain, but it is a good starting model and, to be
>>>> honest, it is good enough. No single model will fit every case, even if
>>>> slightly more complex models can go a little further.
>>>> That said, the idea is to enrich the model with a Beam object that
>>>> would allow completing the metadata as required when needed (never?).
>>>>
>>>>
>>>>
>>>> Romain Manni-Bucau
>>>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>> <http://rmannibucau.wordpress.com> | Github
>>>> <https://github.com/rmannibucau> | LinkedIn
>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>
>>>> 2018-02-04 18:21 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>>>
>>>>> Sorry guys, I was off today. Happy to be part of the party too ;)
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 02/04/2018 06:19 PM, Reuven Lax wrote:
>>>>> > Romain, since you're interested maybe the two of us should put together a
>>>>> > proposal for how to set these things (hints, schema) on PCollections? I
>>>>> > don't think it'll be hard - the previous list thread on hints already
>>>>> > agreed on a general approach, and we would just need to flesh it out.
>>>>> >
>>>>> > BTW in the past when I looked, Json schemas seemed to have some odd
>>>>> > limitations inherited from Javascript (e.g. no distinction between integer
>>>>> > and floating-point types). Is that still true?
>>>>> >
>>>>> > Reuven
>>>>> >
>>>>> > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>> >
>>>>> >
>>>>> >
>>>>> >     2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>> >
>>>>> >
>>>>> >
>>>>> >         On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>> >
>>>>> >
>>>>> >             2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>> >
>>>>> >                 I'm not sure where proto comes from here. Proto is one example
>>>>> >                 of a type that has a schema, but only one example.
>>>>> >
>>>>> >                 1. In the initial prototype I want to avoid modifying the
>>>>> >                 PCollection API. So I think it's best to create a special
>>>>> >                 SchemaCoder, and pass the schema into this coder. Later we might
>>>>> >                 add targeted APIs for this instead of going through a coder.
>>>>> >                 1.a I don't see what hints have to do with this?
>>>>> >
>>>>> >
>>>>> >             Hints are a way to replace the new API and unify the way to pass
>>>>> >             metadata in beam instead of adding a new custom way each time.
>>>>> >
>>>>> >
>>>>> >         I don't think schema is a hint. But I hear what you're saying - hint is a
>>>>> >         type of PCollection metadata as is schema, and we should have a unified
>>>>> >         API for setting such metadata.
>>>>> >
>>>>> >
>>>>> >     :), Ismael pointed out to me earlier this week that "hint" had an old
>>>>> >     meaning in Beam. My usage is purely the one found in most EE specs (your
>>>>> >     "metadata" in the previous answer). But I guess we are aligned on the
>>>>> >     meaning now, just wanted to be sure.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >                 2. BeamSQL already has a generic record type which fits this use
>>>>> >                 case very well (though we might modify it). However as mentioned
>>>>> >                 in the doc, the user is never forced to use this generic record
>>>>> >                 type.
>>>>> >
>>>>> >
>>>>> >             Well, yes and no. A type already exists, but 1. it is very strictly
>>>>> >             limited (flat/columns only, which is a small part of what big data
>>>>> >             SQL can do) and 2. it must be aligned with the convergence on
>>>>> >             generic data that the schema will bring (really read "aligned" as
>>>>> >             "dropped in favor of" - deprecation being a smooth way to do it).
>>>>> >
>>>>> >
>>>>> >         As I said the existing class needs to be modified and extended, and not
>>>>> >         just for this schema use case. It was meant to represent Calcite SQL
>>>>> >         rows, but doesn't quite even do that yet (Calcite supports nested rows).
>>>>> >         However I think it's the right basis to start from.
>>>>> >
>>>>> >
>>>>> >     Agree on the state. The current impl issues I hit (in addition to nested
>>>>> >     support, which would by itself require a kind of visitor solution) are
>>>>> >     that the record owns the schema and that serialization is handled field
>>>>> >     by field instead of as a whole, which is how it would be handled with a
>>>>> >     schema IMHO.
>>>>> >
>>>>> >     Concretely, what I don't want is to do a PoC which works - they all work,
>>>>> >     right? - and integrate it into Beam without thinking about a global
>>>>> >     solution for this generic record issue and its schema standardization.
>>>>> >     This is where Json(-P) has a lot of value IMHO, but it requires a bit
>>>>> >     more love than just adding schema in the model.
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >             So long story short, the main work of this schema track is not only
>>>>> >             about using schemas in runners and elsewhere but also about starting
>>>>> >             to make Beam consistent with itself, which is probably the most
>>>>> >             important outcome since it is the user-facing side of this work.
>>>>> >
>>>>> >
>>>>> >
>>>>> >                 On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>> >
>>>>> >                     @Reuven: is the proto only about passing schema or also the
>>>>> >                     generic type?
>>>>> >
>>>>> >                     There are 2.5 topics to solve this issue:
>>>>> >
>>>>> >                     1. How to pass schema
>>>>> >                     1.a. hints?
>>>>> >                     2. What is the generic record type associated with a schema
>>>>> >                     and how to express a schema relative to it
>>>>> >
>>>>> >                     I would be happy to help on 1.a and 2 somehow if you need.
>>>>> >
>>>>> >                     On Feb 4, 2018 at 03:30, "Reuven Lax" <re...@google.com> wrote:
>>>>> >
>>>>> >                         One more thing. If anyone here has experience with
>>>>> >                         various OSS metadata stores (e.g. Kafka Schema Registry
>>>>> >                         is one example), would you like to collaborate on
>>>>> >                         implementation? I want to make sure that source schemas
>>>>> >                         can be stored in a variety of OSS metadata stores, and
>>>>> >                         be easily pulled into a Beam pipeline.
>>>>> >
>>>>> >                         Reuven
>>>>> >
>>>>> >                         On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax <re...@google.com> wrote:
>>>>> >
>>>>> >                             Hi all,
>>>>> >
>>>>> >                             If there are no concerns, I would like to start
>>>>> >                             working on a prototype. It's just a prototype, so I
>>>>> >                             don't think it will have the final API (e.g. for the
>>>>> >                             prototype I'm going to avoid changing the API of
>>>>> >                             PCollection, and use a "special" Coder instead).
>>>>> >                             Also even once we go beyond prototype, it will be
>>>>> >                             @Experimental for some time, so the API will not be
>>>>> >                             fixed in stone.
>>>>> >
>>>>> >                             Any more comments on this approach before we start
>>>>> >                             implementing a prototype?
>>>>> >
>>>>> >                             Reuven
>>>>> >
>>>>> >                             On Wed, Jan 31, 2018 at 1:12 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>> >
>>>>> >                                 If you need help on the json part I'm happy to
>>>>> >                                 help. To give a few hints on what is very
>>>>> >                                 doable: we can add an avro module to johnzon
>>>>> >                                 (asf json{p,b} impl) to back jsonp by avro
>>>>> >                                 (guess it will be one of the first to be asked)
>>>>> >                                 for instance.
>>>>> >
>>>>> >
>>>>> >
>>>>> >                                 2018-01-31 22:06 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>> >
>>>>> >                                     Agree. The initial implementation will be a
>>>>> >                                     prototype.
>>>>> >
>>>>> >                                     On Wed, Jan 31, 2018 at 12:21 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>> >
>>>>> >                                         Hi Reuven,
>>>>> >
>>>>> >                                         Agree on being able to describe the schema
>>>>> >                                         in different formats. The good point about
>>>>> >                                         json schemas is that they are described by
>>>>> >                                         a spec. My point is also to avoid
>>>>> >                                         reinventing the wheel. Just an abstraction
>>>>> >                                         able to use Avro, Json, Calcite, or custom
>>>>> >                                         schema descriptors would be great.
>>>>> >
>>>>> >                                         Using a coder to describe a schema sounds
>>>>> >                                         like a smart move to implement quickly.
>>>>> >                                         However, it has to be clear in terms of
>>>>> >                                         documentation to avoid "side effects". I
>>>>> >                                         still think PCollection.setSchema() is
>>>>> >                                         better: it should be metadata (or hint
>>>>> >                                         ;))) on the PCollection.
>>>>> >
>>>>> >                                         Regards
>>>>> >                                         JB
>>>>> >
>>>>> >                                         On 31/01/2018 20:16, Reuven Lax wrote:
>>>>> >
>>>>> >                                             As to the question of how a schema
>>>>> >                                             should be specified, I want to support
>>>>> >                                             several common schema formats. So if a
>>>>> >                                             user has a Json schema, or an Avro
>>>>> >                                             schema, or a Calcite schema, etc. there
>>>>> >                                             should be adapters that allow setting a
>>>>> >                                             schema from any of them. I don't think
>>>>> >                                             we should prefer one over the other.
>>>>> >                                             While Romain is right that many people
>>>>> >                                             know Json, I think far fewer people
>>>>> >                                             know Json schemas.
>>>>> >
>>>>> >                                             Agree, schemas should not be enforced
>>>>> >                                             (for one thing, that wouldn't be
>>>>> >                                             backwards compatible!). I think for the
>>>>> >                                             initial prototype I will probably use a
>>>>> >                                             special coder to represent the schema
>>>>> >                                             (with setSchema an option on the
>>>>> >                                             coder), largely because it doesn't
>>>>> >                                             require modifying PCollection. However
>>>>> >                                             I think longer term a schema should be
>>>>> >                                             an optional piece of metadata on the
>>>>> >                                             PCollection object. Similar to the
>>>>> >                                             previous discussion about "hints," I
>>>>> >                                             think this can be set on the producing
>>>>> >                                             PTransform, and a SetSchema PTransform
>>>>> >                                             will allow attaching a schema to any
>>>>> >                                             PCollection (i.e.
>>>>> >                                             pc.apply(SetSchema.of(schema))). This
>>>>> >                                             part isn't designed yet, but I think
>>>>> >                                             schema should be similar to hints, it's
>>>>> >                                             just another piece of metadata on the
>>>>> >                                             PCollection (though something
>>>>> >                                             interpreted by the model, where hints
>>>>> >                                             are interpreted by the runner)
>>>>> >
>>>>> >                                             Reuven
>>>>> >
>>>>> >                                             On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>> >
>>>>> >                                                 Hi,
>>>>> >
>>>>> >                                                 I think we should avoid mixing two things in
>>>>> >                                                 the discussion (and so the document):
>>>>> >
>>>>> >                                                 1. The element of the collection and the
>>>>> >                                                 schema itself are two different things. By
>>>>> >                                                 essence, Beam should not enforce any schema.
>>>>> >                                                 That's why I think it's a good idea to set the
>>>>> >                                                 schema optionally on the PCollection
>>>>> >                                                 (pcollection.setSchema()).
>>>>> >
>>>>> >                                                 2. From point 1 come two questions: how do we
>>>>> >                                                 represent a schema? How can we leverage the
>>>>> >                                                 schema to simplify the serialization of the
>>>>> >                                                 element in the PCollection and query? These
>>>>> >                                                 two questions are not directly related.
>>>>> >
>>>>> >                                                   2.1 How do we represent the schema
>>>>> >                                                 Json Schema is a very interesting idea. It
>>>>> >                                                 could be an abstraction, and other providers,
>>>>> >                                                 like Avro, can be bound to it. It's part of the
>>>>> >                                                 json processing spec (javax).
>>>>> >
>>>>> >                                                   2.2. How do we leverage the schema for query
>>>>> >                                                   and serialization
>>>>> >                                                 Also in the spec, json pointer is interesting
>>>>> >                                                 for the querying. Regarding the serialization,
>>>>> >                                                 jackson or other data binder can be used.
>>>>> >
>>>>> >                                                 These are still rough ideas in my mind, but I
>>>>> >                                                 like Romain's idea about json-p usage.
>>>>> >
>>>>> >                                                 Once 2.3.0 release is out, I will start to
>>>>> >                                                 update the document with those ideas, and PoC.
>>>>> >
>>>>> >                                                 Thanks !
>>>>> >                                                 Regards
>>>>> >                                                 JB
>>>>> >
>>>>> >                                                 Thanks !
>>>>> >                                                 Regards
>>>>> >                                                 JB
>>>>> >
>>>>> >                                                 On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>>>>> >                                                 >
>>>>> >                                                 >
>>>>> >                                                 > On Jan 30, 2018 at 01:09, "Reuven Lax" <re...@google.com> wrote:
>>>>> >                                                 >
>>>>> >                                                 >
>>>>> >                                                 >
>>>>> >                                                 >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>> >                                                  >
>>>>> >                                                  >         Hi
>>>>> >                                                  >
>>>>> >                                                  >         I have some questions on this: how would
>>>>> >                                                  >         hierarchical schemas work? It seems this is
>>>>> >                                                  >         not really supported by the ecosystem (apart
>>>>> >                                                  >         from custom stuff) :(. How would it integrate
>>>>> >                                                  >         smoothly with other generic record types - N
>>>>> >                                                  >         bridges?
>>>>> >                                                  >
>>>>> >                                                  >
>>>>> >                                                  >     Do you mean nested schemas? What do you mean here?
>>>>> >                                                  >
>>>>> >                                                  >
>>>>> >                                                  > Yes, sorry - wrote the mail too late ;). I meant
>>>>> >                                                  > hierarchical data and nested schemas.
>>>>> >                                                  >
>>>>> >                                                  >
>>>>> >                                                  >         Concretely, I wonder if using the json API
>>>>> >                                                  >         couldn't be beneficial: json-p is a nice generic
>>>>> >                                                  >         abstraction with a built-in querying mechanism
>>>>> >                                                  >         (jsonpointer) but no actual serialization (even
>>>>> >                                                  >         if json and binary json are very natural). The
>>>>> >                                                  >         big advantage is to have a well-known ecosystem
>>>>> >                                                  >         - who doesn't know json today? - that beam can
>>>>> >                                                  >         reuse for free: JsonObject (guess we don't want
>>>>> >                                                  >         the JsonValue abstraction) for the record type,
>>>>> >                                                  >         the jsonschema standard for the schema,
>>>>> >                                                  >         jsonpointer for the selection/projection, etc.
>>>>> >                                                  >         It doesn't enforce the actual serialization
>>>>> >                                                  >         (json, smile, avro, ...) but provides an
>>>>> >                                                  >         expressive and already known API, so I see it as
>>>>> >                                                  >         a big win-win for users (no need to learn a new
>>>>> >                                                  >         API and use N bridges in all directions) and
>>>>> >                                                  >         beam (impls are here and the API design is
>>>>> >                                                  >         already thought out).
>>>>> >                                                  >
>>>>> >                                                  >
>>>>> >                                                  >     I assume you're talking about the API for setting
>>>>> >                                                  >     schemas, not using them. Json has many downsides
>>>>> >                                                  >     and I'm not sure it's true that everyone knows it;
>>>>> >                                                  >     there are also competing schema APIs, such as Avro
>>>>> >                                                  >     etc. However I think we should give Json a fair
>>>>> >                                                  >     evaluation before dismissing it.
>>>>> >                                                  >
>>>>> >                                                  >
>>>>> > > It is a wider topic than schemas. Actually, schemas are not the
>>>>> > > first-class citizen; a generic data representation is. That is where
>>>>> > > json beats almost any other API. Then, when it comes to schemas, json
>>>>> > > has a standard for that, so we are all good.
>>>>> > >
>>>>> > > Also, json has a good indexing API compared to alternatives, which are
>>>>> > > sometimes a bit faster - for noop transforms - but are hardly usable
>>>>> > > or make the code not that readable.
>>>>> >                                                  >
>>>>> > > Avro is a nice competitor, and it is compatible - actually avro is
>>>>> > > json driven by design - but its API is far from being that easy, due
>>>>> > > to its schema enforcement, which is heavy; worse, you can't work with
>>>>> > > avro without a schema. Json would allow us to reconcile the dynamic
>>>>> > > and static cases, since the job wouldn't change except for the
>>>>> > > setschema.
>>>>> > >
>>>>> > > That is why I think json is a good compromise, and having a standard
>>>>> > > API for it allows fully customizing the impl at will if needed - even
>>>>> > > using avro or protobuf.
>>>>> >                                                  >
>>>>> > > Side note on the Beam API: I don't think it is good to use a main API
>>>>> > > for runner optimization. It enforces something to be shared across
>>>>> > > all runners but not widely usable. It is also misleading for users.
>>>>> > > Would you set a flink pipeline option with dataflow? My proposal here
>>>>> > > is to use hints - properties - instead of something hard-coded in the
>>>>> > > API, then standardize it if all runners support it.
>>>>> >                                                  >
>>>>> >                                                  >
>>>>> >                                                  >
>>>>> >                                                  >         Wdyt?
>>>>> >                                                  >
>>>>> > >         On 29 Jan 2018 06:24, "Jean-Baptiste Onofré"
>>>>> > >         <j...@nanthrax.net> wrote:
>>>>> >
>>>>> >                                                  >
>>>>> > >             Hi Reuven,
>>>>> > >
>>>>> > >             Thanks for the update! As I'm working with you on this, I
>>>>> > >             fully agree, and it's a great doc gathering the ideas.
>>>>> > >
>>>>> > >             It's clearly something we have to add asap in Beam, because
>>>>> > >             it would allow new use cases for our users (in a simple way)
>>>>> > >             and open new areas for the runners (for instance dataframe
>>>>> > >             support in the Spark runner).
>>>>> > >
>>>>> > >             By the way, a while ago, I created BEAM-3437 to track the
>>>>> > >             PoC/PR around this.
>>>>> > >
>>>>> > >             Thanks!
>>>>> > >
>>>>> > >             Regards
>>>>> > >             JB
>>>>> >                                                  >
>>>>> > >             On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>>>> > >             > Previously I submitted a proposal for adding schemas as a
>>>>> > >             > first-class concept on Beam PCollections. The proposal
>>>>> > >             > engendered quite a bit of discussion from the community -
>>>>> > >             > more discussion than I've seen from almost any of our
>>>>> > >             > proposals to date!
>>>>> > >             >
>>>>> > >             > Based on the feedback and comments, I reworked the proposal
>>>>> > >             > document quite a bit. It now talks more explicitly about
>>>>> > >             > the difference between dynamic schemas (where the schema is
>>>>> > >             > not fully known at graph-creation time), and static schemas
>>>>> > >             > (which are fully known at graph-creation time). Proposed
>>>>> > >             > APIs are more fleshed out now (again thanks to feedback
>>>>> > >             > from community members), and the document talks in more
>>>>> > >             > detail about evolving schemas in long-running streaming
>>>>> > >             > pipelines.
>>>>> > >             >
>>>>> > >             > Please take a look. I think this will be very valuable to
>>>>> > >             > Beam, and welcome any feedback.
>>>>> >                                                  >             >
>>>>> >                                                  >             >
>>>>> >                                                  >
>>>>> >
>>>>> >
>>>>> > >             > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>>>> > >             >
>>>>> > >             > Reuven
>>>>> >                                                  >
>>>>> > >             --
>>>>> > >             Jean-Baptiste Onofré
>>>>> > >             jbono...@apache.org
>>>>> > >             http://blog.nanthrax.net
>>>>> > >             Talend - http://www.talend.com
>>>>> >                                                  >
>>>>> >                                                  >
>>>>> >                                                  >
>>>>> >
>>>>> >     --
>>>>> >     Jean-Baptiste Onofré
>>>>> >     jbono...@apache.org
>>>>> >     http://blog.nanthrax.net
>>>>> >     Talend - http://www.talend.com
>>>>> >
>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> jbono...@apache.org
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>
