Re: Schema-Aware PCollections revisited

Reuven Lax Mon, 05 Feb 2018 12:43:55 -0800

Which JSON library are you thinking of? At least in Java, the lack of a
good standard JSON library has always been a problem.




On Mon, Feb 5, 2018 at 12:03 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
>
> On Feb 5, 2018 19:54, "Reuven Lax" <re...@google.com> wrote:
>
> Using multipleOf: 1.0 doesn't really solve the right problem. The number
> type used by JavaScript (and by extension, the standard for JSON) only has
> 53 bits of precision. I've seen many, many bugs caused by this:
> the input data may easily contain numbers too large for 53 bits.
>
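
As a rough illustration of the 53-bit point above, here is a minimal Java
sketch (the class and method names are made up for this example and are not
part of Beam or of any JSON library). It assumes a parser that holds every
JSON number as a double, and shows that an integer just above 2^53 changes
value on the way through, while a multipleOf: 1.0 style check still accepts it.

```java
// Sketch: why a 64-bit integer can be silently altered by a double-based JSON number.
// All names here are hypothetical and exist only for this illustration.
class DoublePrecisionSketch {

    // True if the long value is unchanged after being stored in a double,
    // which is how JavaScript holds every number.
    static boolean survivesDoubleRoundTrip(long value) {
        return (long) (double) value == value;
    }

    // A multipleOf: 1.0 style check: true if the double has no fractional part.
    static boolean isMultipleOfOne(double value) {
        return value % 1.0 == 0.0;
    }

    public static void main(String[] args) {
        long safe = 1L << 53;         // 2^53, exactly representable as a double
        long unsafe = (1L << 53) + 1; // 2^53 + 1, needs 54 significant bits

        System.out.println(survivesDoubleRoundTrip(safe));    // true
        System.out.println(survivesDoubleRoundTrip(unsafe));  // false
        // The integrality check passes even though the value was altered.
        System.out.println(isMultipleOfOne((double) unsafe)); // true
    }
}
```

So an integrality constraint can validate that a number has no fractional
part, but it cannot tell that the number is no longer the one in the input.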
>
> In the end you have no alternative to a string for such values, whatever
> schema you use, so I'm not sure it is an issue. At least if the runtime is
> Java or another mainstream language.
>
>
>
> In addition, Beam's schema representation must be no less general than
> other common representations. For the case of an ETL pipeline, if input
> fields are integers the output fields should also be numbers. We shouldn't
> turn them into floats because the schema class we used couldn't distinguish
> between ints and floats. If anything, Avro schemas are a better fit here as
> they are more general.
>
>
> This is what the previous definition does. Avro is not better, for 2 reasons:
>
> 1. Its dependency stack is a clear blocker, and please don't even speak of
> yet another uncontrolled shade in the API. Until Avro becomes an API only
> and not an impl, it is a bad fit for Beam.
> 2. It must be JSON friendly, so you are back to JSON + metadata, and a
> jsonschema + extension entry is strictly equivalent and just as typed.
>
>
>
> Reuven
>
> On Sun, Feb 4, 2018 at 9:31 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> You can handle integers using multipleOf: 1.0 IIRC.
>> Yes, the limitations are still there, but it is a good starting model and,
>> to be honest, it is good enough. No single model will work perfectly, even
>> if you can go a little bit further with other, slightly more complex models.
>> That said, the idea is to enrich the model with a Beam object which would
>> allow completing the metadata as required when needed (never?).
>>
>>
>>
>> Romain Manni-Bucau
>>
>> 2018-02-04 18:21 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>
>>> Sorry guys, I was off today. Happy to be part of the party too ;)
>>>
>>> Regards
>>> JB
>>>
>>> On 02/04/2018 06:19 PM, Reuven Lax wrote:
>>> > Romain, since you're interested, maybe the two of us should put together
>>> > a proposal for how to set these things (hints, schema) on PCollections?
>>> > I don't think it'll be hard - the previous list thread on hints already
>>> > agreed on a general approach, and we would just need to flesh it out.
>>> >
>>> > BTW in the past when I looked, JSON schemas seemed to have some odd
>>> > limitations inherited from JavaScript (e.g. no distinction between
>>> > integer and floating-point types). Is that still true?
>>> >
>>> > Reuven
>>> >
>>> > On Sun, Feb 4, 2018 at 9:12 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com
>>> > <mailto:rmannibu...@gmail.com>> wrote:
>>> >
>>> >
>>> >
>>> >     2018-02-04 17:53 GMT+01:00 Reuven Lax <re...@google.com
>>> >     <mailto:re...@google.com>>:
>>> >
>>> >
>>> >
>>> >         On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau
>>> >         <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
>>> >
>>> >
>>> >             2018-02-04 17:37 GMT+01:00 Reuven Lax <re...@google.com
>>> >             <mailto:re...@google.com>>:
>>> >
>>> >                 I'm not sure where proto comes from here. Proto is one
>>> >                 example of a type that has a schema, but only one example.
>>> >
>>> >                 1. In the initial prototype I want to avoid modifying the
>>> >                 PCollection API. So I think it's best to create a special
>>> >                 SchemaCoder, and pass the schema into this coder. Later we
>>> >                 might add targeted APIs for this instead of going through
>>> >                 a coder.
>>> >                 1.a I don't see what hints have to do with this?
>>> >
>>> >
>>> >             Hints are a way to replace the new API and unify the way to
>>> >             pass metadata in Beam instead of adding a new custom way each
>>> >             time.
>>> >
>>> >
>>> >         I don't think schema is a hint. But I hear what you're saying:
>>> >         a hint is a type of PCollection metadata, as is a schema, and we
>>> >         should have a unified API for setting such metadata.
>>> >
>>> >
>>> >     :), Ismael pointed out to me earlier this week that "hint" had an old
>>> >     meaning in Beam. My usage is purely the one found in most EE specs
>>> >     (your "metadata" in the previous answer). But I guess we are aligned
>>> >     on the meaning now; I just wanted to be sure.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >                 2. BeamSQL already has a generic record type which fits
>>> >                 this use case very well (though we might modify it).
>>> >                 However, as mentioned in the doc, the user is never forced
>>> >                 to use this generic record type.
>>> >
>>> >
>>> >             Well, yes and no. A type already exists, but 1. it is very
>>> >             strictly limited (flat/columns only, which is very little of
>>> >             what big data SQL can do) and 2. it must be aligned with the
>>> >             convergence of generic data that the schema will bring (really
>>> >             read "aligned" as "dropped in favor of", deprecation being a
>>> >             smooth way to do it).
>>> >
>>> >
>>> >         As I said, the existing class needs to be modified and extended,
>>> >         and not just for this schema use case. It was meant to represent
>>> >         Calcite SQL rows, but doesn't quite even do that yet (Calcite
>>> >         supports nested rows). However, I think it's the right basis to
>>> >         start from.
>>> >
>>> >
>>> >     Agreed on the state. The current impl issues I hit (in addition to
>>> >     nested support, which would by itself require a kind of visitor
>>> >     solution) are that the record owns the schema and that serialization
>>> >     is handled field by field instead of as a whole, which is how it
>>> >     would be handled with a schema IMHO.
>>> >
>>> >     Concretely, what I don't want is to do a PoC which works (they all
>>> >     work, right?) and integrate it into Beam without thinking about a
>>> >     global solution for this generic record issue and its schema
>>> >     standardization. This is where Json(-P) has a lot of value IMHO, but
>>> >     it requires a bit more love than just adding a schema to the model.
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >             So, long story short, the main work of this schema track is
>>> >             not only on using schemas in runners and other ways but also
>>> >             on starting to make Beam consistent with itself, which is
>>> >             probably the most important outcome since it is the
>>> >             user-facing side of this work.
>>> >
>>> >
>>> >
>>> >                 On Sun, Feb 4, 2018 at 12:22 AM, Romain Manni-Bucau
>>> >                 <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>>
>>> wrote:
>>> >
>>> >                     @Reuven: is the proto only about passing the schema,
>>> >                     or also the generic type?
>>> >
>>> >                     There are 2.5 topics to solve for this issue:
>>> >
>>> >                     1. How to pass the schema
>>> >                     1.a. hints?
>>> >                     2. What is the generic record type associated with a
>>> >                     schema, and how to express a schema relative to it
>>> >
>>> >                     I would be happy to help on 1.a and 2 somehow if you
>>> >                     need.
>>> >
>>> >                     On Feb 4, 2018 03:30, "Reuven Lax" <re...@google.com
>>> >                     <mailto:re...@google.com>> wrote:
>>> >
>>> >                         One more thing. If anyone here has experience with
>>> >                         various OSS metadata stores (e.g. Kafka Schema
>>> >                         Registry is one example), would you like to
>>> >                         collaborate on implementation? I want to make sure
>>> >                         that source schemas can be stored in a variety of
>>> >                         OSS metadata stores, and be easily pulled into a
>>> >                         Beam pipeline.
>>> >
>>> >                         Reuven
>>> >
>>> >                         On Sat, Feb 3, 2018 at 6:28 PM, Reuven Lax
>>> >                         <re...@google.com <mailto:re...@google.com>>
>>> wrote:
>>> >
>>> >                             Hi all,
>>> >
>>> >                             If there are no concerns, I would like to start
>>> >                             working on a prototype. It's just a prototype, so
>>> >                             I don't think it will have the final API (e.g.
>>> >                             for the prototype I'm going to avoid changing
>>> >                             the API of PCollection, and use a "special"
>>> >                             Coder instead). Also, even once we go beyond the
>>> >                             prototype, it will be @Experimental for some
>>> >                             time, so the API will not be set in stone.
>>> >
>>> >                             Any more comments on this approach before we
>>> >                             start implementing a prototype?
>>> >
>>> >                             Reuven
>>> >
>>> >                             On Wed, Jan 31, 2018 at 1:12 PM, Romain
>>> Manni-Bucau
>>> >                             <rmannibu...@gmail.com
>>> >                             <mailto:rmannibu...@gmail.com>> wrote:
>>> >
>>> >                                 If you need help on the json part, I'm happy
>>> >                                 to help. To give a few hints on what is very
>>> >                                 doable: we can, for instance, add an avro
>>> >                                 module to johnzon (asf json{p,b} impl) to
>>> >                                 back jsonp by avro (I guess it will be one
>>> >                                 of the first things to be asked for).
>>> >
>>> >
>>> >                                 Romain Manni-Bucau
>>> >
>>> >                                 2018-01-31 22:06 GMT+01:00 Reuven Lax
>>> >                                 <re...@google.com <mailto:
>>> re...@google.com>>:
>>> >
>>> >                                     Agree. The initial implementation will
>>> >                                     be a prototype.
>>> >
>>> >                                     On Wed, Jan 31, 2018 at 12:21 PM,
>>> >                                     Jean-Baptiste Onofré <
>>> j...@nanthrax.net
>>> >                                     <mailto:j...@nanthrax.net>> wrote:
>>> >
>>> >                                         Hi Reuven,
>>> >
>>> >                                         Agreed on being able to describe the
>>> >                                         schema in different formats. The good
>>> >                                         point about json schemas is that they
>>> >                                         are described by a spec. My point is
>>> >                                         also to avoid reinventing the wheel.
>>> >                                         Just an abstraction able to use Avro,
>>> >                                         Json, Calcite, or custom schema
>>> >                                         descriptors would be great.
>>> >
>>> >                                         Using a coder to describe a schema
>>> >                                         sounds like a smart move to implement
>>> >                                         quickly. However, it has to be clear
>>> >                                         in terms of documentation to avoid
>>> >                                         "side effects". I still think
>>> >                                         PCollection.setSchema() is better: it
>>> >                                         should be metadata (or hint ;))) on
>>> >                                         the PCollection.
>>> >
>>> >                                         Regards
>>> >                                         JB
>>> >
>>> >                                         On 31/01/2018 20:16, Reuven
>>> Lax wrote:
>>> >
>>> >                                             As to the question of how a schema
>>> >                                             should be specified, I want to
>>> >                                             support several common schema
>>> >                                             formats. So if a user has a Json
>>> >                                             schema, or an Avro schema, or a
>>> >                                             Calcite schema, etc., there should
>>> >                                             be adapters that allow setting a
>>> >                                             schema from any of them. I don't
>>> >                                             think we should prefer one over
>>> >                                             the other. While Romain is right
>>> >                                             that many people know Json, I
>>> >                                             think far fewer people know Json
>>> >                                             schemas.
>>> >
>>> >                                             Agreed, schemas should not be
>>> >                                             enforced (for one thing, that
>>> >                                             wouldn't be backwards compatible!).
>>> >                                             I think for the initial prototype
>>> >                                             I will probably use a special
>>> >                                             coder to represent the schema
>>> >                                             (with setSchema an option on the
>>> >                                             coder), largely because it doesn't
>>> >                                             require modifying PCollection.
>>> >                                             However, I think longer term a
>>> >                                             schema should be an optional piece
>>> >                                             of metadata on the PCollection
>>> >                                             object. Similar to the previous
>>> >                                             discussion about "hints," I think
>>> >                                             this can be set on the producing
>>> >                                             PTransform, and a SetSchema
>>> >                                             PTransform will allow attaching a
>>> >                                             schema to any PCollection (i.e.
>>> >                                             pc.apply(SetSchema.of(schema))).
>>> >                                             This part isn't designed yet, but
>>> >                                             I think schema should be similar
>>> >                                             to hints: it's just another piece
>>> >                                             of metadata on the PCollection
>>> >                                             (though something interpreted by
>>> >                                             the model, whereas hints are
>>> >                                             interpreted by the runner).
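
To make the "optional piece of metadata" idea above a little more concrete,
here is a toy Java sketch. Nothing in it is Beam code: Schema,
SketchCollection and withSchema are hypothetical stand-ins invented for this
illustration, and the real design (SchemaCoder, SetSchema) was not settled at
this point in the thread. It only shows a collection that works without a
schema and can have one attached later, loosely mirroring
pc.apply(SetSchema.of(schema)).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Toy model only: none of these types are Beam classes.
class SchemaMetadataSketch {

    // Hypothetical schema: ordered field name -> type name.
    static final class Schema {
        final Map<String, String> fields = new LinkedHashMap<>();

        Schema field(String name, String type) {
            fields.put(name, type);
            return this;
        }
    }

    // Hypothetical stand-in for a PCollection whose schema is optional metadata.
    static final class SketchCollection {
        private final String name;
        private final Schema schema; // null when no schema was attached

        SketchCollection(String name) {
            this(name, null);
        }

        private SketchCollection(String name, Schema schema) {
            this.name = name;
            this.schema = schema;
        }

        // Returns a new collection carrying the schema; the original is untouched,
        // so pipelines that never set a schema keep working as before.
        SketchCollection withSchema(Schema newSchema) {
            return new SketchCollection(name, newSchema);
        }

        Optional<Schema> schema() {
            return Optional.ofNullable(schema);
        }
    }

    public static void main(String[] args) {
        SketchCollection raw = new SketchCollection("events");
        SketchCollection typed =
            raw.withSchema(new Schema().field("id", "INT64").field("name", "STRING"));

        System.out.println(raw.schema().isPresent());   // no schema attached
        System.out.println(typed.schema().isPresent()); // schema attached
    }
}
```

Keeping the schema optional is what preserves backwards compatibility in this
sketch: code that only ever builds the schema-less collection is unaffected.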
>>> >
>>> >                                             Reuven
>>> >
>>> >                                             On Tue, Jan 30, 2018 at
>>> 1:37 AM,
>>> >                                             Jean-Baptiste Onofré
>>> >                                             <j...@nanthrax.net
>>> >                                             <mailto:j...@nanthrax.net>
>>> >                                             <mailto:j...@nanthrax.net
>>> >                                             <mailto:j...@nanthrax.net>>>
>>> wrote:
>>> >
>>> >                                                 Hi,
>>> >
>>> >                                                 I think we should avoid mixing
>>> >                                                 two things in the discussion
>>> >                                                 (and so the document):
>>> >
>>> >                                                 1. The element of the
>>> >                                                 collection and the schema
>>> >                                                 itself are two different
>>> >                                                 things. In essence, Beam
>>> >                                                 should not enforce any schema.
>>> >                                                 That's why I think it's a good
>>> >                                                 idea to set the schema
>>> >                                                 optionally on the PCollection
>>> >                                                 (pcollection.setSchema()).
>>> >
>>> >                                                 2. From point 1 come two
>>> >                                                 questions: how do we represent
>>> >                                                 a schema? How can we leverage
>>> >                                                 the schema to simplify the
>>> >                                                 serialization of the elements
>>> >                                                 in the PCollection and
>>> >                                                 querying? These two questions
>>> >                                                 are not directly related.
>>> >
>>> >                                                   2.1. How do we represent the
>>> >                                                   schema
>>> >                                                 Json Schema is a very
>>> >                                                 interesting idea. It could be
>>> >                                                 an abstraction, and other
>>> >                                                 providers, like Avro, can be
>>> >                                                 bound to it. It's part of the
>>> >                                                 json processing spec (javax).
>>> >
>>> >                                                   2.2. How do we leverage the
>>> >                                                   schema for query and
>>> >                                                   serialization
>>> >                                                 Also in the spec, json pointer
>>> >                                                 is interesting for the
>>> >                                                 querying. Regarding the
>>> >                                                 serialization, jackson or
>>> >                                                 another data binder can be
>>> >                                                 used.
>>> >
>>> >                                                 These are still rough ideas in
>>> >                                                 my mind, but I like Romain's
>>> >                                                 idea about json-p usage.
>>> >
>>> >                                                 Once the 2.3.0 release is out,
>>> >                                                 I will start to update the
>>> >                                                 document with those ideas,
>>> >                                                 and a PoC.
>>> >
>>> >                                                 Thanks !
>>> >                                                 Regards
>>> >                                                 JB
>>> >
>>> >                                                 On 01/30/2018 08:42
>>> AM, Romain
>>> >                                             Manni-Bucau wrote:
>>> >                                                 >
>>> >                                                 >
>>> >                                                 > On Jan 30, 2018 01:09, "Reuven Lax" <re...@google.com> wrote:
>>> >                                                 >
>>> >                                                 >
>>> >                                                 >
>>> >                                                 >     On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>> >                                                  >
>>> >                                                  >         Hi
>>> >                                                  >
>>> >                                                  >         I have some questions on this:
>>> >                                                  >         how would hierarchical schemas
>>> >                                                  >         work? It seems they are not
>>> >                                                  >         really supported by the
>>> >                                                  >         ecosystem (outside of custom
>>> >                                                  >         stuff) :(. How would it
>>> >                                                  >         integrate smoothly with other
>>> >                                                  >         generic record types - N
>>> >                                                  >         bridges?
>>> >                                                  >
>>> >                                                  >     Do you mean nested schemas? What
>>> >                                                  >     do you mean here?
>>> >                                                  >
>>> >                                                  >
>>> >                                                  > Yes, sorry - I wrote the mail too
>>> >                                                  > late ;). I meant hierarchical data
>>> >                                                  > and nested schemas.
>>> >                                                  >
>>> >                                                  >
>>> >                                                  >         Concretely I wonder if using
>>> >                                                  >         the json API couldn't be
>>> >                                                  >         beneficial: json-p is a nice
>>> >                                                  >         generic abstraction with a
>>> >                                                  >         built-in querying mechanism
>>> >                                                  >         (jsonpointer) but no actual
>>> >                                                  >         serialization (even if json
>>> >                                                  >         and binary json are very
>>> >                                                  >         natural). The big advantage is
>>> >                                                  >         to have a well-known ecosystem
>>> >                                                  >         - who doesn't know json today?
>>> >                                                  >         - that beam can reuse for
>>> >                                                  >         free: JsonObject (guess we
>>> >                                                  >         don't want the JsonValue
>>> >                                                  >         abstraction) for the record
>>> >                                                  >         type, the jsonschema standard
>>> >                                                  >         for the schema, jsonpointer
>>> >                                                  >         for the selection/projection
>>> >                                                  >         etc... It doesn't enforce the
>>> >                                                  >         actual serialization (json,
>>> >                                                  >         smile, avro, ...) but provides
>>> >                                                  >         an expressive and already
>>> >                                                  >         known API, so I see it as a
>>> >                                                  >         big win-win for users (no need
>>> >                                                  >         to learn a new API and use N
>>> >                                                  >         bridges in all ways) and beam
>>> >                                                  >         (impls are here and the API
>>> >                                                  >         design is already thought
>>> >                                                  >         out).
>>> >                                                  >
>>> >                                                  >
>>> >                                                  >     I assume you're talking about the
>>> >                                                  >     API for setting schemas, not using
>>> >                                                  >     them. Json has many downsides and
>>> >                                                  >     I'm not sure it's true that
>>> >                                                  >     everyone knows it; there are also
>>> >                                                  >     competing schema APIs, such as
>>> >                                                  >     Avro etc. However, I think we
>>> >                                                  >     should give Json a fair evaluation
>>> >                                                  >     before dismissing it.
>>> >                                                  >
>>> >                                                  >
>>> >                                                  > It is a wider topic than schema.
>>> >                                                  > Actually schemas are not the
>>> >                                                  > first-class citizen; a generic data
>>> >                                                  > representation is. That is where json
>>> >                                                  > beats almost any other API. Then, when
>>> >                                                  > it comes to schema, json has a
>>> >                                                  > standard for that, so we are all good.
>>> >                                                  >
>>> >                                                  > Also, json has a good indexing API
>>> >                                                  > compared to alternatives, which are
>>> >                                                  > sometimes a bit faster - for noop
>>> >                                                  > transforms - but are hardly usable or
>>> >                                                  > make the code not that readable.
>>> >                                                  >
>>> >                                                  > Avro is a nice competitor, and it is
>>> >                                                  > compatible - actually avro is json
>>> >                                                  > driven by design - but its API is far
>>> >                                                  > from being that easy, due to its
>>> >                                                  > schema enforcement, which is very
>>> >                                                  > heavy, and worse, you can't work with
>>> >                                                  > avro without a schema. Json would
>>> >                                                  > allow reconciling the dynamic and
>>> >                                                  > static cases, since the job wouldn't
>>> >                                                  > change except for the setSchema call.
>>> >                                                  >
>>> >                                                  > That is why I think json is a good
>>> >                                                  > compromise, and having a standard API
>>> >                                                  > for it allows fully customizing the
>>> >                                                  > impl at will if needed - even using
>>> >                                                  > avro or protobuf.
>>> >                                                  >
>>> >                                                  > Side note on the beam API: I don't
>>> >                                                  > think it is good to use a main API for
>>> >                                                  > runner optimization. It enforces
>>> >                                                  > something to be shared by all runners
>>> >                                                  > but not widely usable. It is also
>>> >                                                  > misleading for users. Would you set a
>>> >                                                  > flink pipeline option with dataflow?
>>> >                                                  > My proposal here is to use hints -
>>> >                                                  > properties - instead of something
>>> >                                                  > hard-coded in the API, then
>>> >                                                  > standardize it if all runners support
>>> >                                                  > it.
>>> >                                                  >
>>> >                                                  >
>>> >                                                  >         Wdyt?
>>> >                                                  >
>>> > > On 29 Jan 2018 at 06:24, "Jean-Baptiste Onofré"
>>> > > <j...@nanthrax.net> wrote:
>>> > >
>>> > >     Hi Reuven,
>>> > >
>>> > >     Thanks for the update! As I'm working with you on this, I fully
>>> > >     agree, and it is a great doc gathering the ideas.
>>> > >
>>> > >     It's clearly something we have to add to Beam as soon as
>>> > >     possible, because it would allow new use cases for our users (in
>>> > >     a simple way) and open new areas for the runners (for instance,
>>> > >     DataFrame support in the Spark runner).
>>> > >
>>> > >     By the way, a while ago I created BEAM-3437 to track the PoC/PR
>>> > >     around this.
>>> > >
>>> > >     Thanks!
>>> > >
>>> > >     Regards
>>> > >     JB
>>> > >
>>> > >     On 01/29/2018 02:08 AM, Reuven Lax wrote:
>>> > >     > Previously I submitted a proposal for adding schemas as a
>>> > >     > first-class concept on Beam PCollections. The proposal
>>> > >     > engendered quite a bit of discussion from the community - more
>>> > >     > discussion than I've seen from almost any of our proposals to
>>> > >     > date!
>>> > >     >
>>> > >     > Based on the feedback and comments, I reworked the proposal
>>> > >     > document quite a bit. It now talks more explicitly about the
>>> > >     > difference between dynamic schemas (where the schema is not
>>> > >     > fully known at graph-creation time) and static schemas (which
>>> > >     > are fully known at graph-creation time). Proposed APIs are more
>>> > >     > fleshed out now (again thanks to feedback from community
>>> > >     > members), and the document talks in more detail about evolving
>>> > >     > schemas in long-running streaming pipelines.
>>> > >     >
>>> > >     > Please take a look. I think this will be very valuable to
>>> > >     > Beam, and welcome any feedback.
>>> > >     >
>>> > >     > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
>>> > >     >
>>> > >     > Reuven
>>> > >
>>> > >     --
>>> > >     Jean-Baptiste Onofré
>>> > >     jbono...@apache.org
>>> > >     http://blog.nanthrax.net
>>> > >     Talend - http://www.talend.com
>>> >
>>> >     --
>>> >     Jean-Baptiste Onofré
>>> >     jbono...@apache.org
>>> >     http://blog.nanthrax.net
>>> >     Talend - http://www.talend.com
>>> >
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
>
>
