Re: [PROPOSAL] Beam Schema Options

Reuven Lax Fri, 07 Feb 2020 13:56:56 -0800

I disagree - I've had several cases where user options on fields are very
useful internally.


A common rationale is to preserve fidelity. For instance, reading protos,
projecting out a few fields, writing protos back out. You want to be able
to nicely map protos to Beam schemas, but also preserve all the extra
metadata on proto fields. This metadata has to carry through on all
intermediate PCollections and schemas, so it makes sense to put it on the
field.

Another example: it provides a nice extension point for annotation
extensions to a field. These are often things like Optional or Taint
markings that don't change the semantic interpretation of the type (so
shouldn't be a logical type), but do provide extra information about the
field.

Reuven

On Fri, Feb 7, 2020 at 8:54 AM Brian Hulette <[email protected]> wrote:

> I'm not sure this belongs directly on schemas. I've had trouble
> reconciling that opinion, since the idea does seem very useful, and in fact
> I'm interested in using it myself. I think I've figured out my concern -
> what I really want is options for a (maybe portable) Table.
>
> As I indicated in a comment in the doc [1] I still think all of the
> examples you've described only apply to IOs. To be clear, what I mean is
> that all of the examples either
> 1) modify the behavior of the external system the IO is interfacing with
> (specify partitioning, indexing, etc..), or
> 2) define some transformation that should be done to the data adjacent to
> the IO (after an Input or before an Output) in Beam
>
> (1) Is the sort of thing you described in the IO section [2] (aside from
> the PubSub example I added, since that's describing a transformation to do
> in Beam)
> I would argue that all of the other examples fall under (2) - data
> validation, adding computed columns, encryption, etc... are things that can
> be done in a transform
>
> I think we can make an analogy to a traditional database here:
> schema-aware Beam IOs are like Tables in a database, other PCollections are
> like intermediate results in a query. In a database, Tables can be defined
> with some DDL and have schema-level or column-level options that change
> system behavior, but intermediate results have no such capability.
>
>
> Another point I think is worth discussing: is there value in making these
> options portable?
> As it's currently defined I'm not sure there is - everything could be done
> within a single SDK. However, portable options on a portable table could be
> very powerful, since it could be used to configure cross-language IOs,
> perhaps with something like https://s.apache.org/xlang-table-provider/
>
> [1]
> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?disco=AAAAI54si4k
> [2]
> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit#heading=h.8sjt9ax55hmt
>
> On Wed, Feb 5, 2020, 4:17 AM Alex Van Boxel <[email protected]> wrote:
>
>> I would appreciate if someone would look at the following PR and get it
>> to master:
>>
>> https://github.com/apache/beam/pull/10413#
>>
>> a lot of work needs to follow, but if we have the base already on master
>> the next layers can follow. As a reminder, this is the base proposal:
>>
>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing
>>
>> I've also looked for prior work, and saw that Spark actually has
>> something comparable:
>>
>> https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Row.html
>>
>> but when the options are finished it will be far more powerful as it is
>> not limited on fields.
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Wed, Jan 29, 2020 at 4:55 AM Kenneth Knowles <[email protected]> wrote:
>>
>>> Using schema types for the metadata values is a nice touch.
>>>
>>> Are the options expected to be common across many fields? Perhaps the
>>> name should be a URN to make it clear to be careful about collisions? (just
>>> a synonym for "name" in practice, but with different connotation)
>>>
>>> I generally like this... but the examples (all but one) are weird things
>>> that I don't really understand how they are done or who is responsible for
>>> them.
>>>
>>> One way to go is this: if options are maybe not understood by all
>>> consumers, then they can't really change behavior. Kind of like how URN and
>>> payload on a composite transform can be ignored and just the expansion used.
>>>
>>> Kenn
>>>
>>> On Sun, Jan 26, 2020 at 8:27 AM Alex Van Boxel <[email protected]> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I'm proud to announce my first real proposal. The proposal describes
>>>> Beam Schema Options. This is an extension to the Schema API to add typed
>>>> meta data to to Rows, Field and Logical Types:
>>>>
>>>>
>>>> https://docs.google.com/document/d/1yCCRU5pViVQIO8-YAb66VRh3I-kl0F7bMez616tgM8Q/edit?usp=sharing
>>>>
>>>> To give you some context where this proposal comes from: We've been
>>>> using dynamic meta driven pipelines for a while, but till now in an
>>>> awkward and hacky way (see my talks at the previous Beam Summits). This
>>>> proposal would bring a good way to work with metadata on the metadata :-).
>>>>
>>>> The proposal points to 2 pull requests with the implementation, one for
>>>> the API another for translating proto options to beam options.
>>>>
>>>> Thanks to Brian Hulette and Reuven Lax for the initial feedback. All
>>>> feedback is welcome.
>>>>
>>>>  _/
>>>> _/ Alex Van Boxel
>>>>
>>>

Re: [PROPOSAL] Beam Schema Options

Reply via email to