Re: [DISCUSS] Portability representation of schemas

Robert Bradshaw Thu, 13 Jun 2019 06:29:00 -0700

On Thu, Jun 13, 2019 at 5:47 AM Reuven Lax <re...@google.com> wrote:

>
> On Wed, Jun 12, 2019 at 8:29 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> Can we choose a first step? I feel there's consensus around:
>>
>>  - the basic idea of what a schema looks like, ignoring logical types or
>> SDK-specific bits
>>  - the version of logical type which is a standardized URN+payload plus a
>> representation
>>
>> Perhaps we could commit this and see what it looks like to try to use it?
>>
>
+1



>> It also seems like there might be consensus around the idea of each of:
>>
>>  - a coder that simply encodes rows; its payload is just a schema; it is
>> minimalist, canonical
>>
>>  - a coder that encodes a non-row using the serialization format of a row;
>> this has to be a coder (versus Convert transforms) so that to/from row
>> conversions can be elided when primitives are fused (just like to/from
>> bytes is elided)
>>
>
So, to make it concrete, in the Beam protos we would have an
[Elementwise]SchemaCoder whose single parameterization would be FieldType,
whose definition is in terms of URN + payload + components (+
representation, for non-primitive types, some details TBD there). It could
be deserialized into various different Coder instances (an SDK
implementation detail) in an SDK depending on the type. One of the most
important primitive field types is Row (aka Struct).
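As a rough illustration of the shape described above, here is a minimal
Python sketch of a FieldType defined by URN + payload + components (+
representation) and a coder parameterized only by it. The class layout, the
URN strings, and the physical_type helper are all hypothetical, chosen for
illustration; they are not the actual Beam proto definitions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FieldType:
    """Hypothetical model: URN + payload + components (+ representation)."""
    urn: str
    payload: bytes = b""
    components: Tuple["FieldType", ...] = ()
    # Only set for non-primitive (logical) types: the type used to encode it.
    representation: Optional["FieldType"] = None


@dataclass(frozen=True)
class SchemaCoder:
    """Coder whose single parameterization is a FieldType."""
    field_type: FieldType


def physical_type(field_type: FieldType) -> FieldType:
    """Follow representation links until a primitive type is reached."""
    while field_type.representation is not None:
        field_type = field_type.representation
    return field_type


# Made-up URNs, for illustration only.
INT64 = FieldType(urn="example:fieldtype:int64")
STRING = FieldType(urn="example:fieldtype:string")
ROW = FieldType(urn="example:fieldtype:row", payload=b"id,name",
                components=(INT64, STRING))
MILLIS = FieldType(urn="example:logicaltype:millis_instant",
                   representation=INT64)

row_coder = SchemaCoder(ROW)
```

An SDK that does not recognize the logical type's URN could still fall back
to physical_type(MILLIS), which resolves to INT64 in this sketch.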

We would define a byte encoding for each primitive type. We *could* choose
to simply require that the encoding of any non-row primitive is the same as
its encoding in a single-member row, but that's not necessary.
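To make the two options concrete, here is a toy Python sketch comparing a
bare primitive encoding with the same value encoded as a single-member row.
The byte layout (big-endian int64, length-prefixed strings, a one-byte field
count as the row header) is an assumption made up for this example and is
not Beam's actual wire format.

```python
import struct


def encode_int64(value: int) -> bytes:
    # Toy encoding: 8-byte big-endian signed integer.
    return struct.pack(">q", value)


def encode_string(value: str) -> bytes:
    # Toy encoding: 4-byte big-endian length prefix, then UTF-8 bytes.
    data = value.encode("utf-8")
    return struct.pack(">I", len(data)) + data


# Hypothetical type names mapped to their toy encoders.
ENCODERS = {"int64": encode_int64, "string": encode_string}


def encode_row(field_types, values) -> bytes:
    # Toy row format: one-byte field count, then each field in order.
    if len(field_types) != len(values):
        raise ValueError("field_types and values must have the same length")
    out = bytes([len(values)])
    for field_type, value in zip(field_types, values):
        out += ENCODERS[field_type](value)
    return out


def encode_bare(field_type, value) -> bytes:
    # Option A: a non-row primitive has its own standalone encoding.
    return ENCODERS[field_type](value)


def encode_as_single_member_row(field_type, value) -> bytes:
    # Option B: a non-row primitive is encoded as a one-field row.
    return encode_row([field_type], [value])
```

In this toy format the two options differ only by the row header byte, which
is the kind of overhead the "not necessary" remark leaves room to avoid.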

In the short term, the window/timestamp/pane info would still live outside
via an enclosing WindowCoder, as it does now, not blocking on a desirable
but still-to-be-figured-out unification at that level.

This seems like a good path forward.

> Actually this doesn't make sense to me. I think from the portability
> perspective, all we have is schemas - the rest is just a convenience for
> the SDK. As such, I don't think it makes sense at all to model this as a
> Coder.
>

Coders and Schemas are mutually exclusive on PCollections, and completely
specify type information, so I think it makes sense to reuse this (as we're
currently doing) until we can get rid of coders altogether.

(At execution time, we would generalize the notion of a coder to indicate
how *batches* of elements are encoded, not just how individual elements are
encoded. Here we have the option of letting the runner pick depending on
the use (e.g. elementwise for key lookups vs. arrow for bulk data channel
transfer vs ???, possibly with parameters like "preferred batch size") or
standardizing on one physical byte representation for all communication
over the boundary.)
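As a toy sketch of what "how batches are encoded" could mean, the Python
below groups rows by a preferred batch size and encodes a batch either
element by element or column by column. The two-int64 row shape, the
function names, and both byte layouts are assumptions for illustration; the
columnar variant only gestures at an Arrow-like layout and is not Arrow.

```python
import struct


def batches(rows, preferred_batch_size):
    """Group rows into lists of at most preferred_batch_size elements."""
    if preferred_batch_size < 1:
        raise ValueError("preferred_batch_size must be at least 1")
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == preferred_batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


def encode_elementwise(batch) -> bytes:
    # Each (a, b) row is encoded on its own, one after another.
    return b"".join(struct.pack(">qq", a, b) for a, b in batch)


def encode_columnar(batch) -> bytes:
    # All values of the first field, then all values of the second field.
    n = len(batch)
    col_a = struct.pack(f">{n}q", *(a for a, _ in batch))
    col_b = struct.pack(f">{n}q", *(b for _, b in batch))
    return col_a + col_b
```

Both functions carry the same information; the difference is only in the
physical layout, which is the part a runner could be allowed to choose.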


>
>
>>
>> Can we also just have both of these, with different URNs?
>>
>> Kenn
>>
>> On Wed, Jun 12, 2019 at 3:57 PM Reuven Lax <re...@google.com> wrote:
>>
>>>
>>>
>>> On Wed, Jun 12, 2019 at 3:46 PM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> On Tue, Jun 11, 2019 at 8:04 PM Kenneth Knowles <k...@apache.org>
>>>> wrote:
>>>>
>>>>>
>>>>> I believe the schema registry is a transient construction-time
>>>>> concept. I don't think there's any need for a concept of a registry in the
>>>>> portable representation.
>>>>>
>>>>>> I'd rather urn:beam:schema:logicaltype:javasdk not be used whenever
>>>>>> one has (say) a Java POJO as that would prevent other SDKs from
>>>>>> "understanding" it as above (unless we had a way of declaring it as "just
>>>>>> an alias/wrapper").
>>>>>>
>>>>>
>>>>> I didn't understand the example I snipped, but I think I understand
>>>>> your concern here. Is this what you want? (a) something presented as a
>>>>> POJO in Java (b) encoded to a row, but still decoded to the POJO and (c)
>>>>> non-Java SDK knows that it is "just a struct" so it is safe to mess about
>>>>> with or even create new ones. If this is what you want it seems
>>>>> potentially useful, but also easy to live without. This can also be done
>>>>> entirely within the Java SDK via conversions, leaving no logical type in
>>>>> the portable pipeline.
>>>>>
>>>>
>>>> I'm imagining a world where someone defines a PTransform that takes a
>>>> POJO for a constructor, and consumes and produces a POJO, and is now usable
>>>> from Go with no additional work on the PTransform author's part.  But maybe
>>>> I'm thinking about this wrong and the POJO <-> Row conversion is part of
>>>> the @ProcessElement magic, not encoded in the schema itself.
>>>>
>>>
>>> The user's output would have to be explicitly schema. They would somehow
>>> have to tell Beam to infer a schema from the output POJO (e.g. one way to
>>> do this is to annotate the POJO with the @DefaultSchema annotation).  We
>>> don't currently magically turn a POJO into a schema unless we are asked to
>>> do so.
>>>
>>
