Re: [DISCUSS] Portability representation of schemas

Brian Hulette Thu, 13 Jun 2019 16:32:49 -0700

Yes that's pretty much what I had in mind. The one point I'm unsure about
is that I was thinking the *calling* SDK would need to insert the transform
to convert to/from Rows (unless it's an SDK that uses the portable
SchemaCoder everywhere and doesn't need a conversion). For example, Python
might do this in ExternalTransform's expand function [1]. I was thinking
that an expansion service would only serve transforms that operate on
PCollections with standard coders, so you wouldn't need a conversion there,
but maybe I'm mistaken.


Either way, you've captured the point: I think we could provide the
niceties of the Java Schema API, without including anything SDK-specific in
the portable representation of SchemaCoder, by having one JavaSchemaCoder
and one PortableSchemaCoder that we can convert between, transparently to
the user.
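
To make the two-coder idea concrete, here is a minimal sketch in plain
Python. It does not use any Beam API: the class names mirror the ones
discussed in this thread, but SdkSchemaCoder, the URN string, the Point
class, and the string type names are all hypothetical, chosen only for
illustration. The point it shows is that the SDK-side coder carries the
to/from functions, while only the schema crosses into the portable form.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Hypothetical URN for illustration only; not taken from the Beam protos.
PORTABLE_SCHEMA_CODER_URN = "example:coder:portable_schema:v1"

Schema = List[Tuple[str, str]]  # (field name, field type) pairs
Row = Tuple[Any, ...]


@dataclass(frozen=True)
class PortableSchemaCoder:
    """Portable side: carries only the schema, nothing SDK-specific."""
    schema: Schema
    urn: str = PORTABLE_SCHEMA_CODER_URN


@dataclass(frozen=True)
class SdkSchemaCoder:
    """SDK side: the schema plus the to/from functions behind a 'nice' API."""
    schema: Schema
    to_row: Callable[[Any], Row]
    from_row: Callable[[Row], Any]

    def to_portable(self) -> PortableSchemaCoder:
        # Drop the SDK-specific functions; only the schema is kept.
        return PortableSchemaCoder(self.schema)


@dataclass
class Point:
    """A user type standing in for a Java POJO."""
    x: int
    y: int


point_coder = SdkSchemaCoder(
    schema=[("x", "int64"), ("y", "int64")],
    to_row=lambda p: (p.x, p.y),
    from_row=lambda r: Point(*r),
)

portable = point_coder.to_portable()
row = point_coder.to_row(Point(1, 2))
```

In this sketch a cross-language caller would only ever see `portable`,
while the owning SDK keeps `point_coder` to map rows back to `Point`.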

I put up a PR [2] that updates the Schema representation based on Kenn's
"type-constructor based" alternative, and uses it in Java's
SchemaTranslation. It doesn't actually touch any of the coders yet; they're
all still just implemented as custom coders.
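
For readers who have not followed the earlier thread, this is roughly what
I mean by "type-constructor based": a field type is either an atomic type
or a constructor (array, map, row) applied to component types. The sketch
below is plain Python written for illustration; the names FieldType,
array_of, map_of, row_of, and describe are assumptions and are not the
definitions used in the PR.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FieldType:
    """Illustrative field type: a constructor name plus component types."""
    constructor: str                      # e.g. "int64", "array", "map", "row"
    components: Tuple["FieldType", ...] = ()
    field_names: Tuple[str, ...] = ()     # only used when constructor == "row"


INT64 = FieldType("int64")
STRING = FieldType("string")


def array_of(element: FieldType) -> FieldType:
    return FieldType("array", (element,))


def map_of(key: FieldType, value: FieldType) -> FieldType:
    return FieldType("map", (key, value))


def row_of(**fields: FieldType) -> FieldType:
    # Keyword order is preserved, so field order matches the call site.
    return FieldType("row", tuple(fields.values()), tuple(fields))


def describe(t: FieldType) -> str:
    """Render a field type as a readable string."""
    if t.constructor == "row":
        inner = ", ".join(
            f"{name}: {describe(c)}"
            for name, c in zip(t.field_names, t.components)
        )
        return f"row<{inner}>"
    if t.components:
        inner = ", ".join(describe(c) for c in t.components)
        return f"{t.constructor}<{inner}>"
    return t.constructor


user = row_of(name=STRING, tags=array_of(STRING), scores=map_of(STRING, INT64))
print(describe(user))
# prints: row<name: string, tags: array<string>, scores: map<string, int64>>
```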

[1]
https://github.com/apache/beam/blob/4c322107ca5ebc0ab1cc6581d957501fd3ed9cc4/sdks/python/apache_beam/transforms/external.py#L44
[2] https://github.com/apache/beam/pull/8853

On Thu, Jun 13, 2019 at 11:42 AM Reuven Lax <re...@google.com> wrote:

> Spoke to Brian about his proposal. It is essentially this:
>
> We create PortableSchemaCoder, with a well-known URN. This coder is
> parameterized by the schema (i.e. list of field name -> field type pairs).
>
> Java also continues to have its own CustomSchemaCoder. This is
> parameterized by the schema as well as the to/from functions needed to make
> the Java API "nice."
>
> When the expansion service expands a Java PTransform for usage across
> languages, it will add a transform mapping the PCollection with
> CustomSchemaCoder to a PCollection which has PortableSchemaCoder. This way
> Java can maintain the information needed to maintain its API (and Python
> can do the same), but there's no need to shove this information into the
> well-known portable representation.
>
> Brian, can you confirm that this was your proposal? If so, I like it.
>
> We've gone back and forth discussing this in the abstract for over a month now. I
> suggest that the next step should be to create a PR, and move discussion to
> that PR. Having actual code can often make discussion much more concrete.
>
> Reuven
>
> On Thu, Jun 13, 2019 at 6:28 AM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Thu, Jun 13, 2019 at 5:47 AM Reuven Lax <re...@google.com> wrote:
>>
>>>
>>> On Wed, Jun 12, 2019 at 8:29 PM Kenneth Knowles <k...@apache.org> wrote:
>>>
>>>> Can we choose a first step? I feel there's consensus around:
>>>>
>>>>  - the basic idea of what a schema looks like, ignoring logical types
>>>> or SDK-specific bits
>>>>  - the version of logical type which is a standardized URN+payload plus
>>>> a representation
>>>>
>>>> Perhaps we could commit this and see what it looks like to try to use
>>>> it?
>>>>
>>>
>> +1
>>
>>
>>>> It also seems like there might be consensus around the idea of each of:
>>>>
>>>>  - a coder that simply encodes rows; its payload is just a schema; it
>>>> is minimalist, canonical
>>>>
>>>>  - a coder that encodes a non-row using the serialization format of a
>>>> row; this has to be a coder (versus Convert transforms) so that to/from row
>>>> conversions can be elided when primitives are fused (just like to/from
>>>> bytes is elided)
>>>>
>>>
>> So, to make it concrete, in the Beam protos we would have an
>> [Elementwise]SchemaCoder whose single parameterization would be FieldType,
>> whose definition is in terms of URN + payload + components (+
>> representation, for non-primitive types, some details TBD there). It could
>> be deserialized into various different Coder instances (an SDK
>> implementation detail) in an SDK depending on the type. One of the most
>> important primitive field types is Row (aka Struct).
>>
>> We would define a byte encoding for each primitive type. We *could*
>> choose to simply require that the encoding of any non-row primitive is the
>> same as its encoding in a single-member row, but that's not necessary.
>>
>> In the short term, the window/timestamp/pane info would still live
>> outside via an enclosing WindowCoder, as it does now, not blocking on a
>> desirable but still-to-be-figured-out unification at that level.
>>
>> This seems like a good path forward.
>>
>>> Actually this doesn't make sense to me. I think from the portability
>>> perspective, all we have is schemas - the rest is just a convenience for
>>> the SDK. As such, I don't think it makes sense at all to model this as a
>>> Coder.
>>>
>>
>> Coders and Schemas are mutually exclusive on PCollections, and completely
>> specify type information, so I think it makes sense to reuse this (as we're
>> currently doing) until we can get rid of coders altogether.
>>
>> (At execution time, we would generalize the notion of a coder to indicate
>> how *batches* of elements are encoded, not just how individual elements are
>> encoded. Here we have the option of letting the runner pick depending on
>> the use (e.g. elementwise for key lookups vs. arrow for bulk data channel
>> transfer vs ???, possibly with parameters like "preferred batch size") or
>> standardizing on one physical byte representation for all communication
>> over the boundary.)
>>
>>
>>>
>>>
>>>>
>>>> Can we also just have both of these, with different URNs?
>>>>
>>>> Kenn
>>>>
>>>> On Wed, Jun 12, 2019 at 3:57 PM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 12, 2019 at 3:46 PM Robert Bradshaw <rober...@google.com>
>>>>> wrote:
>>>>>
>>>>>> On Tue, Jun 11, 2019 at 8:04 PM Kenneth Knowles <k...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> I believe the schema registry is a transient construction-time
>>>>>>> concept. I don't think there's any need for a concept of a registry in
>>>>>>> the portable representation.
>>>>>>>
>>>>>>>> I'd rather urn:beam:schema:logicaltype:javasdk not be used whenever
>>>>>>>> one has (say) a Java POJO as that would prevent other SDKs from
>>>>>>>> "understanding" it as above (unless we had a way of declaring it as
>>>>>>>> "just an alias/wrapper").
>>>>>>>>
>>>>>>>
>>>>>>> I didn't understand the example I snipped, but I think I understand
>>>>>>> your concern here. Is this what you want? (a) something presented as a
>>>>>>> POJO in Java (b) encoded to a row, but still decoded to the POJO and (c)
>>>>>>> non-Java SDK knows that it is "just a struct" so it is safe to mess
>>>>>>> about with or even create new ones. If this is what you want it seems
>>>>>>> potentially useful, but also easy to live without. This can also be done
>>>>>>> entirely within the Java SDK via conversions, leaving no logical type in
>>>>>>> the portable pipeline.
>>>>>>>
>>>>>>
>>>>>> I'm imagining a world where someone defines a PTransform that takes a
>>>>>> POJO for a constructor, and consumes and produces a POJO, and is now
>>>>>> usable from Go with no additional work on the PTransform author's part.
>>>>>> But maybe I'm thinking about this wrong and the POJO <-> Row conversion
>>>>>> is part of the @ProcessElement magic, not encoded in the schema itself.
>>>>>>
>>>>>
>>>>> The user's output would have to be explicitly schema. They would
>>>>> somehow have to tell Beam to infer a schema from the output POJO (e.g.
>>>>> one way to do this is to annotate the POJO with the @DefaultSchema
>>>>> annotation). We don't currently magically turn a POJO into a schema
>>>>> unless we are asked to do so.
>>>>>
>>>>
