Re: [DISCUSS] Portability representation of schemas

Reuven Lax Thu, 13 Jun 2019 19:41:57 -0700

As Luke mentioned above, we don't need to add a new mapping transform. We
can simply create a wrapping coder that wraps the Java coder.
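
To make the wrapping-coder idea concrete, here is a minimal sketch in plain
Python rather than Beam's actual API. Every name in it (PortableSchemaCoder
as written here, WrappingSchemaCoder, Purchase) is hypothetical, and JSON is
only a placeholder for whatever row encoding the portable coder ends up
using. The one point it tries to show is that the wrapper emits exactly the
bytes of the row coder, so no separate mapping transform is needed.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


class PortableSchemaCoder:
    """Hypothetical stand-in: parameterized only by the schema."""

    def __init__(self, schema: List[Tuple[str, str]]):
        # schema is a list of (field name, field type) pairs
        self.schema = list(schema)

    def encode(self, row: tuple) -> bytes:
        if len(row) != len(self.schema):
            raise ValueError("row does not match schema")
        # Placeholder wire format (an assumption, not Beam's row encoding).
        return json.dumps(list(row)).encode("utf-8")

    def decode(self, data: bytes) -> tuple:
        return tuple(json.loads(data.decode("utf-8")))


class WrappingSchemaCoder:
    """Hypothetical wrapper: adds SDK-specific to/from-row functions."""

    def __init__(self, portable: PortableSchemaCoder,
                 to_row: Callable[[Any], tuple],
                 from_row: Callable[[tuple], Any]):
        self.portable = portable
        self.to_row = to_row
        self.from_row = from_row

    def encode(self, value: Any) -> bytes:
        # Same bytes as the portable coder would produce for the row.
        return self.portable.encode(self.to_row(value))

    def decode(self, data: bytes) -> Any:
        return self.from_row(self.portable.decode(data))


@dataclass
class Purchase:
    user: str
    amount: int


portable = PortableSchemaCoder([("user", "string"), ("amount", "int64")])
coder = WrappingSchemaCoder(
    portable,
    to_row=lambda p: (p.user, p.amount),
    from_row=lambda r: Purchase(*r),
)
```

Under these assumptions, an SDK that only knows the portable coder can decode
what the wrapper encoded and sees a plain row, while the SDK that owns the
wrapper gets its user type back.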


On Thu, Jun 13, 2019 at 4:32 PM Brian Hulette <[email protected]> wrote:

> Yes that's pretty much what I had in mind. The one point I'm unsure about
> is that I was thinking the *calling* SDK would need to insert the transform
> to convert to/from Rows (unless it's an SDK that uses the portable
> SchemaCoder everywhere and doesn't need a conversion). For example, python
> might do this in ExternalTransform's expand function [1]. I was thinking
> that an expansion service would only serve transforms that operate on
> PCollections with standard coders, so you wouldn't need a conversion there,
> but maybe I'm mistaken.
>
> Either way, you've captured the point: I think we could provide the
> niceties of the Java Schema API, without including anything SDK-specific in
> the portable representation of SchemaCoder, by having one JavaSchemaCoder
> and one PortableSchemaCoder that we can convert between transparently to
> the user.
>
> I put up a PR [2] that updates the Schema representation based on Kenn's
> "type-constructor based" alternative, and uses it in Java's
> SchemaTranslation. It doesn't actually touch any of the coders yet; they're
> all still just implemented as custom coders.
>
> [1]
> https://github.com/apache/beam/blob/4c322107ca5ebc0ab1cc6581d957501fd3ed9cc4/sdks/python/apache_beam/transforms/external.py#L44
> [2] https://github.com/apache/beam/pull/8853
>
> On Thu, Jun 13, 2019 at 11:42 AM Reuven Lax <[email protected]> wrote:
>
>> Spoke to Brian about his proposal. It is essentially this:
>>
>> We create PortableSchemaCoder, with a well-known URN. This coder is
>> parameterized by the schema (i.e. list of field name -> field type pairs).
>>
>> Java also continues to have its own CustomSchemaCoder. This is
>> parameterized by the schema as well as the to/from functions needed to make
>> the Java API "nice."
>>
>> When the expansion service expands a Java PTransform for usage across
>> languages, it will add a transform mapping the PCollection with
>> CustomSchemaCoder to a PCollection which has PortableSchemaCoder. This way
>> Java can maintain the information needed to maintain its API (and Python
>> can do the same), but there's no need to shove this information into the
>> well-known portable representation.
>>
>> Brian, can you confirm that this was your proposal? If so, I like it.
>>
>> We've gone back and forth discussing this in the abstract for over a month
>> now. I suggest that the next step should be to create a PR, and move
>> discussion to that PR. Having actual code can often make discussion much
>> more concrete.
>>
>> Reuven
>>
>> On Thu, Jun 13, 2019 at 6:28 AM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> On Thu, Jun 13, 2019 at 5:47 AM Reuven Lax <[email protected]> wrote:
>>>
>>>>
>>>> On Wed, Jun 12, 2019 at 8:29 PM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>>
>>>>> Can we choose a first step? I feel there's consensus around:
>>>>>
>>>>>  - the basic idea of what a schema looks like, ignoring logical types
>>>>> or SDK-specific bits
>>>>>  - the version of logical type which is a standardized URN+payload
>>>>> plus a representation
>>>>>
>>>>> Perhaps we could commit this and see what it looks like to try to use
>>>>> it?
>>>>>
>>>>
>>> +1
>>>
>>>
>>>> It also seems like there might be consensus around the idea of each of:
>>>>>
>>>>>  - a coder that simply encodes rows; its payload is just a schema; it
>>>>> is minimalist, canonical
>>>>>
>>>>  - a coder that encodes a non-row using the serialization format of a
>>>>> row; this has to be a coder (versus Convert transforms) so that to/from
>>>>> row conversions can be elided when primitives are fused (just like
>>>>> to/from bytes is elided)
>>>>>
>>>>
>>> So, to make it concrete, in the Beam protos we would have an
>>> [Elementwise]SchemaCoder whose single parameterization would be FieldType,
>>> whose definition is in terms of URN + payload + components (+
>>> representation, for non-primitive types, some details TBD there). It could
>>> be deserialized into various different Coder instances (an SDK
>>> implementation detail) in an SDK depending on the type. One of the most
>>> important primitive field types is Row (aka Struct).
>>>
>>> We would define a byte encoding for each primitive type. We *could*
>>> choose to simply require that the encoding of any non-row primitive is the
>>> same as its encoding in a single-member row, but that's not necessary.
>>>
>>> In the short term, the window/timestamp/pane info would still live
>>> outside via an enclosing WindowCoder, as it does now, not blocking on a
>>> desirable but still-to-be-figured-out unification at that level.
>>>
>>> This seems like a good path forward.
>>>
>>> Actually this doesn't make sense to me. I think from the portability
>>>> perspective, all we have is schemas - the rest is just a convenience for
>>>> the SDK. As such, I don't think it makes sense at all to model this as a
>>>> Coder.
>>>>
>>>
>>> Coder and Schemas are mutually exclusive on PCollections, and completely
>>> specify type information, so I think it makes sense to reuse this (as we're
>>> currently doing) until we can get rid of coders altogether.
>>>
>>> (At execution time, we would generalize the notion of a coder to
>>> indicate how *batches* of elements are encoded, not just how individual
>>> elements are encoded. Here we have the option of letting the runner pick
>>> depending on the use (e.g. elementwise for key lookups vs. arrow for bulk
>>> data channel transfer vs ???, possibly with parameters like "preferred
>>> batch size") or standardizing on one physical byte representation for all
>>> communication over the boundary.)
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> Can we also just have both of these, with different URNs?
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Wed, Jun 12, 2019 at 3:57 PM Reuven Lax <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 12, 2019 at 3:46 PM Robert Bradshaw <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On Tue, Jun 11, 2019 at 8:04 PM Kenneth Knowles <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I believe the schema registry is a transient construction-time
>>>>>>>> concept. I don't think there's any need for a concept of a registry
>>>>>>>> in the portable representation.
>>>>>>>>
>>>>>>>> I'd rather urn:beam:schema:logicaltype:javasdk not be used whenever
>>>>>>>>> one has (say) a Java POJO as that would prevent other SDKs from
>>>>>>>>> "understanding" it as above (unless we had a way of declaring it as
>>>>>>>>> "just an alias/wrapper").
>>>>>>>>>
>>>>>>>>
>>>>>>>> I didn't understand the example I snipped, but I think I understand
>>>>>>>> your concern here. Is this what you want? (a) something presented as a
>>>>>>>> POJO in Java (b) encoded to a row, but still decoded to the POJO and (c)
>>>>>>>> non-Java SDK knows that it is "just a struct" so it is safe to mess
>>>>>>>> about with or even create new ones. If this is what you want it seems
>>>>>>>> potentially useful, but also easy to live without. This can also be done
>>>>>>>> entirely within the Java SDK via conversions, leaving no logical type in
>>>>>>>> the portable pipeline.
>>>>>>>>
>>>>>>>
>>>>>>> I'm imagining a world where someone defines a PTransform that takes a
>>>>>>> POJO for a constructor, and consumes and produces a POJO, and is now
>>>>>>> usable from Go with no additional work on the PTransform author's part.
>>>>>>> But maybe I'm thinking about this wrong and the POJO <-> Row conversion
>>>>>>> is part of the @ProcessElement magic, not encoded in the schema itself.
>>>>>>>
>>>>>>
>>>>>> The user's output would have to be explicitly schema. They would
>>>>>> somehow have to tell Beam to infer a schema from the output POJO (e.g.
>>>>>> one way to do this is to annotate the POJO with the @DefaultSchema
>>>>>> annotation). We don't currently magically turn a POJO into a schema
>>>>>> unless we are asked to do so.
>>>>>>
>>>>>
