Re: [DISCUSS] Portability representation of schemas

Kenneth Knowles Tue, 11 Jun 2019 11:05:06 -0700

Snipping because the context is getting out of hand.

On Mon, Jun 10, 2019 at 3:42 PM Robert Bradshaw <[email protected]> wrote:


> On Mon, Jun 10, 2019 at 11:53 PM Kenneth Knowles <[email protected]> wrote:
>
>> Most things you would do directly to a representation without knowing
>> what it represents are going to be nonsense. But not everything. Two things
>> that come to mind: (1) you might do a pipeline upgrade and widen the set of
>> fields, (2) you might transpose from row-oriented to column-oriented
>> encoding (more generally schemas may allow a variety of meta-formats).
>> Notably in (2) the multiple fields in a logical type are not actually
>> represented as a contiguous bytestring.
>>
>
> Yes. For all of these, I'd say it understands the encoding, but not the
> type itself. This also seems to suggest that logical types are more than
> aliases, or mappings to an SDK-specific representation.
>

Definitely not just aliases, nor just mappings to SDK-specific
representations. The URN (+ payload) should determine the mathematical set
of values foremost.
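
As a rough illustration of "understands the encoding, but not the type",
here is a minimal Python sketch. It is not Beam's actual proto or API:
FieldType, Field, rows_to_columns, and the urn:example:timestamp_micros URN
are all hypothetical names made up for this example. The transposition from
case (2) only needs the representation, and never looks at the URN or payload.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldType:
    """Hypothetical stand-in for a portable field type."""
    representation: str                 # how values are encoded, e.g. "INT64"
    logical_urn: Optional[str] = None   # names the set of values (with payload)
    logical_payload: bytes = b""        # parameters for the URN, if any


@dataclass(frozen=True)
class Field:
    name: str
    type: FieldType


def rows_to_columns(schema, rows):
    """Transpose row-oriented data to column-oriented data.

    Only field names and positions are used; the logical URN is ignored.
    """
    return {f.name: [row[i] for row in rows] for i, f in enumerate(schema)}


schema = [
    Field("id", FieldType("INT64")),
    # Hypothetical logical type layered over an INT64 representation.
    Field("when", FieldType("INT64", "urn:example:timestamp_micros")),
]
columns = rows_to_columns(schema, [(1, 10), (2, 20)])
```

The URN still travels with the schema, so anything that does understand it
can interpret the "when" column after the transposition.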

> (It may be valuable to consider allowing attributes such as "this is an
> ordered type whose ordering is the same as its representation" which could
> allow for more operations to be performed without a complete understanding.)
>

Yes, this seems like potentially valuable metadata, though it may be implicit.


>> Pipeline-level scoping should only be transient ids generated as fresh
>> identifiers.
>>
>> As with all URNs in Beam, there's the possibility that libraries go and
>> choose the same URN for the transforms, coders, logical types. URLs thus
>> have an authority section, but I don't think we have to solve that. By
>> default aliases that a library or user defines can just be
>> "urn:beam:schema:logicaltype:javasdk" with a to/from/clazz payload. And to
>> take that to "urn:beam:schema:logicaltype:my_standardized_type" should
>> really go through dev@ and some constant in a proto file, and will have
>> coding overhead in the SDK to make sure the toProto function uses that
>> instead of the default URN. A library might make up a namespace without
>> going through dev@ and that will be mostly harmless.
>>
>
> It sounded like the registry was a way of saying "for this particular
> class, use this FieldType" which could run into issues if library A and
> library B both try to register something for a class defined in library
> (possibly the standard library) C. Or, even, "for this URN, please use this
> particular Class (and its associated FieldType)." And that these
> registrations would somehow have to be preserved for execution.
>

I believe the schema registry is a transient construction-time concept. I
don't think there's any need for a concept of a registry in the portable
representation.
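
A small Python sketch of what I mean, under assumptions: the registry dict,
the class names, and to_portable are hypothetical and not the Java SDK's
API. Only the two URN strings come from this thread. The registry is
consulted while the pipeline is being built, and only the resolved URN and
payload end up in the portable form.

```python
# Default alias URN mentioned earlier in the thread.
DEFAULT_URN = "urn:beam:schema:logicaltype:javasdk"

# Hypothetical construction-time registry: class name -> standardized URN.
registry = {
    "MyStandardizedType": "urn:beam:schema:logicaltype:my_standardized_type",
}


def to_portable(class_name: str, payload: bytes) -> dict:
    """Resolve the URN at construction time.

    The returned dict stands in for the portable representation; it carries
    no reference to the registry that produced it.
    """
    urn = registry.get(class_name, DEFAULT_URN)
    return {"urn": urn, "payload": payload}


standardized = to_portable("MyStandardizedType", b"")
fallback = to_portable("SomeUserClass", b"to/from/clazz")
```

Here a class with no registered standardized type falls back to the default
alias URN, with the to/from/clazz information carried in the payload.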

> I'd rather urn:beam:schema:logicaltype:javasdk not be used whenever one has
> (say) a Java POJO as that would prevent other SDKs from "understanding" it
> as above (unless we had a way of declaring it as "just an alias/wrapper").
>

I didn't understand the example I snipped, but I think I understand your
concern here. Is this what you want? (a) Something is presented as a POJO in
Java, (b) it is encoded to a row but still decoded to the POJO, and (c) a
non-Java SDK knows that it is "just a struct", so it is safe to mess about
with or even create new ones. If this is what you want, it seems potentially
useful, but also easy to live without. This can also be done entirely within
the Java SDK via conversions, leaving no logical type in the portable pipeline.
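
A minimal sketch of the conversion-only approach, written in Python with a
dataclass standing in for the Java POJO. The names Point, schema_of, to_row,
and from_row are hypothetical, not SDK functions. The portable side sees
only plain field names and a tuple of values, with no logical type attached.

```python
from dataclasses import astuple, dataclass, fields


@dataclass
class Point:
    """Stands in for the Java POJO; only this SDK knows about the class."""
    x: int
    y: int


def schema_of(cls):
    """Field names only: what another SDK would see as 'just a struct'."""
    return [f.name for f in fields(cls)]


def to_row(obj):
    """SDK-local conversion from the object to a plain row."""
    return astuple(obj)


def from_row(cls, row):
    """SDK-local conversion from a plain row back to the object."""
    return cls(*row)


row = to_row(Point(1, 2))
# Another SDK could build a new row without knowing about Point at all.
foreign_row = (3, 4)
decoded = from_row(Point, foreign_row)
```

Since the conversions live entirely in the SDK, a row created elsewhere
decodes to the same class as one produced locally.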

Kenn
