Yes that's pretty much what I had in mind. The one point I'm unsure about is that I was thinking the *calling* SDK would need to insert the transform to convert to/from Rows (unless it's an SDK that uses the portable SchemaCoder everywhere and doesn't need a conversion). For example, python might do this in ExternalTransform's expand function [1]. I was thinking that an expansion service would only serve transforms that operate on PCollections with standard coders, so you wouldn't need a conversion there, but maybe I'm mistaken.
Either way, you've captured the point: I think we could provide the niceties of the Java Schema API, without including anything SDK-specific in the portable representation of SchemaCoder, by having one JavaSchemaCoder and one PortableSchemaCoder that we can convert between transparent to the user. I put up a PR [2] that updates the Schema representation based on Kenn's "type-constructor based" alternative, and uses it in Java's SchemaTranslation. It doesn't actually touch any of the coders yet, they're all still just implemented as custom coders. [1] https://github.com/apache/beam/blob/4c322107ca5ebc0ab1cc6581d957501fd3ed9cc4/sdks/python/apache_beam/transforms/external.py#L44 [2] https://github.com/apache/beam/pull/8853 On Thu, Jun 13, 2019 at 11:42 AM Reuven Lax <[email protected]> wrote: > Spoke to Brian about his proposal. It is essentially this: > > We create PortableSchemaCoder, with a well-known URN. This coder is > parameterized by the schema (i.e. list of field name -> field type pairs). > > Java also continues to have its own CustomSchemaCoder. This is > parameterized by the schema as well as the to/from functions needed to make > the Java API "nice." > > When the expansion service expands a Java PTransform for usage across > languages, it will add a transform mapping the PCollection with > CustomSchemaCoder to a PCollection which has PortableSchemaCoder. This way > Java can maintain the information needed to maintain its API (and Python > can do the same), but there's no need to shove this information into the > well-known portable representation. > > Brian, can you confirm that this was your proposal? If so, I like it. > > We've gone back and forth discussing abstracts for over a month now. I > suggest that the next step should be to create a PR, and move discussion to > that PR. Having actual code can often make discussion much more concrete. > > Reuven > > On Thu, Jun 13, 2019 at 6:28 AM Robert Bradshaw <[email protected]> > wrote: > >> On Thu, Jun 13, 2019 at 5:47 AM Reuven Lax <[email protected]> wrote: >> >>> >>> On Wed, Jun 12, 2019 at 8:29 PM Kenneth Knowles <[email protected]> wrote: >>> >>>> Can we choose a first step? I feel there's consensus around: >>>> >>>> - the basic idea of what a schema looks like, ignoring logical types >>>> or SDK-specific bits >>>> - the version of logical type which is a standardized URN+payload plus >>>> a representation >>>> >>>> Perhaps we could commit this and see what it looks like to try to use >>>> it? >>>> >>> >> +1 >> >> >>> It also seems like there might be consensus around the idea of each of: >>>> >>>> - a coder that simply encodes rows; its payload is just a schema; it >>>> is minimalist, canonical >>>> >>> - a coder that encodes a non-row using the serialization format of a >>>> row; this has to be a coder (versus Convert transforms) so that to/from row >>>> conversions can be elided when primitives are fused (just like to/from >>>> bytes is elided) >>>> >>> >> So, to make it concrete, in the Beam protos we would have an >> [Elementwise]SchemaCoder whose single parameterization would be FieldType, >> whose definition is in terms of URN + payload + components (+ >> representation, for non-primitive types, some details TBD there). It could >> be deserialized into various different Coder instances (an SDK >> implementation detail) in an SDK depending on the type. One of the most >> important primitive field types is Row (aka Struct). >> >> We would define a byte encoding for each primitive type. We *could* >> choose to simply require that the encoding of any non-row primitive is the >> same as its encoding in a single-member row, but that's not necessary. >> >> In the short term, the window/timestamp/pane info would still live >> outside via an enclosing WindowCoder, as it does now, not blocking on a >> desirable but still-to-be-figured-out unification at that level. >> >> This seems like a good path forward. >> >> Actually this doesn't make sense to me. I think from the portability >>> perspective, all we have is schemas - the rest is just a convenience for >>> the SDK. As such, I don't think it makes sense at all to model this as a >>> Coder. >>> >> >> Coder and Schemas are mutually exclusive on PCollections, and completely >> specify type information, so I think it makes sense to reuse this (as we're >> currently doing) until we can get rid of coders altogether. >> >> (At execution time, we would generalize the notion of a coder to indicate >> how *batches* of elements are encoded, not just how individual elements are >> encoded. Here we have the option of letting the runner pick depending on >> the use (e.g. elementwise for key lookups vs. arrow for bulk data channel >> transfer vs ???, possibly with parameters like "preferred batch size") or >> standardizing on one physical byte representation for all communication >> over the boundary.) >> >> >>> >>> >>>> >>>> Can we also just have both of these, with different URNs? >>>> >>>> Kenn >>>> >>>> On Wed, Jun 12, 2019 at 3:57 PM Reuven Lax <[email protected]> wrote: >>>> >>>>> >>>>> >>>>> On Wed, Jun 12, 2019 at 3:46 PM Robert Bradshaw <[email protected]> >>>>> wrote: >>>>> >>>>>> On Tue, Jun 11, 2019 at 8:04 PM Kenneth Knowles <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> I believe the schema registry is a transient construction-time >>>>>>> concept. I don't think there's any need for a concept of a registry in >>>>>>> the >>>>>>> portable representation. >>>>>>> >>>>>>> I'd rather urn:beam:schema:logicaltype:javasdk not be used whenever >>>>>>>> one has (say) a Java POJO as that would prevent other SDKs from >>>>>>>> "understanding" it as above (unless we had a way of declaring it as >>>>>>>> "just >>>>>>>> an alias/wrapper"). >>>>>>>> >>>>>>> >>>>>>> I didn't understand the example I snipped, but I think I understand >>>>>>> your concern here. Is this what you want? (a) something presented as a >>>>>>> POJO >>>>>>> in Java (b) encoded to a row, but still decoded to the POJO and (c) >>>>>>> non-Java SDK knows that it is "just a struct" so it is safe to mess >>>>>>> about >>>>>>> with or even create new ones. If this is what you want it seems >>>>>>> potentially >>>>>>> useful, but also easy to live without. This can also be done entirely >>>>>>> within the Java SDK via conversions, leaving no logical type in the >>>>>>> portable pipeline. >>>>>>> >>>>>> >>>>>> I'm imaging a world where someone defines a PTransform that takes a >>>>>> POJO for a constructor, and consumes and produces a POJO, and is now >>>>>> usable >>>>>> from Go with no additional work on the PTransform author's part. But >>>>>> maybe >>>>>> I'm thinking about this wrong and the POJO <-> Row conversion is part of >>>>>> the @ProcesssElement magic, not encoded in the schema itself. >>>>>> >>>>> >>>>> The user's output would have to be explicitly schema. They would >>>>> somehow have to tell Beam the infer a schema from the output POJO (e.g. >>>>> one >>>>> way to do this is to annotate the POJO with the @DefaultSchema >>>>> annotation). We don't currently magically turn a POJO into a schema >>>>> unless >>>>> we are asked to do so. >>>>> >>>>
