As Luke mentioned above, we don't need to add a new mapping transform. We can simply create a wrapping coder that wraps the Java coder.
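A rough sketch of what I mean, in plain Python rather than real Beam code. Everything here is hypothetical: `PortableRowCoder`, `WrappingCoder`, and the JSON byte format are stand-ins I made up for illustration, not Beam APIs or the actual row encoding. This is also only one possible reading of "wrapping": the wrapper keeps the SDK-specific to/from-row functions and delegates the bytes to a coder that knows only the schema.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


class PortableRowCoder:
    """Hypothetical stand-in for a portable schema coder: it knows only the schema."""

    def __init__(self, schema: List[Tuple[str, type]]):
        self.schema = schema

    def encode(self, row: tuple) -> bytes:
        # Validate the row against the schema before serializing it.
        if len(row) != len(self.schema):
            raise ValueError("row length does not match schema")
        for (name, field_type), value in zip(self.schema, row):
            if not isinstance(value, field_type):
                raise TypeError(f"field {name!r} expects {field_type.__name__}")
        # JSON is an assumption made for this sketch, not a real wire format.
        return json.dumps(list(row)).encode("utf-8")

    def decode(self, data: bytes) -> tuple:
        return tuple(json.loads(data.decode("utf-8")))


class WrappingCoder:
    """Encodes SDK-specific objects by delegating to the portable row coder."""

    def __init__(
        self,
        row_coder: PortableRowCoder,
        to_row: Callable[[Any], tuple],
        from_row: Callable[[tuple], Any],
    ):
        self.row_coder = row_coder
        self.to_row = to_row
        self.from_row = from_row

    def encode(self, value: Any) -> bytes:
        # Convert to a row first, so the bytes match the portable encoding.
        return self.row_coder.encode(self.to_row(value))

    def decode(self, data: bytes) -> Any:
        return self.from_row(self.row_coder.decode(data))


@dataclass
class User:
    name: str
    age: int


schema = [("name", str), ("age", int)]
row_coder = PortableRowCoder(schema)
user_coder = WrappingCoder(
    row_coder,
    to_row=lambda u: (u.name, u.age),
    from_row=lambda r: User(*r),
)
```

Because the wrapper produces exactly the bytes the row coder would, another SDK can decode them as a plain row without a separate conversion transform in the pipeline.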

On Thu, Jun 13, 2019 at 4:32 PM Brian Hulette <[email protected]> wrote:

> Yes that's pretty much what I had in mind. The one point I'm unsure about
> is that I was thinking the *calling* SDK would need to insert the transform
> to convert to/from Rows (unless it's an SDK that uses the portable
> SchemaCoder everywhere and doesn't need a conversion). For example, python
> might do this in ExternalTransform's expand function [1]. I was thinking
> that an expansion service would only serve transforms that operate on
> PCollections with standard coders, so you wouldn't need a conversion there,
> but maybe I'm mistaken.
>
> Either way, you've captured the point: I think we could provide the
> niceties of the Java Schema API, without including anything SDK-specific in
> the portable representation of SchemaCoder, by having one JavaSchemaCoder
> and one PortableSchemaCoder that we can convert between transparent to the
> user.
>
> I put up a PR [2] that updates the Schema representation based on Kenn's
> "type-constructor based" alternative, and uses it in Java's
> SchemaTranslation. It doesn't actually touch any of the coders yet, they're
> all still just implemented as custom coders.
>
> [1]
> https://github.com/apache/beam/blob/4c322107ca5ebc0ab1cc6581d957501fd3ed9cc4/sdks/python/apache_beam/transforms/external.py#L44
> [2] https://github.com/apache/beam/pull/8853
>
> On Thu, Jun 13, 2019 at 11:42 AM Reuven Lax <[email protected]> wrote:
>
>> Spoke to Brian about his proposal. It is essentially this:
>>
>> We create PortableSchemaCoder, with a well-known URN. This coder is
>> parameterized by the schema (i.e. list of field name -> field type pairs).
>>
>> Java also continues to have its own CustomSchemaCoder. This is
>> parameterized by the schema as well as the to/from functions needed to make
>> the Java API "nice."
>>
>> When the expansion service expands a Java PTransform for usage across
>> languages, it will add a transform mapping the PCollection with
>> CustomSchemaCoder to a PCollection which has PortableSchemaCoder. This way
>> Java can maintain the information needed to maintain its API (and Python
>> can do the same), but there's no need to shove this information into the
>> well-known portable representation.
>>
>> Brian, can you confirm that this was your proposal? If so, I like it.
>>
>> We've gone back and forth discussing abstracts for over a month now. I
>> suggest that the next step should be to create a PR, and move discussion to
>> that PR. Having actual code can often make discussion much more concrete.
>>
>> Reuven
>>
>> On Thu, Jun 13, 2019 at 6:28 AM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>> On Thu, Jun 13, 2019 at 5:47 AM Reuven Lax <[email protected]> wrote:
>>>
>>>>
>>>> On Wed, Jun 12, 2019 at 8:29 PM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>>
>>>>> Can we choose a first step? I feel there's consensus around:
>>>>>
>>>>> - the basic idea of what a schema looks like, ignoring logical types
>>>>>   or SDK-specific bits
>>>>> - the version of logical type which is a standardized URN+payload
>>>>>   plus a representation
>>>>>
>>>>> Perhaps we could commit this and see what it looks like to try to use
>>>>> it?
>>>>>
>>>>
>>> +1
>>>
>>>
>>>>> It also seems like there might be consensus around the idea of each of:
>>>>>
>>>>> - a coder that simply encodes rows; its payload is just a schema; it
>>>>>   is minimalist, canonical
>>>>>
>>>>> - a coder that encodes a non-row using the serialization format of a
>>>>>   row; this has to be a coder (versus Convert transforms) so that
>>>>>   to/from row conversions can be elided when primitives are fused
>>>>>   (just like to/from bytes is elided)
>>>>>
>>>>
>>> So, to make it concrete, in the Beam protos we would have an
>>> [Elementwise]SchemaCoder whose single parameterization would be FieldType,
>>> whose definition is in terms of URN + payload + components (+
>>> representation, for non-primitive types, some details TBD there). It could
>>> be deserialized into various different Coder instances (an SDK
>>> implementation detail) in an SDK depending on the type. One of the most
>>> important primitive field types is Row (aka Struct).
>>>
>>> We would define a byte encoding for each primitive type. We *could*
>>> choose to simply require that the encoding of any non-row primitive is the
>>> same as its encoding in a single-member row, but that's not necessary.
>>>
>>> In the short term, the window/timestamp/pane info would still live
>>> outside via an enclosing WindowCoder, as it does now, not blocking on a
>>> desirable but still-to-be-figured-out unification at that level.
>>>
>>> This seems like a good path forward.
>>>
>>>> Actually this doesn't make sense to me. I think from the portability
>>>> perspective, all we have is schemas - the rest is just a convenience for
>>>> the SDK. As such, I don't think it makes sense at all to model this as a
>>>> Coder.
>>>>
>>>
>>> Coder and Schemas are mutually exclusive on PCollections, and completely
>>> specify type information, so I think it makes sense to reuse this (as we're
>>> currently doing) until we can get rid of coders altogether.
>>>
>>> (At execution time, we would generalize the notion of a coder to
>>> indicate how *batches* of elements are encoded, not just how individual
>>> elements are encoded. Here we have the option of letting the runner pick
>>> depending on the use (e.g. elementwise for key lookups vs. arrow for bulk
>>> data channel transfer vs ???, possibly with parameters like "preferred
>>> batch size") or standardizing on one physical byte representation for all
>>> communication over the boundary.)
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> Can we also just have both of these, with different URNs?
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Wed, Jun 12, 2019 at 3:57 PM Reuven Lax <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 12, 2019 at 3:46 PM Robert Bradshaw <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On Tue, Jun 11, 2019 at 8:04 PM Kenneth Knowles <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I believe the schema registry is a transient construction-time
>>>>>>>> concept. I don't think there's any need for a concept of a registry
>>>>>>>> in the portable representation.
>>>>>>>>
>>>>>>>>> I'd rather urn:beam:schema:logicaltype:javasdk not be used whenever
>>>>>>>>> one has (say) a Java POJO as that would prevent other SDKs from
>>>>>>>>> "understanding" it as above (unless we had a way of declaring it as
>>>>>>>>> "just an alias/wrapper").
>>>>>>>>>
>>>>>>>>
>>>>>>>> I didn't understand the example I snipped, but I think I understand
>>>>>>>> your concern here. Is this what you want? (a) something presented as
>>>>>>>> a POJO in Java (b) encoded to a row, but still decoded to the POJO
>>>>>>>> and (c) non-Java SDK knows that it is "just a struct" so it is safe
>>>>>>>> to mess about with or even create new ones. If this is what you want
>>>>>>>> it seems potentially useful, but also easy to live without.
>>>>>>>> This can also be done entirely
>>>>>>>> within the Java SDK via conversions, leaving no logical type in the
>>>>>>>> portable pipeline.
>>>>>>>>
>>>>>>>
>>>>>>> I'm imagining a world where someone defines a PTransform that takes a
>>>>>>> POJO for a constructor, and consumes and produces a POJO, and is now
>>>>>>> usable from Go with no additional work on the PTransform author's
>>>>>>> part. But maybe I'm thinking about this wrong and the POJO <-> Row
>>>>>>> conversion is part of the @ProcessElement magic, not encoded in the
>>>>>>> schema itself.
>>>>>>>
>>>>>>
>>>>>> The user's output would have to be explicitly schema. They would
>>>>>> somehow have to tell Beam to infer a schema from the output POJO (e.g.
>>>>>> one way to do this is to annotate the POJO with the @DefaultSchema
>>>>>> annotation). We don't currently magically turn a POJO into a schema
>>>>>> unless we are asked to do so.
>>>>>>
>>>>>
