Re: [DISCUSS] Portability representation of schemas

Brian Hulette Mon, 17 Jun 2019 17:38:23 -0700

Realized I completely ignored one of your points, added another response
inline.


On Fri, Jun 14, 2019 at 2:20 AM Robert Bradshaw <[email protected]> wrote:

> On Thu, Jun 13, 2019 at 8:42 PM Reuven Lax <[email protected]> wrote:
> >
> > Spoke to Brian about his proposal. It is essentially this:
> >
> > We create PortableSchemaCoder, with a well-known URN. This coder is
> parameterized by the schema (i.e. list of field name -> field type pairs).
>
> Given that we have a field type that is (list of field names -> field
> type pairs), is there a reason to do this enumeration at the top level
> as well? This would likely also eliminate some of the strangeness
> where we want to treat a PCollection with a single-field row as a
> PCollection with just that value instead.
>

This is part of what I was suggesting in my "Root schema is a logical type"
alternative [1], except that the language about SDK-specific logical types
is now obsolete. I'll update it to better reflect this alternative.
I do think at the very least we should just have one (list of field names
-> field type pairs) that is re-used, which is what I did in my PR [2].

[1]
https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit#heading=h.7570feur1qin
[2]
https://github.com/apache/beam/pull/8853/files#diff-f0d64c2cfc4583bfe2a7e5ee59818ae2L686


>
> > Java also continues to have its own CustomSchemaCoder. This is
> parameterized by the schema as well as the to/from functions needed to make
> the Java API "nice."
> >
> > When the expansion service expands a Java PTransform for usage across
> languages, it will add a transform mapping the  PCollection with
> CustomSchemaCoder to a PCollection which has PortableSchemaCoder. This way
> Java can maintain the information needed to maintain its API (and Python
> can do the same), but there's no need to shove this information into the
> well-known portable representation.
> >
> > Brian, can you confirm that this was your proposal? If so, I like it.
>
> The major downside of this that I see is that it assumes that
> transparency is only needed at certain "boundaries" and everything
> between these boundaries is opaque. I think we'd be better served by a
> format where schemas are transparently represented throughout. For
> example, the "boundaries" between runner and SDK are not known at
> pipeline construction time, and we want the runner <-> SDK
> communication to understand the schemas to be able to use more
> efficient transport mechanisms (e.g. batches of arrow records). It may
> also be common for a pipeline in language X to invoke two transforms
> in language Y in succession (e.g. two SQL statements) in which case
> introducing two extra transforms in the expansion service would be
> wasteful. I also think we want to allow the flexibility for runners to
> swap out transforms an optimizations regardless of construction-time
> boundaries (e.g. implementing a projection natively, rather than
> outsourcing to the SDK).
>
> Are the to/from conversion functions the only extra information needed
> to make the Java APIs nice? If so, can they be attached to the
> operations themselves (where it seems they're actually needed/used),
> rather than to the schema/coder of the PCollection? Alternatively, I'd
> prefer this be opaque metadata attached to a transparent schema rather
> than making the whole schema opaque.
>
> > We've gone back and forth discussing abstracts for over a month now. I
> suggest that the next step should be to create a PR, and move discussion to
> that PR. Having actual code can often make discussion much more concrete.
>
> +1 to a PR, though I feel like there are fundamental high-level issues
> that are still not decided. (I suppose we should be open to throwing
> whole PRs away in that case.) There are certainly pieces that we'll
> know that we need (like the ability to serialize a row consistently in
> all languages) we can get in immediately.
>
> > Reuven
> >
> > On Thu, Jun 13, 2019 at 6:28 AM Robert Bradshaw <[email protected]>
> wrote:
> >>
> >> On Thu, Jun 13, 2019 at 5:47 AM Reuven Lax <[email protected]> wrote:
> >>>
> >>>
> >>> On Wed, Jun 12, 2019 at 8:29 PM Kenneth Knowles <[email protected]>
> wrote:
> >>>>
> >>>> Can we choose a first step? I feel there's consensus around:
> >>>>
> >>>>  - the basic idea of what a schema looks like, ignoring logical types
> or SDK-specific bits
> >>>>  - the version of logical type which is a standardized URN+payload
> plus a representation
> >>>>
> >>>> Perhaps we could commit this and see what it looks like to try to use
> it?
> >>
> >>
> >> +1
> >>
> >>>>
> >>>> It also seems like there might be consensus around the idea of each
> of:
> >>>>
> >>>>  - a coder that simply encodes rows; its payload is just a schema; it
> is minimalist, canonical
> >>>>
> >>>>  - a coder that encodes a non-row using the serialization format of a
> row; this has to be a coder (versus Convert transforms) so that to/from row
> conversions can be elided when primitives are fused (just like to/from
> bytes is elided)
> >>
> >>
> >> So, to make it concrete, in the Beam protos we would have an
> [Elementwise]SchemaCoder whose single parameterization would be FieldType,
> whose definition is in terms of URN + payload + components (+
> representation, for non-primitive types, some details TBD there). It could
> be deserialized into various different Coder instances (an SDK
> implementation detail) in an SDK depending on the type. One of the most
> important primitive field types is Row (aka Struct).
> >>
> >> We would define a byte encoding for each primitive type. We *could*
> choose to simply require that the encoding of any non-row primitive is the
> same as its encoding in a single-member row, but that's not necessary.
> >>
> >> In the short term, the window/timestamp/pane info would still live
> outside via an enclosing WindowCoder, as it does now, not blocking on a
> desirable but still-to-be-figured-out unification at that level.
> >>
> >> This seems like a good path forward.
> >>
> >>> Actually this doesn't make sense to me. I think from the portability
> perspective, all we have is schemas - the rest is just a convenience for
> the SDK. As such, I don't think it makes sense at all to model this as a
> Coder.
> >>
> >>
> >> Coder and Schemas are mutually exclusive on PCollections, and
> completely specify type information, so I think it makes sense to reuse
> this (as we're currently doing) until we can get rid of coders altogether.
> >>
> >> (At execution time, we would generalize the notion of a coder to
> indicate how *batches* of elements are encoded, not just how individual
> elements are encoded. Here we have the option of letting the runner pick
> depending on the use (e.g. elementwise for key lookups vs. arrow for bulk
> data channel transfer vs ???, possibly with parameters like "preferred
> batch size") or standardizing on one physical byte representation for all
> communication over the boundary.)
> >>
> >>>
> >>>
> >>>>
> >>>>
> >>>> Can we also just have both of these, with different URNs?
> >>>>
> >>>> Kenn
> >>>>
> >>>> On Wed, Jun 12, 2019 at 3:57 PM Reuven Lax <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Jun 12, 2019 at 3:46 PM Robert Bradshaw <[email protected]>
> wrote:
> >>>>>>
> >>>>>> On Tue, Jun 11, 2019 at 8:04 PM Kenneth Knowles <[email protected]>
> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> I believe the schema registry is a transient construction-time
> concept. I don't think there's any need for a concept of a registry in the
> portable representation.
> >>>>>>>
> >>>>>>>> I'd rather urn:beam:schema:logicaltype:javasdk not be used
> whenever one has (say) a Java POJO as that would prevent other SDKs from
> "understanding" it as above (unless we had a way of declaring it as "just
> an alias/wrapper").
> >>>>>>>
> >>>>>>>
> >>>>>>> I didn't understand the example I snipped, but I think I
> understand your concern here. Is this what you want? (a) something
> presented as a POJO in Java (b) encoded to a row, but still decoded to the
> POJO and (c) non-Java SDK knows that it is "just a struct" so it is safe to
> mess about with or even create new ones. If this is what you want it seems
> potentially useful, but also easy to live without. This can also be done
> entirely within the Java SDK via conversions, leaving no logical type in
> the portable pipeline.
> >>>>>>
> >>>>>>
> >>>>>> I'm imaging a world where someone defines a PTransform that takes a
> POJO for a constructor, and consumes and produces a POJO, and is now usable
> from Go with no additional work on the PTransform author's part.  But maybe
> I'm thinking about this wrong and the POJO <-> Row conversion is part of
> the @ProcesssElement magic, not encoded in the schema itself.
> >>>>>
> >>>>>
> >>>>> The user's output would have to be explicitly schema. They would
> somehow have to tell Beam the infer a schema from the output POJO (e.g. one
> way to do this is to annotate the POJO with the @DefaultSchema
> annotation).  We don't currently magically turn a POJO into a schema unless
> we are asked to do so.
>

Re: [DISCUSS] Portability representation of schemas

Reply via email to