What I'm working on changes ExternalConfigurationPayload [1] to this:
message ExternalConfigurationPayload {
// A schema for use in beam:coder:row:v1
Schema schema = 1;
// A payload which can be decoded using beam:coder:row:v1 and the given
schema.
bytes payload = 2;
}
The calling SDK can infer a schema from a user configuration type (in
Python we can make minor changes to SchemaBasedPayloadBuilder for this),
and use it's implementation of beam:coder:row:v1 to encode an instance of
that type to the payload.
Similarly the expanding SDK can infer a schema from a user configuration
type and map the encoded row to an instance of the user type, assuming the
schemas are compatible.
Brian
[1]
https://github.com/apache/beam/blob/86b8326b4ebc4e217585847102743cc1d1af367a/model/pipeline/src/main/proto/external_transforms.proto#L42
On Wed, Aug 5, 2020 at 2:04 AM Maximilian Michels <[email protected]> wrote:
> +1
>
> The format to store coders is not set in stone, it was a first version
> to make external configuration work. Using the Coder message would be
> better.
>
> As for using Schema to store the configuration, could somebody fill me
> in how that would work?
>
> -Max
>
> On 04.08.20 02:01, Brian Hulette wrote:
> > I've opened BEAM-10571 [1] for this, and I'm most of the way to an
> > implementation now. Aiming to have it done before the 2.24.0 cut since
> > it will be the last release with python 2 support.
> >
> > [1] https://issues.apache.org/jira/browse/BEAM-10571
> >
> > On Wed, Jul 15, 2020 at 9:03 AM Chamikara Jayalath <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >
> >
> > On Fri, Jul 10, 2020 at 4:47 PM Robert Bradshaw <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> > On Fri, Jul 10, 2020 at 4:36 PM Brian Hulette
> > <[email protected] <mailto:[email protected]>> wrote:
> > >
> > > Ah yes I'm +1 for that approach too - it would let us
> > leverage all the schema-inference already in the Java SDK for
> > translating configuration objects which would be great.
> > > Things on the Python side would be trickier as schemas don't
> > formally support all the types you can use in the PayloadBuilder
> > implementations [1] yet, just NamedTuple. For now we could just
> > make the PayloadBuilder implementations generate Rows without
> > making that translation available for use in PCollections.
> >
> >
> > This will be a good opportunity to add some sort of a minimal Python
> > type to Beam schema mapping :)
> >
> >
> > Yes, though eventually it might be nice to support all of these
> > various types as schema'd PCollection elements as well.
> >
> > > Do we need to worry about update compatibility for
> > ExternalConfigurationPayload?
> >
> > Technically, each URN defines their payload, and the fact that
> we've
> > settled on ExternalConfigurationPayload is a convention. On a
> > practical note, we haven't declared these protos stable yet. (I
> > would
> > like to do so before we drop support for Python 2, as external
> > transforms are a possible escape hatch and the first strong
> > motivation
> > to have external transforms that span Beam versions).
> >
> >
> > +1
> >
> >
> > > [1]
> >
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
> > >
> > > On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw
> > <[email protected] <mailto:[email protected]>> wrote:
> > >>
> > >> I would be in favor of just using a schema to store the
> entire
> > >> configuration. The reason we went with what we have to day
> > is that we
> > >> didn't have cross language schemas yet.
> > >>
> > >> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette
> > <[email protected] <mailto:[email protected]>> wrote:
> > >> >
> > >> > Hi everyone,
> > >> > I noticed that currently the ExternalConfigurationPayload
> > uses a list of coder URNs to represent the coder that was used
> > to serialize each configuration field [1]. This seems acceptable
> > at first blush, but there's one notable issue: it has no place
> > to store a payload for the coder. Most standard coders don't use
> > a payload so it's not a problem, but row coder does use a
> > payload to store it's schema, which means it can't be used in an
> > ExternalConfigurationPayload today.
> > >> >
> > >> > Is there a reason not to just use the Coder message [2] in
> > ExternalConfigurationPayload instead of a list of coder URNs?
> > That would work with row coder, and it would also make it easier
> > to re-use logic for translating Pipeline protos.
> > >> >
> > >> > I'd be happy to make this change, but I wanted to ask on
> > dev@ in case there's something I'm missing here.
> > >> >
> > >> > Brian
> > >> >
> > >> > [1]
> >
> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/external_transforms.proto#L34-L35
> > >> > [2]
> >
> https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/beam_runner_api.proto#L542-L555
> >
>