+1
The format to store coders is not set in stone, it was a first version
to make external configuration work. Using the Coder message would be
better.
As for using Schema to store the configuration, could somebody fill me
in how that would work?
-Max
On 04.08.20 02:01, Brian Hulette wrote:
I've opened BEAM-10571 [1] for this, and I'm most of the way to an
implementation now. Aiming to have it done before the 2.24.0 cut since
it will be the last release with python 2 support.
[1] https://issues.apache.org/jira/browse/BEAM-10571
On Wed, Jul 15, 2020 at 9:03 AM Chamikara Jayalath <[email protected]
<mailto:[email protected]>> wrote:
On Fri, Jul 10, 2020 at 4:47 PM Robert Bradshaw <[email protected]
<mailto:[email protected]>> wrote:
On Fri, Jul 10, 2020 at 4:36 PM Brian Hulette
<[email protected] <mailto:[email protected]>> wrote:
>
> Ah yes I'm +1 for that approach too - it would let us
leverage all the schema-inference already in the Java SDK for
translating configuration objects which would be great.
> Things on the Python side would be trickier as schemas don't
formally support all the types you can use in the PayloadBuilder
implementations [1] yet, just NamedTuple. For now we could just
make the PayloadBuilder implementations generate Rows without
making that translation available for use in PCollections.
This will be a good opportunity to add some sort of a minimal Python
type to Beam schema mapping :)
Yes, though eventually it might be nice to support all of these
various types as schema'd PCollection elements as well.
> Do we need to worry about update compatibility for
ExternalConfigurationPayload?
Technically, each URN defines their payload, and the fact that we've
settled on ExternalConfigurationPayload is a convention. On a
practical note, we haven't declared these protos stable yet. (I
would
like to do so before we drop support for Python 2, as external
transforms are a possible escape hatch and the first strong
motivation
to have external transforms that span Beam versions).
+1
> [1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
>
> On Fri, Jul 10, 2020 at 4:23 PM Robert Bradshaw
<[email protected] <mailto:[email protected]>> wrote:
>>
>> I would be in favor of just using a schema to store the entire
>> configuration. The reason we went with what we have to day
is that we
>> didn't have cross language schemas yet.
>>
>> On Fri, Jul 10, 2020 at 12:24 PM Brian Hulette
<[email protected] <mailto:[email protected]>> wrote:
>> >
>> > Hi everyone,
>> > I noticed that currently the ExternalConfigurationPayload
uses a list of coder URNs to represent the coder that was used
to serialize each configuration field [1]. This seems acceptable
at first blush, but there's one notable issue: it has no place
to store a payload for the coder. Most standard coders don't use
a payload so it's not a problem, but row coder does use a
payload to store it's schema, which means it can't be used in an
ExternalConfigurationPayload today.
>> >
>> > Is there a reason not to just use the Coder message [2] in
ExternalConfigurationPayload instead of a list of coder URNs?
That would work with row coder, and it would also make it easier
to re-use logic for translating Pipeline protos.
>> >
>> > I'd be happy to make this change, but I wanted to ask on
dev@ in case there's something I'm missing here.
>> >
>> > Brian
>> >
>> > [1]
https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/external_transforms.proto#L34-L35
>> > [2]
https://github.com/apache/beam/blob/c54a0b7f49f2eb4a15df115205e2fa455116ccbe/model/pipeline/src/main/proto/beam_runner_api.proto#L542-L555