On Mon, Mar 11, 2019 at 4:37 PM Maximilian Michels <m...@apache.org> wrote:
>
> > Just to clarify. What's the reason for including a PROPERTIES enum here 
> > instead of directly making beam_urn a field of ExternalTransformPayload ?
>
> The URN is supposed to be static. We always use the same URN for this
> type of external transform. We probably want an additional identifier to
> point to the resource we want to configure.

It does feel odd to not use the URN to specify the transform itself,
and embed the true identity in an inner proto. The notion of
"external" is just how it happens to be invoked in this pipeline, not
part of its intrinsic definition. As we want introspection
capabilities in the service, we should be able to use the URN at a top
level and know what kind of payload it expects. I would also like to
see this kind of information populated for non-external transforms, which
could be good for visibility (substitution, visualization, etc.) for
runners and other pipeline-consuming tools.

> Like so:
>
> message ExternalTransformPayload {
>    enum Enum {
>      PROPERTIES = 0
>          [(beam_urn) = "beam:external:transform:external_transform:v1"];
>    }
>    // A fully-qualified identifier, e.g. Java package + class
>    string identifier = 1;

I'd rather the identifier have semantic rather than
implementation-specific meaning. e.g. one could imagine multiple
implementations of a given transform that different services could
offer.

>    // the format may change to map<string, bytes> if types are supported
>    map<string, string> parameters = 2;
> }
>
> The identifier could also be a URN.
>
> > Can we change first version to map<string, bytes> ? Otherwise the set of 
> > transforms we can support/test will be very limited.
>
> How do we do that? Do we define a set of standard coders for supported
> types? On the Java side we can look up the coder by extracting the field
> from the POJO, but we can't do that in Python.
>
> > Can we re-use some of the Beam schemas-related work/utilities here ?
>
> Yes, that was the plan.

On this note, Reuven, what is the plan (and timeline) for a
language-independent representation of schemas? The crux of the
problem is that the user needs to specify some kind of configuration
(call it C) to construct the transform (call it T). This would be
handled by a TransformBuilder<C, T> that provides (at least) a mapping
C -> T. (Possibly this interface could be offered on the transform
itself.)
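
To make the shape of that interface concrete, here is a minimal sketch in
Python. Everything in it is hypothetical: TransformBuilder, PrefixConfig,
PrefixTransform and PrefixTransformBuilder are illustrative names, not
existing Beam classes, and the "transform" is just a stand-in object
rather than a real PTransform.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

C = TypeVar("C")  # configuration type
T = TypeVar("T")  # transform type


class TransformBuilder(Generic[C, T]):
    """Hypothetical interface: provides (at least) a mapping C -> T."""

    def build(self, config: C) -> T:
        raise NotImplementedError


@dataclass
class PrefixConfig:
    """Hypothetical configuration C: a bag of named, typed fields."""
    prefix: str
    repeat: int


class PrefixTransform:
    """Stand-in for a transform T; not a real Beam PTransform."""

    def __init__(self, prefix: str, repeat: int):
        self._prefix = prefix * repeat

    def apply(self, element: str) -> str:
        return self._prefix + element


class PrefixTransformBuilder(TransformBuilder[PrefixConfig, PrefixTransform]):
    """Constructs the transform from its configuration."""

    def build(self, config: PrefixConfig) -> PrefixTransform:
        return PrefixTransform(config.prefix, config.repeat)


transform = PrefixTransformBuilder().build(PrefixConfig(prefix="ab", repeat=2))
```

The builder is kept separate from the transform here only for
illustration; the same build() method could live on the transform itself.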

The question we are trying to answer here is how to represent C, in
both the source and target language, and on the wire. The idea is that
we could leverage the schema infrastructure such that C could be a
POJO in Java (and perhaps a dict in Python). We would want to extend
Schemas and Row (or perhaps a sub/super/sibling class thereof) to
allow for Coder and UDF-typed fields. (Exactly how to represent UDFs
is still very TBD.) The payload for an external transform using this
format would be the tuple (schema, SchemaCoder(schema).encode(C)). The
goal is to not, yet again, invent a cross-language way of defining a
bag of named, typed parameters (aka fields) with language-idiomatic
mappings and some introspection capabilities, and significantly less
heavy-weight than users defining their own protos (plus generating
bindings to all languages).
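
As a rough illustration of the (schema, encoded C) payload, here is a toy
sketch in Python. It does not use Beam's actual schema or SchemaCoder
classes: the schema is assumed to be an ordered list of (field name, type
name) pairs, the wire format (length-prefixed fields) is made up for this
example, and infer_schema, encode_row, decode_row and make_payload are
hypothetical helper names. Coder- and UDF-typed fields are left out.

```python
import struct
from typing import Any, Dict, List, Tuple

# Toy schema: ordered (field name, type name) pairs. An assumption for
# this sketch, not Beam's schema representation.
Schema = List[Tuple[str, str]]

_TYPE_NAMES = {str: "string", int: "int64", bytes: "bytes"}


def infer_schema(config: Dict[str, Any]) -> Schema:
    """Derive a schema from a dict, with fields sorted by name."""
    return [(name, _TYPE_NAMES[type(value)])
            for name, value in sorted(config.items())]


def encode_row(schema: Schema, config: Dict[str, Any]) -> bytes:
    """Encode values in schema order, each with a 4-byte length prefix."""
    out = bytearray()
    for name, type_name in schema:
        value = config[name]
        if type_name == "int64":
            raw = struct.pack(">q", value)
        elif type_name == "string":
            raw = value.encode("utf-8")
        else:
            raw = bytes(value)
        out += struct.pack(">I", len(raw)) + raw
    return bytes(out)


def decode_row(schema: Schema, data: bytes) -> Dict[str, Any]:
    """Inverse of encode_row: the schema tells the reader how to decode."""
    result: Dict[str, Any] = {}
    offset = 0
    for name, type_name in schema:
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        raw = data[offset:offset + length]
        offset += length
        if type_name == "int64":
            (result[name],) = struct.unpack(">q", raw)
        elif type_name == "string":
            result[name] = raw.decode("utf-8")
        else:
            result[name] = raw
    return result


def make_payload(config: Dict[str, Any]) -> Tuple[Schema, bytes]:
    """Build the (schema, encoded configuration) tuple."""
    schema = infer_schema(config)
    return schema, encode_row(schema, config)
```

The point of shipping the schema alongside the bytes is that the
receiving side can decode the configuration without any
transform-specific generated bindings.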

Does this seem a reasonable use of schemas?
