Re: XLang sub-graph representation within the SDKs pipeline types

Robert Burke Wed, 01 Jul 2020 19:19:08 -0700

>From the Go SDK side, it was built that way nearly from the start.
Historically there was a direct SDK rep -> Dataflow rep conversion, but
that's been replaced with a SDK rep -> Beam Proto -> Dataflow rep
conversion.

In particular, this approach had a few benefits: easier to access local
context for pipeline validation at construction time, to permit as early a
failure as possible, which might be easier with native language constructs
vs beam representations of them.(Eg. DoFns not matching ParDo & Collection
types, and similar)
Protos are convenient, but impose certain structure on how the pipeline
graph is handled. (This isn't to say an earlier conversion isn't possible,
one can do almost anything in code, but it lets the structure be optimised
for this case.)

The big advantage of translating from Beam proto -> to Dataflow Rep is that
the Dataflow Rep can get the various unique IDs that are mandated for the
Beam proto process.

However, the same can't really be said for the other way around.  A good
question is "when should the unique IDs be assigned?"

While I'm not working on adding XLang to the Go SDK directly (that would be
our wonderful intern, Kevin),  I've kind of pictured that the process was
to provide the Expansion service with unique placeholders if unable to
provide the right IDs, and substitute them in returned pipeline graph
segment afterwards, once that is known. That is, we can be relatively
certain that the expansion service will be self consistent, but it's the
SDK requesting the expansion's responsibility to ensure they aren't
colliding with the primary SDKs pipeline ids.

Otherwise, we could probably recommend a translation protocol (if one
doesn't exist already, it probably does) and when XLang expansions are to
happen in the SDK -> beam proto process. So something like Pass 1, intern
all coders and Pcollections, Pass 2 intern all DoFns and environments, Pass
3 expand Xlang, ... Etc.

The other half of this is when happens when Going from Beam proto a
-> SDK? This happens during pipeline execution, but at least in the Go SDK
partly happens when creating the Dataflow rep. In particular, Coder
reference values only have a populated ID when they've been "rehydrated"
from the Beam proto, since the Beam Proto is the first place where such IDs
are correctly assigned.

Tl;dr; i think the right question to sort out is when should IDs be
expected to be assigned and available during pipeline construction.

On Wed, Jul 1, 2020, 6:34 PM Luke Cwik <[email protected]> wrote:

> It seems like we keep running into translation issues with XLang due to
> how it is represented in the SDK. (e.g. Brian's work on context map due to
> loss of coder ids, Heejong's work related to missing environment ids on
> windowing strategies).
>
> I understand that there is an effort that is Dataflow specific where the
> conversion of the Beam proto -> Dataflow API (v1b3) will help with some
> issues but it still requires the SDK pipeline representation -> Beam proto
> to occur correctly which won't be fixed by the Dataflow specific effort.
>
> Why did we go with the current approach?
>
> What other ways could we do this?
>

Re: XLang sub-graph representation within the SDKs pipeline types

Reply via email to