In the Python DirectRunner, we currently use apply_* overrides to override
the operation of the default .expand() operation for certain transforms.
For example, GroupByKey has a special implementation in the DirectRunner,
so we use an apply_* override hook to replace the implementation of
GroupByKey.expand().

However, this strategy has drawbacks. Because this override operation
happens eagerly during graph construction, the pipeline graph is
specialized and modified before a specific runner is bound to the
pipeline's execution. This makes the pipeline graph non-portable and blocks
full migration to using the Runner API pipeline representation in the
DirectRunner.

By contrast, the SDK's PTransformOverride mechanism allows the expression
of matchers that operate on the unspecialized graph, replacing PTransforms
as necessary to produce a DirectRunner-specialized pipeline graph for
execution.

I therefore propose to replace these eager apply_* overrides with
PTransformOverrides that operate on the completely constructed graph.

The JIRA issue is https://issues.apache.org/jira/browse/BEAM-3566, and I've
prepared a candidate patch at
https://github.com/apache/incubator-beam/pull/4529.

Best,
Charles

Reply via email to