+1 for removing apply_*.

For the Java SDK, removing specialized intercepts was an important first step towards the portability framework. I wonder if there is a way for the Python SDK to leapfrog, taking advantage of some of the lessons that Java learned a bit more painfully.

Most pertinent, I think, is that if an SDK's role is to construct a pipeline and ship the proto to a runner (service), then overrides apply to a post-deserialization pipeline. The Java DirectRunner does a proto round-trip to avoid accidentally depending on things that are not really part of the pipeline. I would think this crisp abstraction enforcement would add even more value to Python.

Kenn

On Thu, Feb 1, 2018 at 5:21 PM, Charles Chen <[email protected]> wrote:

> In the Python DirectRunner, we currently use apply_* overrides to override
> the operation of the default .expand() operation for certain transforms.
> For example, GroupByKey has a special implementation in the DirectRunner,
> so we use an apply_* override hook to replace the implementation of
> GroupByKey.expand().
>
> However, this strategy has drawbacks. Because this override operation
> happens eagerly during graph construction, the pipeline graph is
> specialized and modified before a specific runner is bound to the
> pipeline's execution. This makes the pipeline graph non-portable and blocks
> full migration to using the Runner API pipeline representation in the
> DirectRunner.
>
> By contrast, the SDK's PTransformOverride mechanism allows the expression
> of matchers that operate on the unspecialized graph, replacing PTransforms
> as necessary to produce a DirectRunner-specialized pipeline graph for
> execution.
>
> I therefore propose to replace these eager apply_* overrides with
> PTransformOverrides that operate on the completely constructed graph.
>
> The JIRA issue is https://issues.apache.org/jira/browse/BEAM-3566, and
> I've prepared a candidate patch at
> https://github.com/apache/incubator-beam/pull/4529.
>
> Best,
> Charles
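
To make the two ideas in this thread concrete (matcher-based overrides applied to a fully constructed graph, and a serialization round-trip before the runner touches it), here is a minimal toy sketch. It does not use the Beam API: `Node`, `Override`, `round_trip`, `apply_overrides` and the `RunnerSpecificGroupByKey` kind are all hypothetical names invented for illustration, and JSON stands in for the Runner API proto.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical stand-in for one applied transform in a pipeline graph."""
    label: str
    kind: str
    parts: List["Node"] = field(default_factory=list)

    def to_dict(self):
        return {"label": self.label, "kind": self.kind,
                "parts": [p.to_dict() for p in self.parts]}

    @staticmethod
    def from_dict(d):
        return Node(d["label"], d["kind"],
                    [Node.from_dict(p) for p in d["parts"]])


@dataclass
class Override:
    """Hypothetical override: a matcher plus a replacement factory."""
    matches: Callable[[Node], bool]
    replace: Callable[[Node], Node]


def round_trip(root):
    # Serialize and deserialize (JSON here, standing in for the proto) so the
    # runner only sees what is actually part of the portable representation.
    return Node.from_dict(json.loads(json.dumps(root.to_dict())))


def apply_overrides(root, overrides):
    # Build a new, runner-specialized graph; the input graph is not mutated.
    for o in overrides:
        if o.matches(root):
            return o.replace(root)  # replaced; the old expansion is dropped
    return Node(root.label, root.kind,
                [apply_overrides(p, overrides) for p in root.parts])


# The SDK constructs an unspecialized graph with no runner in mind.
pipeline = Node("root", "Composite", [
    Node("Read", "Read"),
    Node("GBK", "GroupByKey"),
    Node("Write", "Write"),
])

# The runner declares its own replacement for GroupByKey.
runner_gbk = Override(
    matches=lambda n: n.kind == "GroupByKey",
    replace=lambda n: Node(n.label, "RunnerSpecificGroupByKey"),
)

# Specialization happens only at execution time, after the round-trip.
specialized = apply_overrides(round_trip(pipeline), [runner_gbk])
```

In this sketch the original `pipeline` stays portable, since only the copy handed to the runner is specialized, which is the property the eager apply_* hooks cannot provide.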
