On Tue, Mar 31, 2020 at 12:06 PM Sam Rohde <sro...@google.com> wrote:

> Hi All,
>
> I am currently investigating making the Python DataflowRunner to use a
> portable pipeline representation so that we can eventually get rid of the
> Pipeline(runner) weirdness.
>
> In that case, I have a lot questions about the Python DataflowRunner:
>
> *PValueCache*
>
>    - Why does this exist?
>
> This is historical baggage from the (long gone) first direct runner when
actual computed PCollections were cached, and the DataflowRunner inherited
it.


> *DataflowRunner*
>
>    - I see that the DataflowRunner defines some PTransforms as
>    runner-specific primitives by returning a PCollection.from_(...) in apply_
>    methods. Then in the run_ methods, it references the PValueCache to add
>    steps.
>       - How does this add steps?
>       - Where does it cache the values to?
>       - How does the runner harness pick up these cached values to create
>       new steps?
>       - How is this information communicated to the runner harness?
>    - Why do the following transforms need to be overridden: GroupByKey,
>    WriteToBigQuery, CombineValues, Read?
>
> Each of these four has a different implementation on Dataflow.

>
>    - Why doesn't the ParDo transform need to be overridden? I see that it
>    has a run_ method but no apply_ method.
>
> apply_ is called at pipeline construction time, all of these should be
replaced by PTransformOverrides. run_ is called after pipeline construction
to actually build up the dataflow graph.


> *Possible fixes*
> I was thinking of getting rid of the apply_ and run_ methods and replacing
> those with a PTransformOverride and a simple PipelineVisitor, respectively.
> Is this feasible? Am I missing any assumptions that don't make this
> feasible?
>

If we're going to overhaul how the runner works, it would be best to make
DataflowRunner direct a translator from Beam runner api protos to Cloudv1b3
protos, rather than manipulate the intermediate Python representation
(which no one wants to change for fear of messing up DataflowRunner and
cause headaches for cross langauge).

Reply via email to