Hi All,
I am currently investigating making the Python DataflowRunner use a
portable pipeline representation so that we can eventually get rid of the
Pipeline(runner) weirdness.
Along the way, I have a number of questions about the Python DataflowRunner:
*PValueCache*
- Why does this exist?
*DataflowRunner*
- I see that the DataflowRunner defines some PTransforms as
runner-specific primitives by returning a PCollection.from_(...) in its apply_
methods. Then, in the corresponding run_ methods, it references the
PValueCache to add steps.
- How does this add steps?
- Where does it cache the values to?
- How does the runner harness pick up these cached values to create
new steps?
- How is this information communicated to the runner harness?
- Why do the following transforms need to be overridden: GroupByKey,
WriteToBigQuery, CombineValues, Read?
- Why doesn't the ParDo transform need to be overridden? I see that it
has a run_ method but no apply_ method.
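To make sure I'm asking about the right mechanism, here is a toy sketch of how I currently understand the apply_/run_ dispatch and the cache interaction. The classes below (ToyCache, ToyRunner, etc.) are stand-ins I made up for illustration, not the real Beam classes; my reading is that the runner looks up apply_<TransformName> at construction time to claim a transform as a primitive, and run_<TransformName> at run time to turn the cached output into a step:

```python
class ToyCache:
    """Stand-in for PValueCache: maps a transform to its produced value."""
    def __init__(self):
        self._cache = {}

    def cache_output(self, transform, value):
        self._cache[transform] = value

    def get_pvalue(self, transform):
        return self._cache[transform]


class ToyRunner:
    """Stand-in for DataflowRunner's name-based method dispatch."""
    def __init__(self):
        self.cache = ToyCache()
        self.steps = []

    def apply(self, transform, input_value):
        # Dispatch to apply_<Name> if defined; otherwise expand normally.
        method = getattr(self, 'apply_%s' % type(transform).__name__, None)
        if method is not None:
            return method(transform, input_value)
        return transform.expand(input_value)

    def run_transform(self, transform):
        # Dispatch to run_<Name> to emit a step for the primitive.
        getattr(self, 'run_%s' % type(transform).__name__)(transform)

    def apply_GroupByKey(self, transform, input_value):
        # Claim the transform as a runner primitive: return a placeholder
        # output (akin to PCollection.from_) instead of expanding it.
        output = ('pcollection-from', input_value)
        self.cache.cache_output(transform, output)
        return output

    def run_GroupByKey(self, transform):
        # At run time, recover the cached output and record a step.
        self.steps.append(
            ('GroupByKeyStep', self.cache.get_pvalue(transform)))


class GroupByKey:
    def expand(self, input_value):
        raise NotImplementedError  # never called; the runner intercepts it


runner = ToyRunner()
gbk = GroupByKey()
out = runner.apply(gbk, 'input-pcoll')
runner.run_transform(gbk)
```

If that picture is wrong (in particular, where the cached PValues actually live and how the runner harness sees them), corrections welcome.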
*Possible fixes*
I was thinking of getting rid of the apply_ and run_ methods and replacing
those with a PTransformOverride and a simple PipelineVisitor, respectively.
Is this feasible? Am I missing any assumptions that would make it
infeasible?
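To make the proposal concrete, here is a rough sketch of the shape I have in mind, using toy stand-ins rather than the real apache_beam classes (the method names matches/get_replacement_transform and visit_transform follow the PTransformOverride and PipelineVisitor interfaces as I understand them, but everything else here is hypothetical):

```python
class GroupByKey:
    """Toy transform standing in for the user-facing GroupByKey."""


class DataflowGroupByKey(GroupByKey):
    """Toy runner-specific replacement transform."""


class AppliedPTransform:
    """Toy node in the pipeline graph."""
    def __init__(self, full_label, transform):
        self.full_label = full_label
        self.transform = transform


class GroupByKeyOverride:
    """Construction-time replacement, in the spirit of PTransformOverride,
    instead of an apply_GroupByKey method on the runner."""
    def matches(self, applied_ptransform):
        return type(applied_ptransform.transform) is GroupByKey

    def get_replacement_transform(self, ptransform):
        return DataflowGroupByKey()


class StepBuildingVisitor:
    """Run-time traversal, in the spirit of PipelineVisitor, instead of
    run_ methods: each visited transform becomes one step."""
    def __init__(self):
        self.steps = []

    def visit_transform(self, transform_node):
        self.steps.append(
            (transform_node.full_label,
             type(transform_node.transform).__name__))


# Hypothetical flow: apply overrides first, then visit to build steps.
nodes = [AppliedPTransform('ReadInput', object()),
         AppliedPTransform('GBK', GroupByKey())]
override = GroupByKeyOverride()
for node in nodes:
    if override.matches(node):
        node.transform = override.get_replacement_transform(node.transform)

visitor = StepBuildingVisitor()
for node in nodes:
    visitor.visit_transform(node)
```

The appeal, to me, is that both pieces operate on the pipeline graph itself rather than on runner-method dispatch, which should carry over to a portable representation.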
Regards,
Sam