Hi All,
I am currently investigating making the Python DataflowRunner use a
portable pipeline representation so that we can eventually get rid of the
Pipeline(runner) weirdness.
Along the way, I have a number of questions about the Python DataflowRunner:
*PValueCache*
- Why does this exist?
*DataflowRunner*
- I see that the DataflowRunner defines some PTransforms as
runner-specific primitives by returning a PCollection.from_(...) in its apply_
methods. Then, in the corresponding run_ methods, it references the
PValueCache to add steps.
- How does this add steps?
- Where does it cache the values to?
- How does the runner harness pick up these cached values to create
new steps?
- How is this information communicated to the runner harness?
- Why do the following transforms need to be overridden: GroupByKey,
WriteToBigQuery, CombineValues, Read?
- Why doesn't the ParDo transform need to be overridden? I see that it
has a run_ method but no apply_ method.
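To make sure I'm asking about the right mechanism, here is a toy sketch of how I currently understand the apply_/run_ dispatch and the cache interaction. The classes below (ToyCache, ToyRunner, etc.) are stand-ins I made up for illustration, not the real Beam classes; my reading is that the runner looks up apply_<TransformName> at construction time to claim a transform as a primitive, and run_<TransformName> at run time to turn the cached output into a step:

```python
class ToyCache:
    """Stand-in for PValueCache: maps a transform to its produced value."""
    def __init__(self):
        self._cache = {}

    def cache_output(self, transform, value):
        self._cache[transform] = value

    def get_pvalue(self, transform):
        return self._cache[transform]


class ToyRunner:
    """Stand-in for DataflowRunner's name-based method dispatch."""
    def __init__(self):
        self.cache = ToyCache()
        self.steps = []

    def apply(self, transform, input_value):
        # Dispatch to apply_<Name> if defined; otherwise expand normally.
        method = getattr(self, 'apply_%s' % type(transform).__name__, None)
        if method is not None:
            return method(transform, input_value)
        return transform.expand(input_value)

    def run_transform(self, transform):
        # Dispatch to run_<Name> to emit a step for the primitive.
        getattr(self, 'run_%s' % type(transform).__name__)(transform)

    def apply_GroupByKey(self, transform, input_value):
        # Claim the transform as a runner primitive: return a placeholder
        # output (akin to PCollection.from_) instead of expanding it.
        output = ('pcollection-from', input_value)
        self.cache.cache_output(transform, output)
        return output

    def run_GroupByKey(self, transform):
        # At run time, recover the cached output and record a step.
        self.steps.append(
            ('GroupByKeyStep', self.cache.get_pvalue(transform)))


class GroupByKey:
    def expand(self, input_value):
        raise NotImplementedError  # never called; the runner intercepts it


runner = ToyRunner()
gbk = GroupByKey()
out = runner.apply(gbk, 'input-pcoll')
runner.run_transform(gbk)
```

If that picture is wrong (in particular, where the cached PValues actually live and how the runner harness sees them), corrections welcome.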
*Possible fixes*
I was thinking of getting rid of the apply_ and run_ methods and replacing
those with a PTransformOverride and a simple PipelineVisitor, respectively.
Is this feasible? Am I missing any assumptions that would make it
infeasible?
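To make the proposal concrete, here is a rough sketch of the shape I have in mind, using toy stand-ins rather than the real apache_beam classes (the method names matches/get_replacement_transform and visit_transform follow the PTransformOverride and PipelineVisitor interfaces as I understand them, but everything else here is hypothetical):

```python
class GroupByKey:
    """Toy transform standing in for the user-facing GroupByKey."""


class DataflowGroupByKey(GroupByKey):
    """Toy runner-specific replacement transform."""


class AppliedPTransform:
    """Toy node in the pipeline graph."""
    def __init__(self, full_label, transform):
        self.full_label = full_label
        self.transform = transform


class GroupByKeyOverride:
    """Construction-time replacement, in the spirit of PTransformOverride,
    instead of an apply_GroupByKey method on the runner."""
    def matches(self, applied_ptransform):
        return type(applied_ptransform.transform) is GroupByKey

    def get_replacement_transform(self, ptransform):
        return DataflowGroupByKey()


class StepBuildingVisitor:
    """Run-time traversal, in the spirit of PipelineVisitor, instead of
    run_ methods: each visited transform becomes one step."""
    def __init__(self):
        self.steps = []

    def visit_transform(self, transform_node):
        self.steps.append(
            (transform_node.full_label,
             type(transform_node.transform).__name__))


# Hypothetical flow: apply overrides first, then visit to build steps.
nodes = [AppliedPTransform('ReadInput', object()),
         AppliedPTransform('GBK', GroupByKey())]
override = GroupByKeyOverride()
for node in nodes:
    if override.matches(node):
        node.transform = override.get_replacement_transform(node.transform)

visitor = StepBuildingVisitor()
for node in nodes:
    visitor.visit_transform(node)
```

The appeal, to me, is that both pieces operate on the pipeline graph itself rather than on runner-method dispatch, which should carry over to a portable representation.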
Regards,
Sam