Is there a guideline on how the name provided in `PCollection.apply(String name, PTransform<? super PCollection<Row>, PCollection<Row>> t)` is adopted by different runners? I suppose that should be the goal: to have a readable graph for all runners, instead of 'adjusting' the names just so the Dataflow runner's graph looks right.
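To make the question concrete, here is a minimal sketch (class and step names are made up, not from any existing code) of where those names come from. Whatever label we pass to `apply()` is what I'd hope every runner surfaces for that node:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class StepNameExample {
      public static void main(String[] args) {
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Each String passed to apply() becomes the label of that node in the
        // pipeline graph; runners may prefix it with enclosing composite names,
        // but ideally they keep the label readable rather than rewriting it.
        PCollection<String> words =
            pipeline.apply("CreateWords", Create.of("beam", "sql", "beam"));

        PCollection<KV<String, Long>> counts =
            words.apply("CountWords", Count.perElement());

        pipeline.run().waitUntilFinish();
      }
    }
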
On Thu, Jun 14, 2018 at 8:53 AM, Reuven Lax <[email protected]> wrote:

> There was a previous discussion about having generic attributes on
> PCollection. Maybe this is a good driving use case?
>
> On Wed, Jun 13, 2018 at 4:36 PM Kenneth Knowles <[email protected]> wrote:
>
>> Another thing to consider is that we might return something like a
>> "SqlPCollection" that is the PCollection<Row> plus additional metadata that
>> is useful to the shell / enumerable converter (such as if the PCollection
>> has a known finite size due to LIMIT, even if it is "unbounded", and the
>> shell can return control to the user once it receives enough rows). After
>> your proposed change this will be much more natural to do, so that's
>> another point in favor of the refactor.
>>
>> Kenn
>>
>> On Wed, Jun 13, 2018 at 10:22 AM Andrew Pilloud <[email protected]> wrote:
>>
>>> One of my goals is to make the graph easier to read and map back to the
>>> SQL EXPLAIN output. The way the graph is currently built (`toPTransform` vs
>>> `toPCollection`) does make a big difference in that graph. I think it is
>>> also important to have a common function to do the apply with consistent
>>> naming. I think that will greatly help with ease of understanding. It
>>> sounds like what we really want is this in the BeamRelNode interface:
>>>
>>> PInput buildPInput(Pipeline pipeline);
>>> PTransform<PInput, PCollection<Row>> buildPTransform();
>>>
>>> default PCollection<Row> toPCollection(Pipeline pipeline) {
>>>   return buildPInput(pipeline).apply(getStageName(), buildPTransform());
>>> }
>>>
>>> Andrew
>>>
>>> On Mon, Jun 11, 2018 at 2:27 PM Mingmin Xu <[email protected]> wrote:
>>>
>>>> EXPLAIN shows the execution plan from the SQL perspective only. After
>>>> converting to a Beam composite PTransform, there are more steps underneath;
>>>> each Runner re-organizes the Beam PTransforms again, which makes the final
>>>> pipeline hard to read. In the SQL module itself, I don't see any difference
>>>> between `toPTransform` and `toPCollection`. We could have an
>>>> easy-to-understand step name when converting RelNodes, but Runners show the
>>>> graph to developers.
>>>>
>>>> Mingmin
>>>>
>>>> On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud <[email protected]> wrote:
>>>>
>>>>> That sounds correct. And because each rel node might have a different
>>>>> input, there isn't a standard interface (like
>>>>> PTransform<PCollection<Row>, PCollection<Row>> toPTransform()).
>>>>>
>>>>> Andrew
>>>>>
>>>>> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles <[email protected]> wrote:
>>>>>
>>>>>> Agree with that. It will be kind of tricky to generalize. I think
>>>>>> there are some criteria in this case that might apply in other cases:
>>>>>>
>>>>>> 1. Each rel node (or construct of a DSL) should have a PTransform for
>>>>>> how it computes its result from its inputs.
>>>>>> 2. The inputs to that PTransform should actually be the inputs to the
>>>>>> rel node!
>>>>>>
>>>>>> So I tried to improve #1 but I probably made #2 worse.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin <[email protected]> wrote:
>>>>>>
>>>>>>> Not answering the original question, but doesn't "explain" satisfy
>>>>>>> the SQL use case?
>>>>>>>
>>>>>>> Going forward we probably want to solve this in a more general way.
>>>>>>> We have at least 3 ways to represent the pipeline:
>>>>>>> - how the runner executes it;
>>>>>>> - what it looks like when constructed;
>>>>>>> - what the user was describing in the DSL;
>>>>>>> And there will probably be more, if extra layers are built on top of
>>>>>>> the DSLs.
>>>>>>>
>>>>>>> If possible, we probably should be able to map any level of
>>>>>>> abstraction to any other to better understand and debug the pipelines.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles <[email protected]> wrote:
>>>>>>>
>>>>>>>> In other words, revert
>>>>>>>> https://github.com/apache/beam/pull/4705/files, at least in spirit?
>>>>>>>> I agree :-)
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> We are currently converting the Calcite Rel tree to Beam by
>>>>>>>>> recursively building a tree of nested PTransforms. This results in a
>>>>>>>>> weird nested graph in the Dataflow UI where each node contains its
>>>>>>>>> inputs nested inside of it. I'm going to change the internal data
>>>>>>>>> structure for converting the tree from a PTransform to a PCollection,
>>>>>>>>> which will result in a more accurate representation of the tree
>>>>>>>>> structure being built and should simplify the code as well. This will
>>>>>>>>> not change the public interface to SQL, which will remain a
>>>>>>>>> PTransform. Any thoughts or objections?
>>>>>>>>>
>>>>>>>>> I was also wondering if there are tools for visualizing the Beam
>>>>>>>>> graph aside from the Dataflow runner UI. What other tools exist?
>>>>>>>>>
>>>>>>>>> Andrew
>>>>>>>>>
>>>>
>>>>
>>>> --
>>>> ----
>>>> Mingmin
>>>>
>>>

--
----
Mingmin
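
For reference, here is Andrew's proposal from the thread above written out as a self-contained interface. This is only a sketch: the snippet in the thread uses PInput, but PInput itself has no apply() method, so I've narrowed the input type to PCollectionList<Row> here just to keep the example compilable, and getStageName() is an assumed helper rather than anything that exists today.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;
    import org.apache.beam.sdk.values.Row;
    import org.apache.calcite.rel.RelNode;

    /**
     * Sketch of the proposed refactor: each rel node builds its inputs and the
     * PTransform producing its output, and one shared default method applies
     * the transform with a consistent, readable stage name.
     */
    public interface BeamRelNode extends RelNode {

      /** Recursively converts this node's inputs into PCollections of Rows. */
      PCollectionList<Row> buildPInput(Pipeline pipeline);

      /** The transform that computes this node's rows from its inputs. */
      PTransform<PCollectionList<Row>, PCollection<Row>> buildPTransform();

      /** Single place that decides how every SQL node is named in the graph. */
      default PCollection<Row> toPCollection(Pipeline pipeline) {
        return buildPInput(pipeline).apply(getStageName(), buildPTransform());
      }

      /** Assumed helper: a readable, unique label such as "BeamCalcRel_42". */
      default String getStageName() {
        return getClass().getSimpleName() + "_" + getId();
      }
    }

Kenn's "SqlPCollection" idea would then presumably be a thin wrapper around the PCollection<Row> returned by toPCollection, carrying extra metadata such as a known row limit for the shell.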
