Re: Building and visualizing the Beam SQL graph

Andrew Pilloud Mon, 18 Jun 2018 14:47:32 -0700

The name can't just be a constant string as it needs to be unique for every
node in the graph. We currently build the name as the rel node class name
plus the unique ID from the calcite graph (plus a unique ID from the beam
graph, but that goes away with some bug fixes).


I've written this up into a PR. https://github.com/apache/beam/pull/5673

Andrew

On Fri, Jun 15, 2018 at 7:51 AM Kenneth Knowles <[email protected]> wrote:

> @Reuven: I think DSLs are better served by having their own wrappers than
> by putting their data into generic attributes. They would need attributes
> if they needed to put them in and have them come back out, but usually the
> DSL has a higher-level view and no need for Beam to propagate data on its
> behalf, in fact it is simpler to do it directly at the DSL level. That is
> the case for SQL and LIMIT.
>
> @Mingmin: Agree. The name on each node is the portable way to describe
> transforms. It should be locally unique and the composite structure makes
> them globally unique. Do all our runners use it to make their UIs pretty? I
> don't know. It would be great to check on that and improve it.
>
> @Andrew: Do we really want getStageName()? Can it just be a constant
> string, with the composite structure giving context?
>
> Kenn
>
> On Thu, Jun 14, 2018 at 11:12 AM Mingmin Xu <[email protected]> wrote:
>
>> Is there a guideline about how the name provided in `PCollection.apply(
>> String name, PTransform<? super PCollection<Row>, PCollection<Row>> t)`
>> is adopted in different runners? I suppose that should be the option, to
>> have a readable graph for all runners, instead of 'adjust' it to make
>> DataFlow runner works only.
>>
>> On Thu, Jun 14, 2018 at 8:53 AM, Reuven Lax <[email protected]> wrote:
>>
>>> There was a previous discussion about having generic attributes on
>>> PCollection. Maybe this is a good driving use case?
>>>
>>> On Wed, Jun 13, 2018 at 4:36 PM Kenneth Knowles <[email protected]> wrote:
>>>
>>>> Another thing to consider is that we might return something like a
>>>> "SqlPCollection" that is the PCollection<Row> plus additional metadata that
>>>> is useful to the shell / enumerable converter (such as if the PCollection
>>>> has a known finite size due to LIMIT, even if it is "unbounded", and the
>>>> shell can return control to the user once it receives enough rows). After
>>>> your proposed change this will be much more natural to do, so that's
>>>> another point in favor of the refactor.
>>>>
>>>> Kenn
>>>>
>>>> On Wed, Jun 13, 2018 at 10:22 AM Andrew Pilloud <[email protected]>
>>>> wrote:
>>>>
>>>>> One of my goals is to make the graph easier to read and map back to
>>>>> the SQL EXPLAIN output. The way the graph is currently built
>>>>> (`toPTransform` vs `toPCollection`) does make a big difference in that
>>>>> graph. I think it is also important to have a common function to do the
>>>>> apply with consistent naming. I think that will greatly help with ease of
>>>>> understanding. It sounds like what really want is this in the BeamRelNode
>>>>> interface:
>>>>>
>>>>> PInput buildPInput(Pipeline pipeline);
>>>>> PTransform<PInput, PCollection<Row>> buildPTransform();
>>>>>
>>>>> default PCollection<Row> toPCollection(Pipeline pipeline) {
>>>>>     return buildPInput(pipeline).apply(getStageName(),
>>>>> buildPTransform());
>>>>> }
>>>>>
>>>>> Andrew
>>>>>
>>>>> On Mon, Jun 11, 2018 at 2:27 PM Mingmin Xu <[email protected]> wrote:
>>>>>
>>>>>> EXPLAIN shows the execution plan in SQL perspective only. After
>>>>>> converting to a Beam composite PTransform, there're more steps 
>>>>>> underneath,
>>>>>> each Runner re-org Beam PTransforms again which makes the final pipeline
>>>>>> hard to read. In SQL module itself, I don't see any difference between
>>>>>> `toPTransform` and `toPCollection`. We could have an easy-to-understand
>>>>>> step name when converting RelNodes, but Runners show the graph to
>>>>>> developers.
>>>>>>
>>>>>> Mingmin
>>>>>>
>>>>>> On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> That sounds correct. And because each rel node might have a
>>>>>>> different input there isn't a standard interface (like PTransform<
>>>>>>> PCollection<Row>, PCollection<Row>> toPTransform());
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Agree with that. It will be kind of tricky to generalize. I think
>>>>>>>> there are some criteria in this case that might apply in other cases:
>>>>>>>>
>>>>>>>> 1. Each rel node (or construct of a DSL) should have a PTransform
>>>>>>>> for how it computes its result from its inputs.
>>>>>>>> 2. The inputs to that PTransform should actually be the inputs to
>>>>>>>> the rel node!
>>>>>>>>
>>>>>>>> So I tried to improve #1 but I probably made #2 worse.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Not answering the original question, but doesn't "explain" satisfy
>>>>>>>>> the SQL use case?
>>>>>>>>>
>>>>>>>>> Going forward we probably want to solve this in a more general
>>>>>>>>> way. We have at least 3 ways to represent the pipeline:
>>>>>>>>>  - how runner executes it;
>>>>>>>>>  - what it looks like when constructed;
>>>>>>>>>  - what the user was describing in DSL;
>>>>>>>>> And there will probably be more, if extra layers are built on top
>>>>>>>>> of DSLs.
>>>>>>>>>
>>>>>>>>> If possible, we probably should be able to map any level of
>>>>>>>>> abstraction to any other to better understand and debug the pipelines.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> In other words, revert
>>>>>>>>>> https://github.com/apache/beam/pull/4705/files, at least in
>>>>>>>>>> spirit? I agree :-)
>>>>>>>>>>
>>>>>>>>>> Kenn
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> We are currently converting the Calcite Rel tree to Beam by
>>>>>>>>>>> recursively building a tree of nested PTransforms. This results in 
>>>>>>>>>>> a weird
>>>>>>>>>>> nested graph in the dataflow UI where each node contains its inputs 
>>>>>>>>>>> nested
>>>>>>>>>>> inside of it. I'm going to change the internal data structure for
>>>>>>>>>>> converting the tree from a PTransform to a PCollection, which will 
>>>>>>>>>>> result
>>>>>>>>>>> in a more accurate representation of the tree structure being built 
>>>>>>>>>>> and
>>>>>>>>>>> should simplify the code as well. This will not change the public 
>>>>>>>>>>> interface
>>>>>>>>>>> to SQL, which will remain a PTransform. Any thoughts or objections?
>>>>>>>>>>>
>>>>>>>>>>> I was also wondering if there are tools for visualizing the Beam
>>>>>>>>>>> graph aside from the dataflow runner UI. What other tools exist?
>>>>>>>>>>>
>>>>>>>>>>> Andrew
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> ----
>>>>>> Mingmin
>>>>>>
>>>>>
>>
>>
>> --
>> ----
>> Mingmin
>>
>

Re: Building and visualizing the Beam SQL graph

Reply via email to