Is there a guideline on how the name provided in `PCollection.apply(String name, PTransform<? super PCollection<Row>, PCollection<Row>> t)` is adopted by different runners? I suppose that should be the goal: to have a readable graph for all runners, instead of 'adjusting' the names just so the Dataflow runner's graph looks right.
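To make the question concrete, here is a minimal sketch (class and step names are made up, not from any existing code) of where those names come from. Whatever label we pass to `apply()` is what I'd hope every runner surfaces for that node:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class StepNameExample {
      public static void main(String[] args) {
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Each String passed to apply() becomes the label of that node in the
        // pipeline graph; runners may prefix it with enclosing composite names,
        // but ideally they keep the label readable rather than rewriting it.
        PCollection<String> words =
            pipeline.apply("CreateWords", Create.of("beam", "sql", "beam"));

        PCollection<KV<String, Long>> counts =
            words.apply("CountWords", Count.perElement());

        pipeline.run().waitUntilFinish();
      }
    }
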
On Thu, Jun 14, 2018 at 8:53 AM, Reuven Lax <[email protected]> wrote:

> There was a previous discussion about having generic attributes on
> PCollection. Maybe this is a good driving use case?
>
> On Wed, Jun 13, 2018 at 4:36 PM Kenneth Knowles <[email protected]> wrote:
>
>> Another thing to consider is that we might return something like a
>> "SqlPCollection" that is the PCollection<Row> plus additional metadata that
>> is useful to the shell / enumerable converter (such as if the PCollection
>> has a known finite size due to LIMIT, even if it is "unbounded", and the
>> shell can return control to the user once it receives enough rows). After
>> your proposed change this will be much more natural to do, so that's
>> another point in favor of the refactor.
>>
>> Kenn
>>
>> On Wed, Jun 13, 2018 at 10:22 AM Andrew Pilloud <[email protected]> wrote:
>>
>>> One of my goals is to make the graph easier to read and map back to the
>>> SQL EXPLAIN output. The way the graph is currently built (`toPTransform` vs
>>> `toPCollection`) does make a big difference in that graph. I think it is
>>> also important to have a common function to do the apply with consistent
>>> naming. I think that will greatly help with ease of understanding. It
>>> sounds like what we really want is this in the BeamRelNode interface:
>>>
>>> PInput buildPInput(Pipeline pipeline);
>>> PTransform<PInput, PCollection<Row>> buildPTransform();
>>>
>>> default PCollection<Row> toPCollection(Pipeline pipeline) {
>>>   return buildPInput(pipeline).apply(getStageName(), buildPTransform());
>>> }
>>>
>>> Andrew
>>>
>>> On Mon, Jun 11, 2018 at 2:27 PM Mingmin Xu <[email protected]> wrote:
>>>
>>>> EXPLAIN shows the execution plan from the SQL perspective only. After
>>>> converting to a Beam composite PTransform, there are more steps underneath;
>>>> each Runner re-organizes the Beam PTransforms again, which makes the final
>>>> pipeline hard to read. In the SQL module itself, I don't see any difference
>>>> between `toPTransform` and `toPCollection`. We could have an
>>>> easy-to-understand step name when converting RelNodes, but Runners show the
>>>> graph to developers.
>>>>
>>>> Mingmin
>>>>
>>>> On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud <[email protected]> wrote:
>>>>
>>>>> That sounds correct. And because each rel node might have a different
>>>>> input, there isn't a standard interface (like
>>>>> PTransform<PCollection<Row>, PCollection<Row>> toPTransform()).
>>>>>
>>>>> Andrew
>>>>>
>>>>> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles <[email protected]> wrote:
>>>>>
>>>>>> Agree with that. It will be kind of tricky to generalize. I think
>>>>>> there are some criteria in this case that might apply in other cases:
>>>>>>
>>>>>> 1. Each rel node (or construct of a DSL) should have a PTransform for
>>>>>> how it computes its result from its inputs.
>>>>>> 2. The inputs to that PTransform should actually be the inputs to the
>>>>>> rel node!
>>>>>>
>>>>>> So I tried to improve #1 but I probably made #2 worse.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin <[email protected]> wrote:
>>>>>>
>>>>>>> Not answering the original question, but doesn't "explain" satisfy
>>>>>>> the SQL use case?
>>>>>>>
>>>>>>> Going forward we probably want to solve this in a more general way.
>>>>>>> We have at least 3 ways to represent the pipeline:
>>>>>>> - how the runner executes it;
>>>>>>> - what it looks like when constructed;
>>>>>>> - what the user was describing in the DSL;
>>>>>>> And there will probably be more, if extra layers are built on top of
>>>>>>> the DSLs.
>>>>>>>
>>>>>>> If possible, we probably should be able to map any level of
>>>>>>> abstraction to any other to better understand and debug the pipelines.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles <[email protected]> wrote:
>>>>>>>
>>>>>>>> In other words, revert
>>>>>>>> https://github.com/apache/beam/pull/4705/files, at least in spirit?
>>>>>>>> I agree :-)
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> We are currently converting the Calcite Rel tree to Beam by
>>>>>>>>> recursively building a tree of nested PTransforms. This results in a
>>>>>>>>> weird nested graph in the Dataflow UI where each node contains its
>>>>>>>>> inputs nested inside of it. I'm going to change the internal data
>>>>>>>>> structure for converting the tree from a PTransform to a PCollection,
>>>>>>>>> which will result in a more accurate representation of the tree
>>>>>>>>> structure being built and should simplify the code as well. This will
>>>>>>>>> not change the public interface to SQL, which will remain a
>>>>>>>>> PTransform. Any thoughts or objections?
>>>>>>>>>
>>>>>>>>> I was also wondering if there are tools for visualizing the Beam
>>>>>>>>> graph aside from the Dataflow runner UI. What other tools exist?
>>>>>>>>>
>>>>>>>>> Andrew
>>>>>>>>>
>>>>
>>>>
>>>> --
>>>> ----
>>>> Mingmin
>>>>
>>>

--
----
Mingmin
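
For reference, here is Andrew's proposal from the thread above written out as a self-contained interface. This is only a sketch: the snippet in the thread uses PInput, but PInput itself has no apply() method, so I've narrowed the input type to PCollectionList<Row> here just to keep the example compilable, and getStageName() is an assumed helper rather than anything that exists today.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;
    import org.apache.beam.sdk.values.Row;
    import org.apache.calcite.rel.RelNode;

    /**
     * Sketch of the proposed refactor: each rel node builds its inputs and the
     * PTransform producing its output, and one shared default method applies
     * the transform with a consistent, readable stage name.
     */
    public interface BeamRelNode extends RelNode {

      /** Recursively converts this node's inputs into PCollections of Rows. */
      PCollectionList<Row> buildPInput(Pipeline pipeline);

      /** The transform that computes this node's rows from its inputs. */
      PTransform<PCollectionList<Row>, PCollection<Row>> buildPTransform();

      /** Single place that decides how every SQL node is named in the graph. */
      default PCollection<Row> toPCollection(Pipeline pipeline) {
        return buildPInput(pipeline).apply(getStageName(), buildPTransform());
      }

      /** Assumed helper: a readable, unique label such as "BeamCalcRel_42". */
      default String getStageName() {
        return getClass().getSimpleName() + "_" + getId();
      }
    }

Kenn's "SqlPCollection" idea would then presumably be a thin wrapper around the PCollection<Row> returned by toPCollection, carrying extra metadata such as a known row limit for the shell.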
