On Wed, Apr 28, 2021 at 12:16 PM Rui Wang <ruw...@google.com> wrote:

> >Could you point out why "non equi-join" can’t be supported? Or can it
> be supported, and this is just a question of implementation?
>
> It is a question of implementation. Assuming joins are implemented by
> CoGBK, a non-equi-join probably means you have to generate the key space
> and then use CoGBK (which is an equi-join) to do the join.
>

I think joins are a big unexplored space (for Beam). We should learn from
other projects that offer a variety of join algorithms not available in
the Java SDK. Adding them there will be most valuable, and then SQL can
take advantage of them too.
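
To make the "generate the key space, then equi-join" idea concrete, here is
a minimal sketch in plain Python rather than Beam. It assumes a range join
(a point joins an interval when start <= ts < end) with integer timestamps.
The names co_group_by_key, range_join, and bucket_size are hypothetical and
only illustrate the shape of the approach; this is not how Beam SQL
implements joins.

```python
from collections import defaultdict


def co_group_by_key(left, right):
    """Group two keyed collections by key, like a tiny in-memory CoGBK."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return grouped


def range_join(points, intervals, bucket_size):
    """Join points (name, ts) to half-open intervals (name, start, end).

    The non-equi predicate start <= ts < end is turned into an equi-join
    on a generated bucket key, followed by a filter on the real predicate.
    """
    # Each point belongs to exactly one bucket.
    keyed_points = [(ts // bucket_size, (name, ts)) for name, ts in points]

    # Each interval is replicated into every bucket it overlaps.
    keyed_intervals = []
    for name, start, end in intervals:
        first = start // bucket_size
        last = (end - 1) // bucket_size
        for bucket in range(first, last + 1):
            keyed_intervals.append((bucket, (name, start, end)))

    results = []
    grouped = co_group_by_key(keyed_points, keyed_intervals)
    for bucket_points, bucket_intervals in grouped.values():
        for p_name, ts in bucket_points:
            for i_name, start, end in bucket_intervals:
                if start <= ts < end:  # apply the original predicate
                    results.append((p_name, i_name))
    return sorted(results)
```

The bucket size trades replication of intervals against the amount of
filtering done per key, which is one reason the choice of join algorithm
matters so much here.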


> >I’m curious what the current implementation of "ORDER BY LIMIT" is, and
> can it be applied, at least for bounded collections/the global window, in
> the same way to "ORDER BY" without a limit?
>
> IIRC, the implementation is based on the Top transform. I think the real
> question is: when is supporting only ORDER BY, e.g. for a bounded
> collection/global window, useful?
>

+1 to this question. It is why we never implemented it or added it to the
model. Global ordering typically cannot even be observed since by one
definition "big data" means it cannot all be observed. If the TPC-DS
queries are using ORDER BY without TOP then we should see why. After
looking at a few examples we may have some idea for how to produce a plan,
or we may decide the query is not useful as-is.
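
As a rough illustration of why ORDER BY with LIMIT is tractable while a
global ORDER BY is not, here is a small Python sketch of a top-n
computation. It assumes descending order and in-memory partitions; the
function names are hypothetical and this is not Beam's Top transform
itself, only the same idea: each partition keeps at most n rows, and the
partial results are merged.

```python
import heapq


def top_n(rows, n, key):
    """Return the n largest rows by key, in descending order."""
    return heapq.nlargest(n, rows, key=key)


def order_by_limit(partitions, n, key):
    """Emulate ORDER BY key DESC LIMIT n over partitioned data.

    Every partition contributes at most n candidates, so the merge step
    only ever sees n * len(partitions) rows, never the whole dataset.
    """
    candidates = []
    for partition in partitions:
        candidates.extend(top_n(partition, n, key))
    return top_n(candidates, n, key)
```

Without the limit there is no bound on what each partition has to keep,
which is the core of the question above.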

> >I have one related question. Would we be able to apply SQL-specific
> optimizations that apply only to batch-only pipelines? I am asking because
> I can imagine that covering the full Beam model constrains the
> optimization possibilities, no?
>
> I am not sure if we can see that a pipeline is batch-only during the SQL
> optimization process. But as I recall we can see whether inputs are
> bounded/unbounded, and we can probably restrict some optimizations to
> bounded PCollections.
>

There is no such thing as a batch pipeline, but there is a pipeline with
only bounded PCollections :-)

We can see that a PCollection is bounded in SQL and Java. This could be
made an attribute on a relation so the optimizer would know. Often a naive
compilation to Beam, and then letting the Beam runner optimize, is better,
since Beam is already high level and usually doesn't care about
bounded/unbounded, and runners produce true physical plans. For joins we
need the bounded/unbounded information for sure.
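
A minimal sketch of what "an attribute on a relation" could look like, in
plain Python. Everything here is hypothetical: the Relation class, the
is_bounded field, and the strategy labels are made up for illustration and
do not correspond to actual Beam SQL or Calcite classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str
    is_bounded: bool  # hypothetical attribute the optimizer could read


def choose_join_strategy(left, right):
    """Pick a (hypothetical) join strategy from input boundedness."""
    if left.is_bounded and right.is_bounded:
        # All data is finite, so any join algorithm is a candidate.
        return "any-join-algorithm"
    if left.is_bounded != right.is_bounded:
        # Exactly one side is finite: look it up from the unbounded side.
        return "lookup-bounded-side"
    # Both sides are unbounded: the join has to be windowed.
    return "windowed-cogbk"
```

The point is only that the decision needs boundedness as an input; the
actual set of strategies is an open design question.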

Kenn


>
> On Wed, Apr 28, 2021 at 9:33 AM Alexey Romanenko <aromanenko....@gmail.com>
> wrote:
>
>>
>>
>> Cause 4 looks like no such function was found in the catalog.
>>
>>
>> I guess it should be
>> "SUBSTRING(<CHARACTER> FROM <NUMERIC> FOR <NUMERIC>)" instead of
>> "substr(<CHARACTER>, <NUMERIC>, <NUMERIC>)"?
>>
>>
>> Well, s/substr/substring/ seems to fix this problem.
>>
>> —
>> Alexey
>>
>>
>>> I’m not very familiar with the current status of ongoing work on Beam
>>> SQL, so I’m sorry in advance if my questions sound naive.
>>>
>>> Please, guide me on this:
>>>
>>> 1. Are there any chances that we can resolve, at least partly, the
>>> current limitations of the query parsing/planning mentioned above? Are
>>> there any principal blockers among them?
>>> 2. Are there any plans or ongoing work related to this?
>>> 3. Are there any plans to upgrade the vendored Calcite version to a more
>>> recent one? Would that reduce the number of current limitations or not?
>>> 4. Do you think it could be valuable for Beam SQL to run the TPC-DS
>>> benchmark on a regular basis (as we do for Nexmark, for example) even if
>>> not all queries can pass with Beam SQL?
>>>
>>
>> This is definitely valuable for Beam SQL if we have enough resources to
>> run such queries regularly.
>>
>>>
>>> I’d appreciate any additional information/docs/details/opinions on this
>>> topic.
>>>
>>> —
>>> Alexey
>>>
>>> [1] https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds
>>> [2] http://www.tpc.org/tpcds/
>>> [3]
>>> https://docs.google.com/spreadsheets/d/1Gya9Xoa6uWwORHSrRqpkfSII4ajYvDpUTt0cNJCRHjE/edit?usp=sharing
>>> [4]
>>> https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds/src/main/resources/queries
>>>
>>
>>
>>
