Re: joins and low selectivity optimization

Stamatis Zampetakis Wed, 20 Feb 2019 01:24:14 -0800

Hi Andrei,

I am not I understand why bloom-filters and block based nested loop are not
possible for your use case but I will try to provide some answers to the
new questions you raised.

By adding streams in the discussion I guess you add some additional
limitations on one side of the join (Orders table):

(i) iterators/enumerators cannot be restarted;
(ii) iterators/enumerators are infinite;
(iii) there are no indexes.

Andrei> 1. Am I correct to assume that each event in Orders table (which is
a stream) will trigger full table scan (without filter) on Products table?

The limitations imposed by streams already exclude some join options but
still it doesn't mean that there is only one way to execute the join.
So the answer to this question is in general no. Consider for instance that
the products table is small (and fits in memory) then a
hash join algorithm could be used where you build a hash table based on
products and then probe the hash table with the stream relation.

I would suspect that the default rule set in Calcite chooses this join
option by default (i.e., EnumerableJoin).

Andrei> 2. Can I register my custom rule to rewrite the query when, say,
Orders and Products tables are present to manually add a sub query ?

It seems that this kind of optimizations are more appropriate at the
physical level so if you end up writting custom rules I think it may be
better
to target a physical convention (e.g., Enumerable relnodes). Moreover, it
seems like you want to introduce the equivalent of semi-join reducers so
in principle you should introduce something similar to EnumerableSemiJoin.
I am saying similar because the current implementation of EnumerableSemiJoin
is almost the same to the EnumerableJoin (i.e., hash-join based) so I would
assume that you are not going to gain match by doing so.

Andrei> 3. Do I have to disable SubQueryRemoveRule in this case ?

If I remember correctly the rule tries to transform correlates to joins so
I guess it depends on what kind of physical join algorithm you want to use.
Correlate aka., nested loop join can be a great alternative in some cases
as other people mentioned before.

Andrei> 4. Vadym, not sure how sub-query computation will work. Can I
partially execute the query and convert the subquery into EnumerableValues ?

Not sure what do you mean here. If you write/choose (by custom rules) the
algorithm your self you can control exactly what is happening and how the
query will be executed.

Andrei>Is there a way to solve this problem non-generically ?
We’re also hitting this limitation in Flink (which uses calcite but not
calcite streams) for similar use-case.

It seems that you have some concrete query plans (using built-in calcite &
other operators) which are not performing well.
Maybe it would be helpful if you could share the problematic plan in this
discussion.

Best,
Stamatis

Στις Τετ, 20 Φεβ 2019 στις 1:32 π.μ., ο/η Andrei Sereda <and...@sereda.cc>
έγραψε:

> Hello,
>
> I would like to resurrect this thread in the context of calcite streams.
> Unfortunately bloom-filters is not an option for the data-sources being
> used.
>
> Say one has stream to table join
> <https://calcite.apache.org/docs/stream.html#joining-streams-to-tables>.
> From docs example:
>
> SELECT STREAM
>     o.productId, o.orderId, o.units, p.name, p.unitPrice FROM Orders
> AS o -- streamable Table JOIN Products AS p -- reference data table
>     ON o.productId = p.productId;
>
>
>    1. Am I correct to assume that each event in Orders table (which is a
>    stream) will trigger full table scan (without filter) on Products table
>    ?
>    2. Can I register my custom rule to rewrite the query when, say, Orders
>    and Products tables are present to manually add a sub query ?
>    3. Do I have to disable SubQueryRemoveRule in this case ?
>    4. Vadym, not sure how sub-query computation will work. Can I partially
>    execute the query and convert the subquery into EnumerableValues ?
>
> Is there a way to solve this problem non-generically ?
>
> We’re also hitting this limitation in Flink (which uses calcite but not
> calcite streams) for similar use-case.
>
> Many Thanks,
> Andrei.
>
> On Thu, Aug 30, 2018 at 5:27 PM Vineet Garg <vg...@hortonworks.com> wrote:
>
> > Hive actually does this optimization (it is called semi-join reduction)
> by
> > generating bloom-filters on one side and passing it on to the other side.
> > This is not a rewrite but instead a physical implementation.
> >
> > Vineet
> >
> > On Aug 29, 2018, at 10:34 AM, Vladimir Sitnikov <
> > sitnikov.vladi...@gmail.com<mailto:sitnikov.vladi...@gmail.com>> wrote:
> >
> > Nested loops are never likely to happe
> >
> > What's wrong with that?
> >
> > Apparently Andrei asks for that, and "subquery precomputation" is quite
> > close to nested loops in my opinion.
> >
> > Vladimir
> >
> >
>

Re: joins and low selectivity optimization

Reply via email to