One of the approaches to such queries is to throw Bloom filters all over
the place.

That is, it could execute the "small side" of the join, collect the ids (or a
lossy version of them in the form of a Bloom filter),
and propagate that Bloom filter to the second source to reduce the
set of rows produced by the second row source.
Then the join would be easier to do, since the second row source is reduced.
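
To make the idea concrete, here is a minimal Python sketch of that flow. It is
not how Calcite or any particular engine implements it: the BloomFilter class,
its sizing (1024 bits, 3 hashes) and the bloom_reduced_join helper are all
hypothetical names made up for illustration, and rows are plain dicts.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; the sizing defaults are arbitrary assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bitset

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def bloom_reduced_join(small_rows, large_rows, key="id"):
    """Join two row lists on `key`, pre-filtering the large side."""
    # 1. Execute the small side and collect its ids in lossy form.
    bloom = BloomFilter()
    for row in small_rows:
        bloom.add(row[key])

    # 2. "Propagate" the filter: the second source only emits rows that
    #    might match. False positives are possible, false negatives are not.
    reduced = [row for row in large_rows if bloom.might_contain(row[key])]

    # 3. Ordinary hash join over the reduced row set; this step also
    #    discards any false positives that slipped through.
    by_key = {}
    for row in small_rows:
        by_key.setdefault(row[key], []).append(row)
    joined = [(s, l) for l in reduced for s in by_key.get(l[key], [])]
    return joined, len(reduced)
```

In a real system step 2 would run inside the second data source, which is
exactly the part that needs support for receiving the filter.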

The sad thing is that not all systems support propagation of Bloom filters.

>select * from
>  t1 join t2 on (t1.id = t2.id)
>where
>  t2.id in (select id from t1) -- force sub selec

What if Calcite did just a regular batched nested loop join?
That is:
1. Fetch the next 10 rows from t1
2. Fetch "from t2 where id in (...)" using the ids of those 10 rows
3. Go to 1 until t1 is exhausted
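
The loop above can be sketched in Python as follows. This is only an
illustration under assumptions: fetch_t2_by_ids is a hypothetical callback
standing in for "from t2 where id in (...)" against the second source, rows
are plain dicts, and the batch size of 10 just mirrors the steps above.

```python
from itertools import islice


def batched(iterable, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def batched_nested_loop_join(t1_rows, fetch_t2_by_ids, batch_size=10, key="id"):
    """Join t1 with t2 on `key`, probing t2 once per batch of t1 rows.

    fetch_t2_by_ids(ids) is a hypothetical callback that returns the t2
    rows whose key is in `ids`.
    """
    for batch in batched(t1_rows, batch_size):      # 1. next rows from t1
        ids = {row[key] for row in batch}
        matches = {}
        for t2_row in fetch_t2_by_ids(ids):         # 2. one IN-list lookup
            matches.setdefault(t2_row[key], []).append(t2_row)
        for t1_row in batch:
            for t2_row in matches.get(t1_row[key], []):
                yield t1_row, t2_row
        # 3. the for loop moves on to the next batch


# Usage: t2 is an in-memory list here; a real source would run a query.
t2 = [{"id": i, "v": i * 10} for i in range(0, 50, 2)]
pairs = list(batched_nested_loop_join(
    [{"id": i} for i in range(25)],
    lambda ids: [r for r in t2 if r["id"] in ids],
))
```

With 25 rows in t1 and a batch size of 10, the second source is probed three
times instead of 25.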

It can be expressed via correlated subqueries, however:
a) I'm not sure correlated subqueries work well at the moment
b) Support for "batched" correlated execution is likely not there
c) Calcite should somehow know the true cost of "from t2 where id in (1,2)"
vs "from t2 where id in (1,2,3,4)". In other words, the current cost model
does not take into account whether the table has an index. One can code such
costing rules, however I think they are not there yet.
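
To illustrate point c), here is a toy cost model in Python. It is not
Calcite's cost model or API: the function names and the formulas (a full scan
costs one unit per row, an index lookup costs roughly log2(rows) per key) are
assumptions chosen only to show why the index matters.

```python
import math


def full_scan_cost(table_rows):
    # Assumption: without an index every row is read once.
    return float(table_rows)


def index_lookup_cost(table_rows, num_keys):
    # Assumption: one B-tree style descent per key in the IN list.
    return num_keys * math.log2(max(table_rows, 2))


def in_list_cost(table_rows, num_keys, has_index):
    """Estimated cost of "from t where id in (k1, ..., kn)"."""
    if has_index:
        # A planner would pick whichever access path is cheaper.
        return min(full_scan_cost(table_rows),
                   index_lookup_cost(table_rows, num_keys))
    return full_scan_cost(table_rows)
```

Under these assumptions the cost grows with the length of the IN list only
when the table has an index; without one, "in (1,2)" and "in (1,2,3,4)" both
cost a full scan.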

Vladimir
