Regarding Vladimir’s ideas of Bloom filters and nested loop joins. Both are 
excellent if you can do them. They are fairly easy in single-node architectures 
(especially single-threaded) but get harder in distributed architectures. Bloom 
filters (also magic sets) require data to be pushed “up stream”, and may 
require re-starting sub-graphs.

So, you have to devise query processing algorithms that are appropriate for 
your architecture.

Hive is an example of a highly distributed, parallel engine. Hive would like to 
do Bloom filters but has still not gotten around to it. Nested loops are never 
likely to happen. But Hive uses other techniques.

Julian


> On Aug 29, 2018, at 8:37 AM, Andrei Sereda <and...@sereda.cc> wrote:
> 
> Hi Vladimir,
> 
> Thanks for follow-up and explanation. I wanted to make sure I'm not missing
> (mis-understanding) anything.
> 
> Andrei.
> 
> 
> 
> On Wed, Aug 29, 2018 at 11:01 AM Vladimir Sitnikov <
> sitnikov.vladi...@gmail.com> wrote:
> 
>> One of the approaches to such queries is to throw Bloom filters all over
>> the place.
>> 
>> That is it could execute "small side" of the join, collect the ids (or a
>> lossy version of it in a form of Bloom filters),
>> and it could propagate that Bloom filter to the second source to reduce the
>> set of rows produced by the second row source.
>> Then the join would be easier to do since the second row source is reduced.
>> 
>> The sad thing is not all systems support propagation of bloom filters.
>> 
>>> select *from
>>> t1 join t2 on (t1.id = t2.id)where
>>> t2.id in (select id from t1) -- force sub selec
>> 
>> What if Calcite did just a regular batched nested loop join?
>> That is:
>> 1. Fetch next 10 rows from t1
>> 2. Fetch "from t2 where id in (...)"
>> 3. goto 1
>> 
>> It can be expressed via correlated subqueries, however:
>> a) I'm not sure correlated subqueries work great at the moment
>> b) Support for "batched" correlated execution is likely not there
>> c) Calcite should somehow know the true cost of "from t2 where id in (1,2)"
>> vs "from t2 where id in (1,2,3,4)". In other words, current costing model
>> does not take into account if the table has index or not. One can code such
>> costing rules, however I think it is not there yet.
>> 
>> Vladimir
>> 

Reply via email to