> I am proposing to use sargs only where we would today use RexCall(IN). The 
> data structures have about the same size. Sargs are sorted. I cannot see any 
> cases that would become more expensive.

IIRC, RexCall(IN) is not supported yet.

With sargs, the values have to be sorted, whether explicitly or implicitly; with 
20k values, that will take some time. I am not saying the data structure itself 
is expensive, but how it is used: it all depends on how downstream projects 
derive stats from it. I have seen the Greenplum optimizer experience performance 
issues for large IN lists when deriving all the stats info from thousands of 
disjoint range intervals. Some downstream projects may not even use histograms. 
Some may just use the number of IN constants to derive the number of distinct 
values; I would imagine those people would have to convert the Sarg back to a 
constant list, which is an extra burden. 

Suppose we have col in (1, 2, 4, 6, 8, 10, 12, 13, 15, 17, 19, 21, 22, 24, 24+2, .....), 
10k values. Now the range sets will be {[1,2], [4,4], ....[10,10], [12,13], 
[15,15]....[21,22], [24,24]....}. In the execution engine, what if I want to do 
a lookup for this expression? Find the points, extract the ranges and make them 
disjoint, or infer points from the integer ranges?
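To make the back-and-forth concrete, here is a minimal stdlib-only sketch of 
what I mean (the class and method names are hypothetical, not Calcite's actual 
Sarg implementation): coalescing a sorted IN list of integers into disjoint 
closed ranges, and then expanding the ranges back into points when a consumer 
needs the constant list again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: IN-list points <-> disjoint closed integer ranges.
public class RangeCoalesce {
  // A closed integer interval [lo, hi].
  public static final class Range {
    public final int lo, hi;
    Range(int lo, int hi) { this.lo = lo; this.hi = hi; }
    @Override public String toString() { return "[" + lo + ", " + hi + "]"; }
  }

  // Sort and dedup the IN constants, then merge adjacent integers into ranges.
  public static List<Range> toRanges(int... points) {
    TreeSet<Integer> sorted = new TreeSet<>();
    for (int p : points) sorted.add(p);
    List<Range> ranges = new ArrayList<>();
    Integer lo = null, hi = null;
    for (int p : sorted) {
      if (lo == null) { lo = hi = p; }
      else if (p == hi + 1) { hi = p; }            // adjacent: extend range
      else { ranges.add(new Range(lo, hi)); lo = hi = p; }
    }
    if (lo != null) ranges.add(new Range(lo, hi));
    return ranges;
  }

  // The extra burden described above: recover the point list from the ranges.
  public static List<Integer> toPoints(List<Range> ranges) {
    List<Integer> points = new ArrayList<>();
    for (Range r : ranges)
      for (int p = r.lo; p <= r.hi; p++) points.add(p);
    return points;
  }
}
```

For col in (1, 2, 4, 6) this produces {[1,2], [4,4], [6,6]}; a consumer that 
only wanted the distinct-value count has to walk the ranges again to get back 
to 4 points.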

IMHO, the interval constraint is more like a kind of logical property and can 
be a strategy for simplification. People can derive the interval constraints 
from the IN list and do simplification on them (though I doubt the value in 
practice, as it usually brings more trouble than benefit), but we don't have to 
represent it with a Sarg in the RexNode world.

> Requiring sargs to support geospatial operations seems an unreasonably high 
> bar. Many techniques that we use (sorting, hashing, sketches) do not support 
> geospatial.
> Areas that have not-totally-ordered data types tend to have their own 
> operators already. For your example I’d write ST_Intersects(col, 
> ST_Collect(area1, area2, area3)), and that would be a perfectly suitable 
> representation in RexNode-land.

I am not asking sargs to support geospatial operations. Geospatial intersection 
is just one example of a potential customized operation; such operations are 
not limited to geospatial types. They could be other binary operations. We 
can't require every customized operation to have something like ST_Collect. 

Oracle supports this:
select * from foo where a > ANY(1, 2, 3, ....., 999);
select * from foo where a <= ANY(b+1, b+2, b+3, ....., 999);
Although in reality that kind of query might be rare, how would Calcite 
represent it in the Oracle dialect? And would it be good to support operations 
other than equals, like <, >=, or even customized operations? Especially when 
downstream projects may use Calcite to build their own in-house systems, some 
of which are not SQL-standard. 
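For the literal-list case at least, a comparison with ANY reduces to a single 
bound rather than a Sarg-like structure: a > ANY(c1..cn) is equivalent to 
a > min(c1..cn), and a <= ANY(c1..cn) to a <= max(c1..cn). A minimal sketch 
(class and method names are hypothetical, and this only covers constant 
operands, not expression lists like b+1, b+2, ...):

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: comparison-with-ANY over a literal list collapses to
// one bound, since ANY needs only a single element to satisfy the predicate.
//   a >  ANY(c1..cn)  <=>  a >  min(c1..cn)
//   a <= ANY(c1..cn)  <=>  a <= max(c1..cn)
public class AnyRewrite {
  public static boolean greaterThanAny(int a, List<Integer> constants) {
    return a > Collections.min(constants);
  }
  public static boolean lessOrEqualAny(int a, List<Integer> constants) {
    return a <= Collections.max(constants);
  }
}
```

The open question remains how to represent such predicates, and customized 
comparison operators, in RexNode form.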

BTW, what is your concern about the proposal of RexListCmp [1]?

Haisheng

[1] 
https://lists.apache.org/thread.html/ra36f256e3f86f78ee126455b3ba67dbcb7bd1862214886c65ffba432%40%3Cdev.calcite.apache.org%3E

On 2020/08/11 16:44:56, Julian Hyde <jhyde.apa...@gmail.com> wrote: 
> Let’s bring this discussion back to the original question: for these kinds of 
> predicates, is Sarg a better data structure than IN?
> 
> It absolutely is. There are algorithms in RexSimplify that, for each term in 
> an OR list, simplify using all available predicates, and when they have 
> simplified that term, add it to the list of predicates. It is therefore 
> cartesian in the size of the OR list.
> 
> The sarg data structure converts many kinds of large OR-lists and predicate 
> sets into a single term. Because that term is immutable and based on sorted 
> ranges (or points) it can be handled efficiently (e.g. two sargs can be 
> intersected using a merge, rather than nested loops). That will make our 
> simplification process more efficient.
> 
> Whether we then add new classes of optimization is a discussion for another 
> day.
> 
> Julian
> 
> 
> > On Aug 10, 2020, at 1:13 PM, Vladimir Sitnikov 
> > <sitnikov.vladi...@gmail.com> wrote:
> > 
> > Julian>I cannot see any cases that would become more expensive.
> > 
> > I mean the optimization passes might be complicated, not the storage of
> > sargs themselves.
> > For instance, CALCITE-4155 converts a in (1, 2, 3, 4, 5) to a >= 1 and a <=
> > 5
> > Is it a significant improvement?
> > Is between representation always better?
> > Does the case appear very often in practice?
> > 
> > However, the optimization does take time, so it looks like extra logic with
> > doubtful gains.
> > Of course, it is unlikely sargs would be a major time consumer, however, if
> > we keep adding tiny simplifications we might
> > end up with a condition where the planning is slow and we have no way to
> > fix it.
> > 
> > I'm ok with people updating RexSimplify (and 4155 looks like an innocent
> > feature), however, I think the current design is not really scalable
> > (e.g. it might process the same expression multiple times).
> > 
> > Vladimir
> 
> 
