Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Julian Hyde Fri, 28 Aug 2020 14:20:13 -0700

I now have a PR for this change. Can some people review?

https://github.com/apache/calcite/pull/2124 
<https://github.com/apache/calcite/pull/2124>


There are legitimate concerns that we are adding a new RexCall operator 
(SEARCH) and there is a considerable effort to support it everywhere. But we 
are removing two RexCall operators (IN-list and BETWEEN), because SEARCH 
subsumes them.

The benefit is that we can do ‘range algebra’ simplifications without resorting 
to error-prone Boolean logic simplifications as often as we do today.

I have added utilities to introduce SEARCH into SEARCH-less RexNode 
expressions, and to remove SEARCH from RexNode expressions, and to convert 
RexNode expressions that contain SEARCH directly into SqlNode expressions.

In several places, I just expanded SEARCH, but it would be better to handle 
SEARCH explicitly, as if it were a comparison operator (along with =, <, <= 
etc.) It might be worth revisiting the Mongo, Geode and Druid adapters. Geode 
has an ‘IS SET’ construct, very similar to SQL ‘IN-list’, that would easier to 
recognize on SEARCH than on OR.

Julian




> On Aug 13, 2020, at 12:48 PM, Julian Hyde <jhyde.apa...@gmail.com> wrote:
> 
> I have logged https://issues.apache.org/jira/browse/CALCITE-4173 
> <https://issues.apache.org/jira/browse/CALCITE-4173> and am close to a 
> prototype.
> 
> See comments inline, below.
> 
>> On Aug 11, 2020, at 3:39 PM, Haisheng Yuan <hy...@apache.org 
>> <mailto:hy...@apache.org>> wrote:
>> 
>>> I am proposing to use sargs only where we would today use RexCall(IN). The 
>>> data structures have about the same size. Sargs are sorted. I cannot see 
>>> any cases that would become more expensive.
>> 
>> IIRC, RexCall(IN) is not supported yet.
> 
> It’s not officially supported. But it is there - added in 
> https://issues.apache.org/jira/browse/CALCITE-2444 
> <https://issues.apache.org/jira/browse/CALCITE-2444> for just Rel-to-Sql part 
> of the process, and people seem to be using it elsewhere, e.g. 
> https://issues.apache.org/jira/browse/CALCITE-1413 
> <https://issues.apache.org/jira/browse/CALCITE-1413> handles IN in CASE.
> 
> I have not used RexCall.IN personally. I think there’s too much cost (in 
> making all places aware of the IN operator) and not enough benefit (because 
> it cannot handle ranges or <>).
> 
> One part of the cost is null-semantics. If the values are not literals, one 
> of them might evaluate to NULL. So we have to be very conservative, 
> especially when optimizing “NOT IN (list)”. There are no tests for that 
> simplification, so we’re probably doing it wrong.
> 
> So, I propose to remove IN.
> 
>> With sargs, you have to sort it, no matter explicitly or implicitly, in case 
>> 20k values, it will take some time.
> 
> I am not too concerned about the cost of sorting. If you have parsed and 
> validated an IN list with 20k items, the query has already burned a few CPU 
> cycles.
> 
> The Sarg data structure is immutable and one instance should suffice for the 
> whole planning process, so the cost of sorting on creation is more than 
> repaid by the faster access later on.
> 
>> I am not saying the data structure itself is expensive, but how it is used, 
>> it all depends on how downstream projects derive stats from it. I have seen 
>> Greenplum optimizer experienced performance issue for large IN list when 
>> deriving all the stats info from the thousands of disjoint range intervals. 
>> Some downstream project may not even use histogram. Some may just use the 
>> number of IN constants to derive the number of distinct values, I would 
>> imagine people have to construct the Sarg back to constant list, this is an 
>> extra burden. 
> 
> If there are performance issues, we can add extra operations to the Sarg data 
> structure without changing its size/cost. For example, record whether all the 
> ranges are points. And if so, efficiently iterate over all the values.
> 
>> Suppose we have col in (1,2,4,6,8,10, 12,13, 15,17,19, 21,22, 24, 24+2.....) 
>> 10k values, now the range sets will be {[1,2], [4,4], ....[10,10], [12,13], 
>> [15,15]....[21,22], [24,24]....}, in the execution engine, what if I want to 
>> do the look up for this expression? Find out the points, extract ranges and 
>> disjoint them, or infer points from integer ranges?
> 
> You should transcribe the Sarg into a data structure that makes sense for 
> your execution engine. Say a bloom filter, a compressed bitmap, or a perfect 
> hash.
> 
>> IMHO, the interval constraint is more like a kind of logical property and 
>> can be strategy for simplification, people can derive the interval 
>> constraints from the IN list, do simplification on it (but I doubt the value 
>> in practice, which usually brings more trouble than benefits), but we don't 
>> have to represent it with sarg in RexNode world.
> 
> I agree that it should be a logical property. It should be part of 
> RelOptPredicateList. But to apply those predicates efficiently, we need one 
> predicate that searches in a range rather than 20k predicates.
> 
>> I am not asking sargs to support geospatial operations. And geospatial 
>> intersect is just an example of potential customized operation, which are 
>> not limited to geospatials. It can be some other binary operations. We can't 
>> require all the customized operation to have something like ST_Collect. 
> 
> Sure, there are other techniques. One of which is generating rasters (e.g. if 
> you are doing a spatial join of states to national parks, pre-generate for 
> each national park a bitmap of the all 100x100km squares, and put a 1 in the 
> square if it overlaps the park). Rasters are kind of similar to sargs.
> 
> I have been working on space-filling curves as indexes for point-to-shape 
> joins, as part of https://issues.apache.org/jira/browse/CALCITE-1861 
> <https://issues.apache.org/jira/browse/CALCITE-1861>. In 
> https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966
>  
> <https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966>
>  see how we transform a call to ST_DWithin to the predicate 
> 
> LogicalFilter(condition=[AND(OR(AND(>=($4, 36496), <=($4, 36520)), AND(>=($4, 
> 36456), <=($4, 36464)), AND(>=($4, 33252), <=($4, 33254)), AND(>=($4, 33236), 
> <=($4, 33244)), AND(>=($4, 33164), <=($4, 33176)), AND(>=($4, 33112), <=($4, 
> 33156)), AND(>=($4, 33092), <=($4, 33100)), AND(>=($4, 33055), <=($4, 
> 33080)), AND(>=($4, 33050), <=($4, 33053)), AND(>=($4, 33033), <=($4, 
> 33035))), ST_DWITHIN(POINT (10 20):GEOMETRY, ST_POINT($1, $2), 6))])
> 
> The first part of the condition is precisely a Sarg - a set of integer 
> ranges. If represented an a set of points, it would be much larger.
> 
>> Oracle support this:
>> select * from foo where a > ANY(1, 2, 3, ....., 999);
>> select * from foo where a <= ANY(b+1, b+2, b+3, ....., 999);
>> Although in reality that kind of query might be rare, how would Calcite 
>> represent in Oracle dialect? And would it be good to support operations 
>> other than equal, like <, >=, or even customized operation? Especially when 
>> downstream projects may use calcite to build their own in-house system, some 
>> are not sql standard.
> 
> I think of ANY and ALL - especially on constant lists, as opposed to queries 
> - as syntactic sugar.
> 
> The first one can be represented as a Sarg:
> 
>   SEARCH(a, SARG((1, inf), (2, inf), (3, inf), … (999, inf))
> 
> And the Sarg would optimize itself, on construction, to a single range, 
> SARG((1, inf)). This illustrates the power of Sargs - they are simple 
> mathematical objects, they are in a canonical form, and there are only a few 
> operations. It would take thought to optimize
> 
>   a > ANY (1, 2, 3, …, 9)
> 
> to
> 
>   a > 1
> 
> but Sarg does it automatically.
> 
> The second query cannot be represented as a Sarg, because it is not a 
> constant.
> 
>> BTW, what is your concern about the proposal of RexListCmp [1]?
> 
> RexListCmp is certainly moving in the same direction as Sarg - a generalized 
> IN-list. But it is less elegant - it is not clear to me what are the 
> fundamental operations on a RexListCmp.
> 
> The ability to choose your operator adds complexity but little power. The 
> “generalized IN lists” that I see in plans are better modeled by a mixture of 
> operators - "x > 3 AND x < 10 OR x = 100 OR x > 1000” than having one 
> operator throughout the list.
> 
> I was not aware of Oracle's “expression relop (ANY | ALL) (list)” syntax and 
> I don’t think it’s very common. It can be expanded at compile time to an OR, 
> and that OR can be converted to a Sarg if list consists only of constant 
> expressions. I think that is a good solution to the common case.
> 
> Julian
>

Re: [DISCUSS] Sarg (search argument) to generalize and replace IN in RexCall

Reply via email to