I now have a PR for this change. Can some people review? https://github.com/apache/calcite/pull/2124 <https://github.com/apache/calcite/pull/2124>
There are legitimate concerns that we are adding a new RexCall operator (SEARCH) and there is a considerable effort to support it everywhere. But we are removing two RexCall operators (IN-list and BETWEEN), because SEARCH subsumes them. The benefit is that we can do ‘range algebra’ simplifications without resorting to error-prone Boolean logic simplifications as often as we do today. I have added utilities to introduce SEARCH into SEARCH-less RexNode expressions, and to remove SEARCH from RexNode expressions, and to convert RexNode expressions that contain SEARCH directly into SqlNode expressions. In several places, I just expanded SEARCH, but it would be better to handle SEARCH explicitly, as if it were a comparison operator (along with =, <, <= etc.) It might be worth revisiting the Mongo, Geode and Druid adapters. Geode has an ‘IS SET’ construct, very similar to SQL ‘IN-list’, that would easier to recognize on SEARCH than on OR. Julian > On Aug 13, 2020, at 12:48 PM, Julian Hyde <jhyde.apa...@gmail.com> wrote: > > I have logged https://issues.apache.org/jira/browse/CALCITE-4173 > <https://issues.apache.org/jira/browse/CALCITE-4173> and am close to a > prototype. > > See comments inline, below. > >> On Aug 11, 2020, at 3:39 PM, Haisheng Yuan <hy...@apache.org >> <mailto:hy...@apache.org>> wrote: >> >>> I am proposing to use sargs only where we would today use RexCall(IN). The >>> data structures have about the same size. Sargs are sorted. I cannot see >>> any cases that would become more expensive. >> >> IIRC, RexCall(IN) is not supported yet. > > It’s not officially supported. But it is there - added in > https://issues.apache.org/jira/browse/CALCITE-2444 > <https://issues.apache.org/jira/browse/CALCITE-2444> for just Rel-to-Sql part > of the process, and people seem to be using it elsewhere, e.g. > https://issues.apache.org/jira/browse/CALCITE-1413 > <https://issues.apache.org/jira/browse/CALCITE-1413> handles IN in CASE. > > I have not used RexCall.IN personally. I think there’s too much cost (in > making all places aware of the IN operator) and not enough benefit (because > it cannot handle ranges or <>). > > One part of the cost is null-semantics. If the values are not literals, one > of them might evaluate to NULL. So we have to be very conservative, > especially when optimizing “NOT IN (list)”. There are no tests for that > simplification, so we’re probably doing it wrong. > > So, I propose to remove IN. > >> With sargs, you have to sort it, no matter explicitly or implicitly, in case >> 20k values, it will take some time. > > I am not too concerned about the cost of sorting. If you have parsed and > validated an IN list with 20k items, the query has already burned a few CPU > cycles. > > The Sarg data structure is immutable and one instance should suffice for the > whole planning process, so the cost of sorting on creation is more than > repaid by the faster access later on. > >> I am not saying the data structure itself is expensive, but how it is used, >> it all depends on how downstream projects derive stats from it. I have seen >> Greenplum optimizer experienced performance issue for large IN list when >> deriving all the stats info from the thousands of disjoint range intervals. >> Some downstream project may not even use histogram. Some may just use the >> number of IN constants to derive the number of distinct values, I would >> imagine people have to construct the Sarg back to constant list, this is an >> extra burden. > > If there are performance issues, we can add extra operations to the Sarg data > structure without changing its size/cost. For example, record whether all the > ranges are points. And if so, efficiently iterate over all the values. > >> Suppose we have col in (1,2,4,6,8,10, 12,13, 15,17,19, 21,22, 24, 24+2.....) >> 10k values, now the range sets will be {[1,2], [4,4], ....[10,10], [12,13], >> [15,15]....[21,22], [24,24]....}, in the execution engine, what if I want to >> do the look up for this expression? Find out the points, extract ranges and >> disjoint them, or infer points from integer ranges? > > You should transcribe the Sarg into a data structure that makes sense for > your execution engine. Say a bloom filter, a compressed bitmap, or a perfect > hash. > >> IMHO, the interval constraint is more like a kind of logical property and >> can be strategy for simplification, people can derive the interval >> constraints from the IN list, do simplification on it (but I doubt the value >> in practice, which usually brings more trouble than benefits), but we don't >> have to represent it with sarg in RexNode world. > > I agree that it should be a logical property. It should be part of > RelOptPredicateList. But to apply those predicates efficiently, we need one > predicate that searches in a range rather than 20k predicates. > >> I am not asking sargs to support geospatial operations. And geospatial >> intersect is just an example of potential customized operation, which are >> not limited to geospatials. It can be some other binary operations. We can't >> require all the customized operation to have something like ST_Collect. > > Sure, there are other techniques. One of which is generating rasters (e.g. if > you are doing a spatial join of states to national parks, pre-generate for > each national park a bitmap of the all 100x100km squares, and put a 1 in the > square if it overlaps the park). Rasters are kind of similar to sargs. > > I have been working on space-filling curves as indexes for point-to-shape > joins, as part of https://issues.apache.org/jira/browse/CALCITE-1861 > <https://issues.apache.org/jira/browse/CALCITE-1861>. In > https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966 > > <https://github.com/julianhyde/calcite/blob/829c33a524adb727482df27d605d193a0c7a280c/core/src/test/resources/org/apache/calcite/test/RelOptRulesTest.xml#L7966> > see how we transform a call to ST_DWithin to the predicate > > LogicalFilter(condition=[AND(OR(AND(>=($4, 36496), <=($4, 36520)), AND(>=($4, > 36456), <=($4, 36464)), AND(>=($4, 33252), <=($4, 33254)), AND(>=($4, 33236), > <=($4, 33244)), AND(>=($4, 33164), <=($4, 33176)), AND(>=($4, 33112), <=($4, > 33156)), AND(>=($4, 33092), <=($4, 33100)), AND(>=($4, 33055), <=($4, > 33080)), AND(>=($4, 33050), <=($4, 33053)), AND(>=($4, 33033), <=($4, > 33035))), ST_DWITHIN(POINT (10 20):GEOMETRY, ST_POINT($1, $2), 6))]) > > The first part of the condition is precisely a Sarg - a set of integer > ranges. If represented an a set of points, it would be much larger. > >> Oracle support this: >> select * from foo where a > ANY(1, 2, 3, ....., 999); >> select * from foo where a <= ANY(b+1, b+2, b+3, ....., 999); >> Although in reality that kind of query might be rare, how would Calcite >> represent in Oracle dialect? And would it be good to support operations >> other than equal, like <, >=, or even customized operation? Especially when >> downstream projects may use calcite to build their own in-house system, some >> are not sql standard. > > I think of ANY and ALL - especially on constant lists, as opposed to queries > - as syntactic sugar. > > The first one can be represented as a Sarg: > > SEARCH(a, SARG((1, inf), (2, inf), (3, inf), … (999, inf)) > > And the Sarg would optimize itself, on construction, to a single range, > SARG((1, inf)). This illustrates the power of Sargs - they are simple > mathematical objects, they are in a canonical form, and there are only a few > operations. It would take thought to optimize > > a > ANY (1, 2, 3, …, 9) > > to > > a > 1 > > but Sarg does it automatically. > > The second query cannot be represented as a Sarg, because it is not a > constant. > >> BTW, what is your concern about the proposal of RexListCmp [1]? > > RexListCmp is certainly moving in the same direction as Sarg - a generalized > IN-list. But it is less elegant - it is not clear to me what are the > fundamental operations on a RexListCmp. > > The ability to choose your operator adds complexity but little power. The > “generalized IN lists” that I see in plans are better modeled by a mixture of > operators - "x > 3 AND x < 10 OR x = 100 OR x > 1000” than having one > operator throughout the list. > > I was not aware of Oracle's “expression relop (ANY | ALL) (list)” syntax and > I don’t think it’s very common. It can be expanded at compile time to an OR, > and that OR can be converted to a Sarg if list consists only of constant > expressions. I think that is a good solution to the common case. > > Julian >