Hello Andrei!

I've also bumped into this earlier - from a little bit different angle :)

I've found some "back story about it": It works currently like this because a 
long time ago there was no other way to do the subquery conversion of IN;
I think now the preferred way would be to have a rule to do that conversion...

About (3) leaving INs as is: I've experimented with that earlier and it seems 
like there are some places where IN is handled only as a subquery.
If we don't open INs; then there is a chance that some expression 
simplification will not happen (because it's not handled there either).

For Hive we've decided to follow the Calcite way for now; so Hive also opens 
ORs before it calls Calcite - but there's also a rule which tries to close ORs 
into an IN.
https://github.com/apache/hive/blob/236a32c645a21b04ccaf7f18db5c6a5aa53586e8/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HivePointLookupOptimizerRule.java#L217

I plan to get back to this; or at least get CALCITE-2444.

cheers,
Zoltan


On 08/14/2018 07:38 AM, Andrei Sereda wrote:
Hello,

I have noticed that our queries get converted into full table scans when
they contain IN predicates with long list of elements (eg. id IN ($manyIds)
).

Upon some debuging the reason seems to be
SqlToRelConverter.substituteSubQuery()
<https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql2rel/SqlToRelConverter.java#L1001>.
Calcite will automatically translate subqueries into joins when
DEFAULT_IN_SUB_QUERY_THRESHOLD
<https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql2rel/SqlToRelConverter.java#L221>
is
reached (currently 20 elements). We have instances when $manyIds can have
10K of elements (out of 50M rows table).

Some questions:

    1.

    In order to skip this query translation seems like I have to change
    SqlToRelConverter.Config.InSubQueryThreshold to MAX_VALUE. How can one
    do it in CalciteConnection or PreparedStatement ? Is is the only way ?
    2.

    Is SubQueryConverter
    
<https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql2rel/SubQueryConverter.java>
    of any use in this particular case ?
    3.

    Can one keep IN predicate "as is" without converting to disjunctions
    (see convertInToOr
    
<https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql2rel/SqlToRelConverter.java#L1396>)
    ?
    The reason I’m asking is because our elastic queries get pretty big if
    constructucted using ORs (multiple nested JSONs for each element vs
    simple JSON array).

Example to reproduce (for Mongo Adapter)

@Testpublic void subQueryTheshold() {
     // SqlToRelConverter.DEFAULT_IN_SUB_QUERY_THRESHOLD = 20
     // just a long list of ids (>20 to trigger query rewrite)
     final String ids = IntStream.range(0,
SqlToRelConverter.DEFAULT_IN_SUB_QUERY_THRESHOLD)
         .mapToObj(i -> String.format("'%d'", i))
         .collect(Collectors.joining(","));

     assertModel(MODEL)
         .query(
             "select * from zips where state in (" + ids + ")")
         .explainContains("MongoFilter")
         .returns("");
}

As usual, thanks a lot for your help.

Andrei.

Reply via email to