[ 
https://issues.apache.org/jira/browse/CALCITE-6467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866019#comment-17866019
 ] 

Gonzalo Ortiz commented on CALCITE-6467:
----------------------------------------

It seems we are transforming the IN clause into an OR in 
[PinotFilterExpandSearchRule|https://github.com/apache/pinot/blob/master/pinot-query-planner/src/main/java/org/apache/pinot/calcite/rel/rules/PinotFilterExpandSearchRule.java].
 We do that because we don't have a rule that transforms SEARCH into a valid 
pinot operator. As said before, we are not doing things 100% as Calcite 
expects. Instead of using rules to transform Logical nodes into our own 
RelNodes that include implementation, we transform them into other classes 
using a visitor.

I'm going to try to modify Pinot to be able to use SEARCH RelNodes, which 
should mitigate this problem. Anyway, what is reported here is an important 
regression in Calcite performance when large ORs are optimized. Feel free to 
close this issue if you think it is an acceptable perf regression.

> Performance of RelMdUtil.checkInputForCollationAndLimit when using `where col 
> in (large literal set)`
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-6467
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6467
>             Project: Calcite
>          Issue Type: Improvement
>    Affects Versions: 1.32.0, 1.33.0, 1.34.0, 1.35.0, 1.36.0, 1.37.0, 1.38.0
>            Reporter: Gonzalo Ortiz
>            Priority: Major
>         Attachments: MultistageEngineQuickStart_2024_07_12_111558.jfr, 
> image-2024-07-12-15-58-33-504.png
>
>
> Recently we have updated Pinot to use Calcite 1.37. Previously we were using 
> 1.31.
> After the upgrade, we have found issues when executing some queries that 
> include large IN clauses. Queries like:
> {code:java}
> explain plan for
> SELECT DestCityName
> FROM (
>          SELECT DestCityName
>          FROM airlineStats
>          WHERE DestCityName IN (
>                         'a1', 'a2', 'a3', ... 'a300'
>              )
>          GROUP BY DestCityName
>      ) as a
> {code}
> After some debug, we have found that the issue is in one of our custom rules (
> PinotSortExchangeCopyRule) when we call 
> `RelMdUtil.checkInputForCollationAndLimit` 
> ([link|https://github.com/apache/pinot/blob/dacc6d06907c44e83721454f1090e5f00c824f15/pinot-query-planner/src/main/java/org/apache/pinot/calcite/rel/rules/PinotSortExchangeCopyRule.java#L64]).
>  
> Comparing different Pinot versions, I've found out that the change that is 
> causing problems in our scenario is #CALCITE-5036 (see
>  
> [https://github.com/apache/calcite/pull/2743)|https://github.com/apache/calcite/pull/2743/files).].
>  Specifically, it looks like the problem appears when `RelMdPredicates` tries 
> to simplify the expression, which at the time is basically an 
> `OR(DestCityName = a1, DestCityName = a2, ...)`. 
> `RexSimplify.simplifyOrTerms` negates previous terms in order to apply 
> possible optimizations, but in cases like this where we have hundreds of 
> literals, that is very expensive. Going more in detail, it looks like most of 
> the time is invested in creating new ranges:
> !image-2024-07-12-15-58-33-504.png!
>  
> You may find more insights in the attached JFR file
> [^MultistageEngineQuickStart_2024_07_12_111558.jfr][^MultistageEngineQuickStart_2024_07_12_111558.jfr]
>  
> I've tried to reproduce the problem with `sqline` but I wasn't able to do so. 
> As far as I can see in the code, RelMdUtil.checkInputForCollationAndLimit is 
> only called in SortJoinCopyRule, SortJoinTransposeRule and 
> SortUnionTransposeRule. I've tried to create a test or JMH benchmark in 
> Calcite to try to reproduce the issue, but I don't know codebase well enough.
>  
> I don't consider myself an expert on Apache Calcite and I know we are not 
> using Calcite in the most standard way (we are slowly migrating from our own 
> engine to Calcite), but I'm pretty confident this issue may also affect other 
> Calcite usages. At least in the trace I cannot see anything Pinot specific.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to