[jira] [Commented] (KYLIN-6047) Error Occurs When the Number of Values in an IN Clause Reaches 20

Guoliang Sun (Jira) Wed, 19 Feb 2025 01:22:17 -0800


    [ 
https://issues.apache.org/jira/browse/KYLIN-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928361#comment-17928361
 ]


Guoliang Sun commented on KYLIN-6047:
-------------------------------------

h3. Root Cause

The issue arises from two aspects:  

1. Default Configuration of `kylin.query.convert-in-to-or-threshold`:
   - The parameter is set to a default value of 20, which is the default 
behavior in Calcite. Its function is to convert `IN` filter conditions into 
`OR`-connected conditions. The parameter value refers to the number of 
condition values in the `IN` clause.  

2. SQL-Specific Issue:
   - For this SQL, the two-column `IN` filter condition is transformed into a 
`Row` operator format in Calcite. However, Kylin does not implement this 
functionality when converting the Calcite Logical Plan to the Spark Logical 
Plan, resulting in an error.  
----
h4. 2.1 Key Questions and Answers
 * Q1: Why does the older version support such queries (older versions refer to 
Kylin versions before upgrading Calcite from 1.16 to 1.30)?
 ** A1:
In the older version of Calcite, this parameter was set to `Integer.MAX_VALUE`. 
In theory, users' filter condition values would never reach such a scenario. 
 * Q2: Why was the value set to `Integer.MAX_VALUE` in the older version of 
Calcite, and why wasn't this behavior retained after upgrading Calcite?
 ** A2: 
 *** Reason for Setting to `Integer.MAX_VALUE` in Older Calcite 
Versions:Calcite had a functional limitation at the time: it could not support 
such SQL queries and would throw an error during the parsing phase. Until 
version 1.17, Calcite added support for the `Row` operator 
([CALCITE-2276](https://issues.apache.org/jira/browse/CALCITE-2276): Allow 
explicit ROW value constructor in SELECT clause and elsewhere).  To bypass this 
issue, Kylin simply set the parameter to `Integer.MAX_VALUE`, forcing all `IN` 
expressions to be converted into `OR` expressions. 
 *** Reasons for Not Continuing to Use `Integer.MAX_VALUE`:
 **** Performance Overhead: Converting `IN` to `OR` introduces performance 
overhead. Calcite's default value is 20, and to align with the community, we 
adopted this value. While the community can only modify this configuration at 
the code level, KE supports system-level parameter configuration. Customers 
encountering this issue can temporarily increase the parameter value as a 
workaround. 
 **** Spark-Specific Limitations:There are differences between how Calcite and 
Spark handle filter conditions. When the number of condition values becomes 
excessive, it may cause a stack overflow in Spark. To illustrate, consider the 
following filter expression:  *(a, b) IN ((1, 2), (2, 3), (3, 4), ..., (n, 
n+1).* During the computation of the Spark DataFrame in KE, the conversion of 
the filter expression involves the main class 
`org.apache.kylin.query.runtime.plan.FilterPlan`. The code that converts 
Calcite `RexNode` to Spark `Expr` can be found in 
`org.apache.kylin.query.runtime.SparderRexVisitor` and 
`org.apache.kylin.query.runtime.ExpressionConverter`. These classes handle the 
extraction and transformation of most logical plans and function operators. A 
deep call stack during the traversal of `Expr` operations lays the groundwork 
for potential issues.  
 ***** !image-2025-02-19-17-21-17-187.png|width=470,height=251!

> Error Occurs When the Number of Values in an IN Clause Reaches 20
> -----------------------------------------------------------------
>
>                 Key: KYLIN-6047
>                 URL: https://issues.apache.org/jira/browse/KYLIN-6047
>             Project: Kylin
>          Issue Type: Bug
>    Affects Versions: 5.0.0
>            Reporter: Guoliang Sun
>            Priority: Major
>         Attachments: image-2025-02-19-17-21-17-187.png
>
>
> h3. Temporary Solution
> Increase the value of `kylin.query.convert-in-to-or-threshold`. However, 
> setting this parameter too high may lead to performance issues, as there 
> could be cases where the number of values in the `IN` clause exceeds 100. A 
> fix is required to address this issue properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (KYLIN-6047) Error Occurs When the Number of Values in an IN Clause Reaches 20

Reply via email to