[
https://issues.apache.org/jira/browse/KYLIN-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928361#comment-17928361
]
Guoliang Sun commented on KYLIN-6047:
-------------------------------------
h3. Root Cause
The issue arises from two aspects:
1. Default Configuration of `kylin.query.convert-in-to-or-threshold`:
- The parameter is set to a default value of 20, which is the default
behavior in Calcite. Its function is to convert `IN` filter conditions into
`OR`-connected conditions. The parameter value refers to the number of
condition values in the `IN` clause.
2. SQL-Specific Issue:
- For this SQL, the two-column `IN` filter condition is transformed into a
`Row` operator format in Calcite. However, Kylin does not implement this
functionality when converting the Calcite Logical Plan to the Spark Logical
Plan, resulting in an error.
----
h4. 2.1 Key Questions and Answers
* Q1: Why does the older version support such queries (older versions refer to
Kylin versions before upgrading Calcite from 1.16 to 1.30)?
** A1:
In the older version of Calcite, this parameter was set to `Integer.MAX_VALUE`.
In theory, users' filter condition values would never reach such a scenario.
* Q2: Why was the value set to `Integer.MAX_VALUE` in the older version of
Calcite, and why wasn't this behavior retained after upgrading Calcite?
** A2:
*** Reason for Setting to `Integer.MAX_VALUE` in Older Calcite
Versions:Calcite had a functional limitation at the time: it could not support
such SQL queries and would throw an error during the parsing phase. Until
version 1.17, Calcite added support for the `Row` operator
([CALCITE-2276](https://issues.apache.org/jira/browse/CALCITE-2276): Allow
explicit ROW value constructor in SELECT clause and elsewhere). To bypass this
issue, Kylin simply set the parameter to `Integer.MAX_VALUE`, forcing all `IN`
expressions to be converted into `OR` expressions.
*** Reasons for Not Continuing to Use `Integer.MAX_VALUE`:
**** Performance Overhead: Converting `IN` to `OR` introduces performance
overhead. Calcite's default value is 20, and to align with the community, we
adopted this value. While the community can only modify this configuration at
the code level, KE supports system-level parameter configuration. Customers
encountering this issue can temporarily increase the parameter value as a
workaround.
**** Spark-Specific Limitations:There are differences between how Calcite and
Spark handle filter conditions. When the number of condition values becomes
excessive, it may cause a stack overflow in Spark. To illustrate, consider the
following filter expression: *(a, b) IN ((1, 2), (2, 3), (3, 4), ..., (n,
n+1).* During the computation of the Spark DataFrame in KE, the conversion of
the filter expression involves the main class
`org.apache.kylin.query.runtime.plan.FilterPlan`. The code that converts
Calcite `RexNode` to Spark `Expr` can be found in
`org.apache.kylin.query.runtime.SparderRexVisitor` and
`org.apache.kylin.query.runtime.ExpressionConverter`. These classes handle the
extraction and transformation of most logical plans and function operators. A
deep call stack during the traversal of `Expr` operations lays the groundwork
for potential issues.
***** !image-2025-02-19-17-21-17-187.png|width=470,height=251!
> Error Occurs When the Number of Values in an IN Clause Reaches 20
> -----------------------------------------------------------------
>
> Key: KYLIN-6047
> URL: https://issues.apache.org/jira/browse/KYLIN-6047
> Project: Kylin
> Issue Type: Bug
> Affects Versions: 5.0.0
> Reporter: Guoliang Sun
> Priority: Major
> Attachments: image-2025-02-19-17-21-17-187.png
>
>
> h3. Temporary Solution
> Increase the value of `kylin.query.convert-in-to-or-threshold`. However,
> setting this parameter too high may lead to performance issues, as there
> could be cases where the number of values in the `IN` clause exceeds 100. A
> fix is required to address this issue properly.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)