[jira] [Commented] (KYLIN-5704) For ‘in’ condition query of non-time partition columns, when the data type of the value in 'in' condition is inconsistent with that of the non-time partition column, the segment pruner fails, resulting in full Segment scanning

Hongrong Cao (Jira) Wed, 15 Nov 2023 00:11:34 -0800


    [ 
https://issues.apache.org/jira/browse/KYLIN-5704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786217#comment-17786217
 ]


Hongrong Cao commented on KYLIN-5704:
-------------------------------------

h2. Problem Analysis

An expression of the form `(cast col as string) in (x1,x2..)` is rewritten by the Spark optimizer, which applies the following rule:

{{cast(fromExp, toType) op value ==> fromExp op cast(value, fromType)}}, producing the final `IN` or `InSet` plan node.

The `convertCastFilter` method, however, does not handle either of these two cases, so filter pushdown cannot be applied and Segments are not filtered.
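The rewrite rule above can be illustrated with a small, self-contained model. This is only a sketch, not Kylin or Spark code: the classes `Attribute`, `Cast`, `In`, the `_CONVERTERS` table, and the function `unwrap_cast_in` are hypothetical names introduced for illustration. It assumes a literal may be moved to the column's type only when casting it back reproduces the original literal; otherwise the filter is left unchanged.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Attribute:
    name: str
    dtype: str  # column type, e.g. "int" or "string"


@dataclass(frozen=True)
class Cast:
    child: Attribute
    to_type: str


@dataclass(frozen=True)
class In:
    value: Union[Attribute, Cast]
    items: Tuple[object, ...]


# Hypothetical literal converters, keyed by type name (illustration only).
_CONVERTERS = {"int": int, "string": str}


def unwrap_cast_in(expr: In) -> In:
    """Rewrite cast(col as T) in (v1, ...) to col in (cast(v1 as colType), ...)."""
    if not isinstance(expr.value, Cast):
        return expr  # nothing to unwrap
    col = expr.value.child
    to_col = _CONVERTERS.get(col.dtype)
    back = _CONVERTERS.get(expr.value.to_type)
    if to_col is None or back is None:
        return expr  # unknown type: keep the original filter
    converted = []
    for v in expr.items:
        try:
            c = to_col(v)
        except (TypeError, ValueError):
            return expr  # literal not representable in the column type
        if back(c) != v:
            return expr  # cast is not reversible for this literal: stay conservative
        converted.append(c)
    return In(col, tuple(converted))
```

For example, with an `int` column `city_id`, `cast(city_id as string) in ('1', '2')` becomes `city_id in (1, 2)`, while a literal such as `'01'` leaves the expression untouched because it does not survive the round trip.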

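Looking at the query logic as a whole, three main problems need to be solved:

 # Empty-set exception during dimension column pruning: the problem in this ticket arises because the filter fields contain a dimension column while the set of expressions extracted on that dimension column is empty, which leads to an Exception. Some scenarios therefore need to be able to return an expression that can be evaluated, as in the case of this scenario;

 # The current logic does not cover comparison expressions or IN/INSET expressions whose left-hand expression is not a simple Attribute expression, or whose right-hand expression is a literal or a non-literal that can be evaluated;

 # Code optimization: address filtering performance for In/InSet expressions over very large value sets; optimize the dim/shard column pruning process.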
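One possible way to keep In/InSet filtering cheap over very large value sets is sketched below. It is an assumption, not the approach taken in the ticket: it supposes each segment exposes a closed `[min, max]` range for the dimension column, and `prune_segments` and its parameters are hypothetical names. The value set is sorted once and each segment is then checked with a binary search instead of scanning every value.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence, Tuple


def prune_segments(
    ranges: Dict[str, Tuple[int, int]], in_values: Sequence[int]
) -> List[str]:
    """Keep segments whose [min, max] range contains at least one IN value."""
    if not in_values:
        return []  # an empty IN list matches nothing
    sorted_values = sorted(set(in_values))  # sort once, reuse for every segment
    kept = []
    for name, (lo, hi) in ranges.items():
        # Index of the first IN value >= lo; the segment can match only if
        # that value is also <= hi.
        idx = bisect_left(sorted_values, lo)
        if idx < len(sorted_values) and sorted_values[idx] <= hi:
            kept.append(name)
    return kept
```

With n values and m segments this costs roughly O(n log n + m log n), compared with O(n * m) for testing every value against every segment.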

h2. Design

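Option 1: add support for `(cast expr as alias) in (…)` expressions to the method and convert them into `equality expressions`, so that the optimizer can optimize them in the subsequent steps. The handling can follow Spark's OptimizeIn and UnwrapCastInBinaryComparison optimization logic, with corresponding rules added to `FilterPrunner`.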

Option 2: expand the In expression into equality expressions, so that the IN case no longer needs to be handled separately.
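The expansion in Option 2 might look like the following sketch. The `EqualTo` and `Or` classes and the `expand_in` function are hypothetical stand-ins for the corresponding expression nodes, not the actual Kylin or Spark types.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Tuple, Union


@dataclass(frozen=True)
class EqualTo:
    column: str
    value: object


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


Expr = Union[EqualTo, Or]


def expand_in(column: str, values: Tuple[object, ...]) -> Expr:
    """Expand `column in (v1, v2, ...)` into `column = v1 or column = v2 or ...`."""
    if not values:
        raise ValueError("cannot expand an empty IN list")
    # Drop duplicate values while keeping their original order.
    equalities = [EqualTo(column, v) for v in dict.fromkeys(values)]
    # Fold left: Or(Or(eq1, eq2), eq3) ...; a single value yields a bare EqualTo.
    return reduce(Or, equalities)
```

The resulting tree grows linearly with the number of values, so this form is likely better suited to short IN lists than to very large sets.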

> For ‘in’ condition query of non-time partition columns, when the data type of 
> the value in 'in' condition is inconsistent with that of the non-time 
> partition column, the segment pruner fails, resulting in full Segment scanning
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-5704
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5704
>             Project: Kylin
>          Issue Type: Bug
>    Affects Versions: 5.0-alpha
>            Reporter: Hongrong Cao
>            Assignee: Guangyuan Feng
>            Priority: Major
>             Fix For: 5.0-beta
>
>
> The queried column is a non-time partition column, that is, an ordinary 
> dimension column, and its filter condition is col in (x1, x2...). Because 
> the types of col and x1 do not match, the condition is automatically 
> converted to (cast col as string) in (x1,x2..). FilePruner then reports an 
> error because 
> org.apache.spark.sql.execution.datasource.FilePruner#convertCastFilter does 
> not handle in.
> The purpose of convertCastFilter is to remove the cast so that the filter 
> condition can be matched when DataSourceStrategy.translateFilter is called, 
> after which the Segments can be filtered. However, convertCastFilter 
> currently does not process the in condition, so translateFilter finds no 
> match and returns an empty result, and the query throws an error.
> In addition: for a time partition column, an error here does not matter, 
> because the Calcite file pruner has already completed Segment pruning on 
> the time partition column in an earlier step.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
