[ https://issues.apache.org/jira/browse/KYLIN-5704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786217#comment-17786217 ]

Hongrong Cao commented on KYLIN-5704:
-------------------------------------

h2. Problem Analysis

An expression of the form `(cast col as string) in (x1, x2, ...)` is optimized by the Spark optimizer, which applies the rule

{{cast(fromExp, toType) op value ==> fromExp op cast(value, fromType)}}

to produce the final `IN` or `InSet` plan node.

However, the `convertCastFilter` method does not handle these two cases, so the filter cannot be pushed down and segments cannot be pruned.
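To illustrate the rewrite rule above, here is a minimal Python sketch (not the actual Catalyst implementation; the type names and helper are hypothetical stand-ins): unwrapping the cast from an IN predicate means casting each literal back to the column's original type, and bailing out if any literal cannot be converted losslessly.

```python
# Sketch of cast-unwrapping for an IN predicate:
#   cast(col as string) IN (lits)  ==>  col IN (cast(lits as fromType))
# The type-name strings and converter table are illustrative assumptions.

def unwrap_cast_in(column_type, literals):
    """Return the literal list converted to the column's original type,
    or None if any literal cannot be converted (in which case the original
    cast expression must be kept and segment pruning is not possible)."""
    converters = {"int": int, "float": float, "string": str}
    convert = converters[column_type]
    converted = []
    for lit in literals:
        try:
            converted.append(convert(lit))
        except (TypeError, ValueError):
            return None  # lossy/invalid conversion: keep the cast, skip pruning
    return converted

# cast(int_col as string) IN ('100', '200')  ==>  int_col IN (100, 200)
print(unwrap_cast_in("int", ["100", "200"]))   # [100, 200]
print(unwrap_cast_in("int", ["100", "abc"]))   # None (cannot unwrap)
```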

Analyzing the overall query logic, three main problems need to be solved:
 # Empty-set exception during dimension column pruning: the problem in this ticket occurs because the filter contains a dimension column but the set of expressions extracted for that column is empty, which raises an exception; in scenarios such as this CASE, an expression that can still be evaluated for pruning should be returned instead.

 # Coverage gaps: comparison or In/InSet expressions whose left side is not a simple Attribute expression, or whose right side is a literal or an evaluable non-literal, are not handled.

 # Code optimization: improve the filtering performance of In/InSet expressions over very large value sets, and streamline the dim/shard column pruning process.
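On the third point, the performance concern is that probing a very large IN value list is linear per probe; a set-based membership test (analogous to Spark's OptimizeIn rule converting `In` to `InSet` past a size threshold) makes each probe O(1). A minimal sketch, with the threshold value assumed for illustration:

```python
# Sketch of the In -> InSet style rewrite for large value lists.
IN_SET_THRESHOLD = 10  # illustrative threshold; Spark reads this from a config

def build_membership_test(values):
    """Return a membership predicate for an IN value list.

    Small lists keep the simple list scan; large lists are converted to a
    frozenset so each probe is O(1) instead of O(len(values))."""
    if len(values) > IN_SET_THRESHOLD:
        value_set = frozenset(values)
        return lambda v: v in value_set
    return lambda v: v in values

contains = build_membership_test(list(range(100_000)))
print(contains(99_999))  # True, via one hash lookup rather than a list scan
```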

h2. Design

Option 1: Add support for `(cast expr as alias) in (…)` expressions in the method by converting them into `equality expressions`, so they can be optimized in subsequent steps. This can follow the logic of Spark's OptimizeIn and UnwrapCastInBinaryComparison optimization rules, adding the corresponding handling rules to `FilePruner`.

Option 2: Expand the In expression into a disjunction of equality expressions, so the IN case no longer needs special handling.
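Option 2 can be sketched as follows (a Python illustration, with expression nodes modeled as nested tuples standing in for real plan nodes): the IN predicate is rewritten into a chain of OR'ed equality expressions, each of which the existing equality handling can already prune on.

```python
# Sketch of expanding  col IN (x1, x2, ...)  into
# (col = x1) OR (col = x2) OR ...  Nodes are tuples, not Catalyst classes.

def expand_in(column, values):
    """Rewrite an IN predicate as a disjunction of equality predicates."""
    if not values:
        raise ValueError("IN with no values cannot be expanded")
    expr = ("eq", column, values[0])
    for v in values[1:]:
        expr = ("or", expr, ("eq", column, v))
    return expr

print(expand_in("col", ["x1", "x2", "x3"]))
# ('or', ('or', ('eq', 'col', 'x1'), ('eq', 'col', 'x2')), ('eq', 'col', 'x3'))
```

A trade-off to note: expansion removes the IN special case, but a very large value list produces a correspondingly deep OR tree, which is why Option 1 keeps the set form.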

> For ‘in’ condition query of non-time partition columns, when the data type of 
> the value in 'in' condition is inconsistent with that of the non-time 
> partition column, the segment pruner fails, resulting in full Segment scanning
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-5704
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5704
>             Project: Kylin
>          Issue Type: Bug
>    Affects Versions: 5.0-alpha
>            Reporter: Hongrong Cao
>            Assignee: Guangyuan Feng
>            Priority: Major
>             Fix For: 5.0-beta
>
>
> The query column is a non-time partition column and a common dimension column,
> and the filter condition on that dimension column is col in (x1, x2, ...).
> Because the types of col and x1 do not match, the condition is automatically
> converted to (cast col as string) in (x1, x2, ...), and FilePruner reports an
> error because
> org.apache.spark.sql.execution.datasource.FilePruner#convertCastFilter does
> not handle in.
> To explain: the convertCastFilter method removes the cast so that the filter
> condition can be matched when calling DataSourceStrategy.translateFilter, and
> the Segment can then be filtered. However, convertCastFilter currently misses
> the in condition, so translateFilter cannot match and returns empty, and the
> query throws an error.
> In addition: if it is a time partition column, an error here does not matter,
> because in the earlier steps the Calcite file pruner has already completed
> Segment pruning on the time partition column.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
