[
https://issues.apache.org/jira/browse/KYLIN-5704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786217#comment-17786217
]
Hongrong Cao commented on KYLIN-5704:
-------------------------------------
h2. Problem Analysis
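An expression of the form `(cast col as string) in (x1,x2..)` is rewritten by the Spark optimizer, which applies the following rule:
{{cast(fromExp, toType) op value ==> fromExp op cast(value, fromType)}}
The result is a final `In` or `InSet` plan node.
The `convertCastFilter` method handles neither of these two cases, so the filter cannot be pushed down and the Segments are not pruned.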
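To make the rewrite rule quoted above concrete, here is a minimal sketch of moving the cast from the column onto the literals of an IN list. It is a toy model, not Spark's or Kylin's implementation: the `Attribute`, `Cast`, and `In` classes, the `_CONVERTERS` table, and `unwrap_cast_in` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Attribute:
    name: str
    dtype: str  # column type, e.g. "int" or "string"


@dataclass(frozen=True)
class Cast:
    child: Attribute
    to_type: str


@dataclass(frozen=True)
class In:
    left: Union[Attribute, Cast]
    values: Tuple[object, ...]


# Hypothetical converters from a literal back to the column's own type.
_CONVERTERS = {"int": int, "string": str}


def unwrap_cast_in(expr: In) -> In:
    """Rewrite `cast(col as T) IN (v1, ...)` as `col IN (cast(v1 as colType), ...)`.

    The expression is returned unchanged whenever the rewrite is not
    clearly safe, so the caller never sees a filter with altered meaning.
    """
    if not isinstance(expr.left, Cast):
        return expr
    col = expr.left.child
    convert = _CONVERTERS.get(col.dtype)
    if convert is None:
        return expr
    converted = []
    for value in expr.values:
        try:
            candidate = convert(value)
        except (TypeError, ValueError):
            return expr  # literal is not representable in the column type
        if str(candidate) != str(value):
            return expr  # literal does not round-trip (e.g. "01" vs 1)
        converted.append(candidate)
    return In(col, tuple(converted))
```

With an `int` column, `In(Cast(col, "string"), ("1", "2"))` becomes `In(col, (1, 2))`, which has a plain attribute on the left-hand side. A literal such as `"x"` or `"01"` leaves the expression untouched.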
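Looking at the whole query logic, three main problems need to be solved:
# Empty-set exception during dimension column pruning: the problem in this ticket occurs because the filter fields include a dimension column while the set of expressions extracted for that dimension column is empty, which raises an Exception. Some scenarios therefore need to return an expression that can be evaluated, as in the case at hand;
# Comparison expressions and IN/INSET expressions are not covered when the left-hand expression is not a simple Attribute expression, or when the right-hand expression is a literal or an evaluable non-literal;
# Code optimization: address the filtering performance of In/InSet expressions over very large sets, and optimize the dim/shard column pruning process.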
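One possible way to keep pruning cheap for very large In/InSet value sets is sketched below. It assumes, purely for illustration, that each Segment exposes a `[min, max]` range for the dimension column; the `segment_ranges` mapping and the `prune_segments` function are hypothetical and do not reflect Kylin's actual data structures.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence, Tuple


def prune_segments(
    segment_ranges: Dict[str, Tuple[int, int]],
    in_values: Sequence[int],
) -> List[str]:
    """Keep the segments whose [min, max] range contains at least one IN value."""
    # Sort and de-duplicate once, then reuse the result for every segment.
    sorted_values = sorted(set(in_values))
    kept = []
    for name, (lo, hi) in segment_ranges.items():
        # Index of the smallest IN value that is >= lo.
        i = bisect_left(sorted_values, lo)
        if i < len(sorted_values) and sorted_values[i] <= hi:
            kept.append(name)
    return kept
```

Each segment is checked with one binary search instead of a scan over every value, so the cost is roughly O((n + s) log n) for n values and s segments. An empty IN list keeps no segment, which matches the fact that `col in ()` can never be true.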
h2. Design
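Option 1: add support for `(cast expr as alias) in (…)` expressions to the method, converting them into `equality expressions` so that they can be optimized by the optimizer in the later stages. The OptimizeIn and UnwrapCastInBinaryComparison optimization logic in Spark can serve as a reference, with the corresponding handling rules added to `FilePruner`.
Option 2: expand the In expression into equality expressions, so that the IN case no longer needs to be considered.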
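The In expansion described in the second option can be sketched as follows. This is a minimal illustration with hypothetical `EqualTo` and `Or` filter classes and a hypothetical `expand_in` helper, not the classes used by Spark or Kylin.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Sequence, Union


@dataclass(frozen=True)
class EqualTo:
    column: str
    value: object


@dataclass(frozen=True)
class Or:
    left: "Filter"
    right: "Filter"


Filter = Union[EqualTo, Or]


def expand_in(column: str, values: Sequence[object]) -> Filter:
    """Expand `column IN (v1, v2, ...)` into `column = v1 OR column = v2 OR ...`."""
    if not values:
        raise ValueError("cannot expand an empty IN list")
    equalities = [EqualTo(column, v) for v in values]
    # Fold left: Or(Or(e1, e2), e3) ...; a single value yields a bare EqualTo.
    return reduce(Or, equalities)
```

The expanded tree grows linearly with the number of values, so this approach is likely to suit short lists better than the very large sets mentioned in the problem analysis.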
> For ‘in’ condition query of non-time partition columns, when the data type of
> the value in 'in' condition is inconsistent with that of the non-time
> partition column, the segment pruner fails, resulting in full Segment scanning
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: KYLIN-5704
> URL: https://issues.apache.org/jira/browse/KYLIN-5704
> Project: Kylin
> Issue Type: Bug
> Affects Versions: 5.0-alpha
> Reporter: Hongrong Cao
> Assignee: Guangyuan Feng
> Priority: Major
> Fix For: 5.0-beta
>
>
> The query column is a non-time-partition column (an ordinary dimension column),
> and its filter condition is col in (x1, x2...).
> In this case, because the types of col and x1 do not match, the condition is
> automatically converted to (cast col as string) in (x1,x2..), and FilePruner
> reports an error because
> org.apache.spark.sql.execution.datasource.FilePruner#convertCastFilter does
> not handle in.
> The purpose of the convertCastFilter method is to remove the cast, so
> that the filter condition can be matched when calling
> DataSourceStrategy.translateFilter, and the Segments can then be filtered.
> However, convertCastFilter currently does not process the in
> condition, so translateFilter cannot match it and returns empty, and the query
> throws an error.
> In addition: if it is a time partition column, it does not matter if an error
> is reported here, because in the previous steps the Calcite file pruner has
> already completed the Segment pruning of the time partition column.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)