[
https://issues.apache.org/jira/browse/KYLIN-5704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786231#comment-17786231
]
Hongrong Cao commented on KYLIN-5704:
-------------------------------------
Problem Analysis
Expressions such as `(cast col as string) in (x1, x2, ...)` are rewritten by the
Spark optimizer, which applies the rule
cast(fromExp, toType) op value ==> fromExp op cast(value, fromType)
to produce the final `In` or `InSet` plan node. The `convertCastFilter` method,
however, does not handle these two cases, so the filter cannot be pushed down
and Segments cannot be pruned.
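The unwrap-cast rewrite described above can be sketched as follows. This is a
minimal, self-contained model of the idea only; the classes and the
`unwrap_cast_in_in` function are illustrative stand-ins, not Spark's actual
`Cast`/`In` expressions or Kylin's `convertCastFilter`:

```python
# Minimal model of the cast(fromExp, toType) op value ==>
# fromExp op cast(value, fromType) rewrite, extended to IN lists.
# All class and function names here are illustrative, not Spark's.
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Attr:
    name: str
    data_type: str

@dataclass(frozen=True)
class Lit:
    value: Any
    data_type: str

@dataclass(frozen=True)
class Cast:
    child: Any
    to_type: str

@dataclass(frozen=True)
class In:
    left: Any
    values: tuple

def unwrap_cast_in_in(expr):
    """Push the cast from the column side onto each literal, so the
    pruner sees a bare attribute it can match to a segment column."""
    if isinstance(expr, In) and isinstance(expr.left, Cast) \
            and isinstance(expr.left.child, Attr):
        attr = expr.left.child
        recast = []
        for lit in expr.values:
            # Illustrative lossless conversion: digit-only string
            # literals can be cast back to an int column's type.
            if attr.data_type == "int" and isinstance(lit.value, str) \
                    and lit.value.isdigit():
                recast.append(Lit(int(lit.value), "int"))
            else:
                return expr  # not lossless -> leave unchanged (safe fallback)
        return In(attr, tuple(recast))
    return expr
```

For example, `unwrap_cast_in_in(In(Cast(Attr("col", "int"), "string"),
(Lit("1", "string"), Lit("2", "string"))))` yields
`In(Attr("col", "int"), (Lit(1, "int"), Lit(2, "int")))`, a bare-attribute
filter that can then be matched against a segment column.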
Analyzing the full query path, three main issues need to be addressed:
1. Empty-set exception during dimension column pruning: the error in this
ticket occurs because the filter contains dimension columns, yet the set of
expressions extracted over those dimension columns is empty, which raises an
exception. In scenarios like this one, the pruner needs to be able to return
expressions that can still be estimated.
2. The conversion does not cover the case where the left side of a comparison
or In/InSet expression is a non-simple attribute expression, or where the
right side is a literal or an estimable non-literal value.
3. Code optimization: improve filtering performance for In/InSet expressions
with very large value sets, and streamline dimension/shard column pruning.
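On the large-set point, the standard fix is the same idea Spark itself applies
when converting `In` to `InSet`: replace the per-probe linear scan over the
value list with a one-time hash-set build. A minimal sketch, assuming a
threshold in the spirit of Spark's
`spark.sql.optimizer.inSetConversionThreshold` (default 10); the function name
is illustrative:

```python
# Membership test for an In/InSet-style filter: small value lists keep
# the linear scan; large lists build a hash set once so each probe is
# O(1) instead of O(n). The threshold mirrors Spark's In -> InSet
# conversion threshold.
def build_membership_test(values, threshold=10):
    if len(values) > threshold:
        value_set = set(values)          # one-time O(n) build
        return lambda v: v in value_set  # O(1) per probe
    return lambda v: v in values         # small lists: scan is cheap
```

With thousands of literals in the IN list, this turns segment filtering from
quadratic (values times probes) into linear work.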
Dev Design
Option 1: Add support for the `(cast expr as alias) in (...)` expression in
the method by converting it to an equivalent expression, so that the optimizer
can handle it later in the pipeline. The OptimizeIn and
UnwrapCastInBinaryComparison optimization rules in Spark can serve as
references; add the corresponding handling rules to `FilePruner`.
Option 2: During expression expansion, expand IN into an equivalent
expression so that the IN case no longer needs special handling.
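Option 2 amounts to rewriting `col in (v1, v2, ...)` into
`col = v1 OR col = v2 OR ...` before any cast handling runs, so the existing
unwrapping for binary comparisons covers IN with no special case. A minimal
sketch over an illustrative AST (none of these class names are Spark's or
Kylin's):

```python
# Option 2 sketch: expand IN into an OR-chain of equality comparisons.
# The AST below is illustrative, not Spark's or Kylin's.
from dataclasses import dataclass
from functools import reduce
from typing import Any

@dataclass(frozen=True)
class ColRef:
    name: str

@dataclass(frozen=True)
class Value:
    v: Any

@dataclass(frozen=True)
class Eq:
    left: Any
    right: Any

@dataclass(frozen=True)
class Or:
    left: Any
    right: Any

@dataclass(frozen=True)
class InPred:
    left: Any
    values: tuple

def expand_in(pred):
    """Rewrite `left in (v1, ..., vn)` as `left = v1 OR ... OR left = vn`."""
    if isinstance(pred, InPred) and pred.values:
        return reduce(Or, (Eq(pred.left, v) for v in pred.values))
    return pred
```

The trade-off versus Option 1 is plan size: a long IN list becomes an equally
deep OR chain, which is one reason a dedicated rule for the IN case scales
better on very large value sets.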
> For an 'in' condition query on non-time partition columns, when the data type
> of the values in the 'in' condition is inconsistent with that of the non-time
> partition column, the segment pruner fails, resulting in a full Segment scan
> --------------------------------------------------------------------------
>
> Key: KYLIN-5704
> URL: https://issues.apache.org/jira/browse/KYLIN-5704
> Project: Kylin
> Issue Type: Bug
> Affects Versions: 5.0-beta
> Reporter: Hongrong Cao
> Assignee: Guangyuan Feng
> Priority: Major
> Fix For: 5.0.0
>
>
> The query column is a non-time partition column, i.e. an ordinary dimension
> column, and its filter condition is col in (x1, x2...). Because the types of
> col and x1 do not match, the condition is automatically rewritten as
> (cast col as string) in (x1, x2...), and FilePruner reports an error because
> org.apache.spark.sql.execution.datasource.FilePruner#convertCastFilter does
> not handle `in`.
> To explain: the convertCastFilter method strips the cast so that the filter
> condition can be matched when DataSourceStrategy.translateFilter is called,
> after which the Segment can be pruned. Currently, however, convertCastFilter
> misses the `in` condition, so translateFilter fails to match and returns
> empty, and the query throws an error.
> In addition: if the column is a time partition column, an error here does not
> matter, because the Calcite file pruner has already completed Segment pruning
> on the time partition column in an earlier step.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)