[jira] [Comment Edited] (KYLIN-5875) Enhance the simplification method of filter condition for segment pruning

pengfei.zhan (Jira) Mon, 15 Jul 2024 01:00:08 -0700


    [ 
https://issues.apache.org/jira/browse/KYLIN-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17865928#comment-17865928
 ]


pengfei.zhan edited comment on KYLIN-5875 at 7/15/24 7:59 AM:
--------------------------------------------------------------

The execution plan generated by VolcanoPlanner can be unstable. Given the 
following sql
{code:sql}
select PART_DT, YEAR (CAL_DT) as y, YEAR (CAL_DT) * 100 + MONTH (CAL_DT) as ym, 
   YEAR (CAL_DT) * 10000 + MONTH (CAL_DT) * 100 + DAYOFMONTH(CAL_DT) as ymd, 
CAL_DT, sum (price)
from SSB_STEP.KYLIN_SALES_SAMPLE a inner join SSB_STEP.KYLIN_CAL_DT_SAMPLE b
on a.PART_DT = b.CAL_DT
where (cast ("CAL_DT" as date) BETWEEN '2012-02-01'
  and '2012-02-29'
  and cast ("CAL_DT" as date) BETWEEN '2012-01-01'
  and '2012-12-31' )
group by PART_DT, CAL_DT
order by PART_DT
    LIMIT 500
{code}

The execution plan you get before kylin5.0-alpha might look like this

{code:java}
raceId: c114c6d5-071e-4cf2-c17b-34c9f22be1ae 2023-11-17T18:51:28,797 INFO  
[project_38744] [Query 7eea9760-219b-dfb5-7433-8ecf4c95bf5e-332] 
util.QueryContextCutter : select layout candidate for 1 olapContext cost 82 ms
traceId: c114c6d5-071e-4cf2-c17b-34c9f22be1ae 2023-11-17T18:51:28,797 INFO  
[project_38744] [Query 7eea9760-219b-dfb5-7433-8ecf4c95bf5e-332] 
exec.SparderPlanExec : OlapToEnumerableConverter
  OlapProjectRel(PART_DT=[$0], Y=[EXTRACT(FLAG(YEAR), $1)], 
YM=[+(*(EXTRACT(FLAG(YEAR), $1), 100), EXTRACT(FLAG(MONTH), $1))], 
YMD=[+(+(*(EXTRACT(FLAG(YEAR), $1), 10000), *(EXTRACT(FLAG(MONTH), $1), 100)), 
EXTRACT(FLAG(DAY), $1))], CAL_DT=[$1], EXPR$5=[$2], ctx=[0@NDataflow 
[model_38744_01]])
    OlapLimitRel(ctx=[0@NDataflow [model_38744_01]], fetch=[500])
      OlapSortRel(sort0=[$0], dir0=[ASC-nulls-first], ctx=[0@NDataflow 
[model_38744_01]])
        OlapAggregateRel(group-set=[[0, 1]], 
groups=[[SSB_STEP.KYLIN_SALES_SAMPLE.PART_DT, 
SSB_STEP.KYLIN_CAL_DT_SAMPLE.CAL_DT]], EXPR$5=[SUM($2)], ctx=[0@NDataflow 
[model_38744_01]])
          OlapProjectRel(PART_DT=[$1], CAL_DT=[$22], PRICE=[$6], 
ctx=[0@NDataflow [model_38744_01]])
            OlapFilterRel(condition=[AND(>=($22, 2012-02-01), <=($22, 
2012-02-29))], ctx=[0@NDataflow [model_38744_01]])
              OlapJoinRel(condition=[=($1, $22)], joinType=[inner], 
ctx=[0@NDataflow [model_38744_01]])
                OlapTableScan(table=[[SSB_STEP, KYLIN_SALES_SAMPLE]], 
ctx=[0@NDataflow [model_38744_01]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]])
                OlapTableScan(table=[[SSB_STEP, KYLIN_CAL_DT_SAMPLE]], 
ctx=[0@NDataflow [model_38744_01]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 
71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
91, 92, 93, 94, 95, 96, 97, 98, 99]])
{code}

The execution plan you get at kylin-5.0.0 might look like this

{code:java}
[49047] 2023-11-17T19:11:50,375 INFO  [SuggestRunner-p-5-t-1] 
util.QueryContextCutter : Cut context OlapToEnumerableConverter
  OlapProjectRel(PART_DT=[$0], Y=[EXTRACT(FLAG(YEAR), $1)], 
YM=[+(*(EXTRACT(FLAG(YEAR), $1), 100), EXTRACT(FLAG(MONTH), $1))], 
YMD=[+(+(*(EXTRACT(FLAG(YEAR), $1), 10000), *(EXTRACT(FLAG(MONTH), $1), 100)), 
EXTRACT(FLAG(DAY), $1))], CAL_DT=[$1], EXPR$5=[$2], ctx=[0@null])
    OlapLimitRel(ctx=[0@null], fetch=[500])
      OlapSortRel(sort0=[$0], dir0=[ASC-nulls-first], ctx=[0@null])
        OlapAggregateRel(group-set=[[0, 1]], groups=[null], EXPR$5=[SUM($2)], 
ctx=[0@null])
          OlapProjectRel(PART_DT=[$1], CAL_DT=[$12], PRICE=[$6], ctx=[0@null])
            OlapFilterRel(condition=[AND(>=(CAST($12):DATE, 
CAST('2012-02-01'):DATE NOT NULL), <=(CAST($12):DATE, CAST('2012-02-29'):DATE 
NOT NULL), >=(CAST($12):DATE, CAST('2012-01-01'):DATE NOT NULL), 
<=(CAST($12):DATE, CAST('2012-12-31'):DATE NOT NULL))], ctx=[0@null])
              OlapJoinRel(condition=[=($1, $12)], joinType=[inner], 
ctx=[0@null])
                OlapTableScan(table=[[SSB_STEP, KYLIN_SALES_SAMPLE]], 
ctx=[0@null], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                OlapTableScan(table=[[SSB_STEP, KYLIN_CAL_DT_SAMPLE]], 
ctx=[0@null], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 
56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 
96, 97, 98, 99]])
{code}


It can be seen that the contents of OlapFilterRel in these two execution plans 
are different, but completely equivalent. And VolcanoPlanner gives what it 
thinks the best execution plan is actually different, depending on the 
execution cost of the execution plan. The previous version got the first 
execution plan, while the new version gets the second, and the cost estimate 
depends on the matching of column types and constant types. It looks like the 
new execution plan is less costly when compared to the same type. Currently 
kylin does not have a particularly good way of evaluating the cost of execution 
plan selection, so renaming an execution plan or adding a new execution plan 
may cause the execution plan to become unstable. We don't need to be too 
concerned about this - is it possible that a change in calcite or kylin's code, 
and the evaluation of the cost, caused the first plan to be chosen again? Is it 
possible, and how can we avoid this problem?

Looking back, what does this condition do? In kylin, it's used for segment 
pruning or partition pruning, so to start with, it turns out that kylin has 
constructed a class FilterConditionExpander to generate some simplified 
expressions. For example, the above condition will eventually be simplified to 
AND(>=($12, date '2012-02-01'), <=($12, date '2012-02-29'), >=($12, date 
'2012-01-01'), <=($12, date '2012-12-31'). However, the class definition of 
simpleCall is incomplete, and the following two are recognized
* cast(column as datatype) <op> literal
* column <op> literal
The following is not considered a simpleCall
* cast(column as datatype) <op> cast(literal as datatype)
* column <op> cast(literal as datatype)
Above, column means column, literal means constant, and <op> means comparison 
operator.

h1. 1st
 add support for two types of simpleCall that are not supported, i.e. extract 
literal from cast(literal as datatype), and eventually all four cases will be 
converted to column <op> literal for subsequent segment pruning.

h1. 2nd
ReduceExpressionRule optimization rules can only optimize for LogicalProject, 
LogicalJoin, LogicalFilter. OlapProjectRel, OlapJoinRel, and OlapFilterRel do 
not inherit from these three classes, so the optimization fails.
Define OlapReduceExpressionRule rule to override ReduceExpressionRule rule for 
OlapRel. This is why some previous code had expressions that should have been 
optimized in theory but were not.

h1. 3rd
There is a lot of useless transformation logic in the SegmentPruningRule, which 
leads to slow execution on this side of the SegmentPruningRule, and can be 
occasional in CI.
Fixes: 1) For date string to be converted to timestamp constant, we need to 
judge whether it has hours, minutes and seconds, if not, we need to fill in the 
blanks, otherwise it may lead to the RexSimplify#simplifyAnds call to take a 
long time; 2) FilterConditionExpander class converts RexNode to CNF expression, 
and then it converts each sub-CNF expression into a CNF expression, and then it 
converts each sub-CNF expression into a CNF expression. expression and extract 
OlapContext#expandedFilterConditions information for each sub-CNF expression, 
there are a large number of repeated calls, poor performance, the use of a map 
in a computation to avoid the operation of creating the same object over and 
over again.

h1. 4th
For the following sql, from 
NBitmapFunctionForCalciteExecTest#testIntersectCountForFalseFilter


{code:sql}
select
intersect_count_v2(TEST_COUNT_DISTINCT_BITMAP, LSTG_FORMAT_NAME, 
    array['FP-.*GTC', 'Others'], 'REGEXP') as b, 
intersect_count_v2(TEST_COUNT_DISTINCT_BITMAP, LSTG_FORMAT_NAME, 
    array['FP-GTC|FP-non GTC', 'Others'], 'RAWSTRING') as c 
from test_kylin_fact where 1=2
{code}

Since the second problem has been solved to simplify the filter condition 
taking effect, the onMatch method of FilterReduceExpressionsRule is found to be

{code:java}
Values values = (Values) createEmptyRelOrEquivalent(call, filter);
call.transformTo(values.copy(call.rel(0).getTraitSet(), values.getInputs()));
{code}

to

{code:java}
 call.transformTo(createEmptyRelOrEquivalent(call, filter));
{code}

There is no need to copy, copying leads to incorrect traitSet and UT failure.





was (Author: JIRAUSER294653):
The execution plan generated by VolcanoPlanner can be unstable. Given the 
following sql
{code:sql}
select PART_DT, YEAR (CAL_DT) as y, YEAR (CAL_DT) * 100 + MONTH (CAL_DT) as ym, 
   YEAR (CAL_DT) * 10000 + MONTH (CAL_DT) * 100 + DAYOFMONTH(CAL_DT) as ymd, 
CAL_DT, sum (price)
from SSB_STEP.KYLIN_SALES_SAMPLE a inner join SSB_STEP.KYLIN_CAL_DT_SAMPLE b
on a.PART_DT = b.CAL_DT
where (cast ("CAL_DT" as date) BETWEEN '2012-02-01'
  and '2012-02-29'
  and cast ("CAL_DT" as date) BETWEEN '2012-01-01'
  and '2012-12-31' )
group by PART_DT, CAL_DT
order by PART_DT
    LIMIT 500
{code}

The execution plan you get before kylin5.0-alpha might look like this

{code:java}
raceId: c114c6d5-071e-4cf2-c17b-34c9f22be1ae 2023-11-17T18:51:28,797 INFO  
[project_38744] [Query 7eea9760-219b-dfb5-7433-8ecf4c95bf5e-332] 
util.QueryContextCutter : select layout candidate for 1 olapContext cost 82 ms
traceId: c114c6d5-071e-4cf2-c17b-34c9f22be1ae 2023-11-17T18:51:28,797 INFO  
[project_38744] [Query 7eea9760-219b-dfb5-7433-8ecf4c95bf5e-332] 
exec.SparderPlanExec : OlapToEnumerableConverter
  OlapProjectRel(PART_DT=[$0], Y=[EXTRACT(FLAG(YEAR), $1)], 
YM=[+(*(EXTRACT(FLAG(YEAR), $1), 100), EXTRACT(FLAG(MONTH), $1))], 
YMD=[+(+(*(EXTRACT(FLAG(YEAR), $1), 10000), *(EXTRACT(FLAG(MONTH), $1), 100)), 
EXTRACT(FLAG(DAY), $1))], CAL_DT=[$1], EXPR$5=[$2], ctx=[0@NDataflow 
[model_38744_01]])
    OlapLimitRel(ctx=[0@NDataflow [model_38744_01]], fetch=[500])
      OlapSortRel(sort0=[$0], dir0=[ASC-nulls-first], ctx=[0@NDataflow 
[model_38744_01]])
        OlapAggregateRel(group-set=[[0, 1]], 
groups=[[SSB_STEP.KYLIN_SALES_SAMPLE.PART_DT, 
SSB_STEP.KYLIN_CAL_DT_SAMPLE.CAL_DT]], EXPR$5=[SUM($2)], ctx=[0@NDataflow 
[model_38744_01]])
          OlapProjectRel(PART_DT=[$1], CAL_DT=[$22], PRICE=[$6], 
ctx=[0@NDataflow [model_38744_01]])
            OlapFilterRel(condition=[AND(>=($22, 2012-02-01), <=($22, 
2012-02-29))], ctx=[0@NDataflow [model_38744_01]])
              OlapJoinRel(condition=[=($1, $22)], joinType=[inner], 
ctx=[0@NDataflow [model_38744_01]])
                OlapTableScan(table=[[SSB_STEP, KYLIN_SALES_SAMPLE]], 
ctx=[0@NDataflow [model_38744_01]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]])
                OlapTableScan(table=[[SSB_STEP, KYLIN_CAL_DT_SAMPLE]], 
ctx=[0@NDataflow [model_38744_01]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 
71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
91, 92, 93, 94, 95, 96, 97, 98, 99]])
{code}

The execution plan you get at kylin-5.0.0 might look like this

{code:java}
[49047] 2023-11-17T19:11:50,375 INFO  [SuggestRunner-p-5-t-1] 
util.QueryContextCutter : Cut context OlapToEnumerableConverter
  OlapProjectRel(PART_DT=[$0], Y=[EXTRACT(FLAG(YEAR), $1)], 
YM=[+(*(EXTRACT(FLAG(YEAR), $1), 100), EXTRACT(FLAG(MONTH), $1))], 
YMD=[+(+(*(EXTRACT(FLAG(YEAR), $1), 10000), *(EXTRACT(FLAG(MONTH), $1), 100)), 
EXTRACT(FLAG(DAY), $1))], CAL_DT=[$1], EXPR$5=[$2], ctx=[0@null])
    OlapLimitRel(ctx=[0@null], fetch=[500])
      OlapSortRel(sort0=[$0], dir0=[ASC-nulls-first], ctx=[0@null])
        OlapAggregateRel(group-set=[[0, 1]], groups=[null], EXPR$5=[SUM($2)], 
ctx=[0@null])
          OlapProjectRel(PART_DT=[$1], CAL_DT=[$12], PRICE=[$6], ctx=[0@null])
            OlapFilterRel(condition=[AND(>=(CAST($12):DATE, 
CAST('2012-02-01'):DATE NOT NULL), <=(CAST($12):DATE, CAST('2012-02-29'):DATE 
NOT NULL), >=(CAST($12):DATE, CAST('2012-01-01'):DATE NOT NULL), 
<=(CAST($12):DATE, CAST('2012-12-31'):DATE NOT NULL))], ctx=[0@null])
              OlapJoinRel(condition=[=($1, $12)], joinType=[inner], 
ctx=[0@null])
                OlapTableScan(table=[[SSB_STEP, KYLIN_SALES_SAMPLE]], 
ctx=[0@null], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                OlapTableScan(table=[[SSB_STEP, KYLIN_CAL_DT_SAMPLE]], 
ctx=[0@null], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 
56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 
96, 97, 98, 99]])
{code}


It can be seen that the contents of OlapFilterRel in these two execution plans 
are different, but completely equivalent. And VolcanoPlanner gives what it 
thinks the best execution plan is actually different, depending on the 
execution cost of the execution plan. The previous version got the first 
execution plan, while the new version gets the second, and the cost estimate 
depends on the matching of column types and constant types. It looks like the 
new execution plan is less costly when compared to the same type. Currently 
kylin does not have a particularly good way of evaluating the cost of execution 
plan selection, so renaming an execution plan or adding a new execution plan 
may cause the execution plan to become unstable. We don't need to be too 
concerned about this - is it possible that a change in calcite or kylin's code, 
and the evaluation of the cost, caused the first plan to be chosen again? Is it 
possible, and how can we avoid this problem?

Looking back, what does this condition do? In kylin, it's used for segment 
pruning or partition pruning, so to start with, it turns out that kylin has 
constructed a class FilterConditionExpander to generate some simplified 
expressions. For example, the above condition will eventually be simplified to 
AND(>=($12, date '2012-02-01'), <=($12, date '2012-02-29'), >=($12, date 
'2012-01-01'), <=($12, date '2012-12-31'). However, the class definition of 
simpleCall is incomplete, and the following two are recognized
* cast(column as datatype) <op> literal
* column <op> literal
The following is not considered a simpleCall
* cast(column as datatype) <op> cast(literal as datatype)
* column <op> cast(literal as datatype)
Above, column means column, literal means constant, and <op> means comparison 
operator.

Fix: add support for two types of simpleCall that are not supported, i.e. 
extract literal from cast(literal as datatype), and eventually all four cases 
will be converted to column <op> literal for subsequent segment pruning.




> Enhance the simplification method of filter condition for segment pruning
> -------------------------------------------------------------------------
>
>                 Key: KYLIN-5875
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5875
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Query Engine
>    Affects Versions: 5.0-alpha
>            Reporter: pengfei.zhan
>            Assignee: pengfei.zhan
>            Priority: Major
>             Fix For: 5.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (KYLIN-5875) Enhance the simplification method of filter condition for segment pruning

Reply via email to