[ 
https://issues.apache.org/jira/browse/KYLIN-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhong Yanghong resolved KYLIN-3540.
-----------------------------------
    Resolution: Resolved

> Improve Mandatory Cuboid Recommendation Algorithm
> -------------------------------------------------
>
>                 Key: KYLIN-3540
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3540
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: Zhong Yanghong
>            Assignee: Zhong Yanghong
>            Priority: Major
>             Fix For: v2.6.0
>
>
> Previously, to cover cuboids which are not prebuilt, the cube planner turned to 
> mandatory cuboids, which were selected when their rollup row count was above some 
> threshold. There are two shortcomings:
> * The way to estimate the rollup row count is not good
> * It's hard to determine the threshold of rollup row count for recommending 
> mandatory cuboids
> bq. {color:#f79232}The improved way to estimate the rollup row count is as 
> follows:{color}
> The current criterion for recommending mandatory cuboids is based on the average 
> rollup count collected from query metrics. This has a disadvantage, illustrated 
> by the following example:
> Cuboid (A,B) has 1000 rows, prebuilt; Cuboid (B) has 10 rows, not prebuilt; 
> The ground truth for the rollup count from Cuboid (A,B) to Cuboid (B) is
> {code}
> Cuboid (A,B) - Cuboid (B) = 1000 - 10 = 990
> {code}
> Suppose B's values are evenly distributed in Cuboid (A,B). Then each value of B 
> covers 1000 / 10 = 100 rows of Cuboid (A,B).
> Now consider the SQL
> {code}
> select B, count(*)
> from T
> where B = 'e1'
> group by B
> {code}
> The rollup count estimated by the current algorithm will be
> {code}
> Cuboid (A,{'e1'}) - return count = 100 - 1 = 99
> {code}
> which is much smaller than the ground truth of 990, because most rows are 
> excluded by the filter.
> It's better to calculate the rollup rate first and then multiply it by the parent 
> cuboid row count to estimate the rollup count. The refined formula is as 
> follows:
> {code}
> Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1'}) = 
> 1000-1000*1/100 = 990
> {code}
> Consider another SQL query
> {code}
> select count(*)
> from T
> where B in ('e1','e2')
> {code}
> The rollup count estimated by the current algorithm will be
> {code}
> Cuboid (A,{'e1','e2'}) - return count = 100*2 - 1 = 199
> {code}
> The rollup count estimated by the refined algorithm will be
> {code}
> Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1','e2'}) = 
> 1000-1000*1/(100*2) = 995
> {code}
> In summary, the refined algorithm is much less influenced by filters in the 
> SQL.
> bq. {color:#f79232}Don't recommend mandatory cuboids & don't need the 
> threshold
> {color}
> Previously, the reason to recommend mandatory cuboids was that they are not 
> prebuilt and their row count statistics are unknown, which makes it impossible 
> to apply the cube planner algorithm to them. Now, with the improved way of 
> estimating the rollup row count, we can better estimate the row count statistics 
> of cuboids which are not prebuilt. The cost-based cube planner algorithm can 
> then decide which cuboids to build, and the threshold is no longer 
> needed.
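The two estimators described in the issue can be sketched numerically with the figures from the example (a hypothetical Python illustration only; the function names are made up and this is not Kylin's actual code):

```python
# Example setup from the issue: Cuboid (A,B) has 1000 rows (prebuilt),
# Cuboid (B) has 10 distinct values, so with an even distribution each
# value of B covers 1000 / 10 = 100 rows of Cuboid (A,B).

def current_estimate(matched_rows, returned_rows):
    """Current algorithm: rollup count = rows scanned in the filtered
    slice of the parent cuboid minus rows returned by the query."""
    return matched_rows - returned_rows

def refined_estimate(parent_rows, matched_rows, returned_rows):
    """Refined algorithm: compute the rollup rate on the filtered slice,
    then scale by the full parent cuboid row count."""
    return parent_rows - parent_rows * returned_rows / matched_rows

PARENT = 1000  # rows in prebuilt Cuboid (A,B)

# Query 1: select B, count(*) from T where B = 'e1' group by B
# -> scans 100 rows of (A,B), returns 1 row
print(current_estimate(100, 1))           # 99 (far from the ground truth 990)
print(refined_estimate(PARENT, 100, 1))   # 990.0

# Query 2: select count(*) from T where B in ('e1','e2')
# -> scans 200 rows, returns 1 row
print(current_estimate(200, 1))           # 199
print(refined_estimate(PARENT, 200, 1))   # 995.0
```

This matches the numbers in the issue: the current estimator is dragged down by the filter, while the refined one stays close to the true rollup count of 990.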



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
