[ https://issues.apache.org/jira/browse/KYLIN-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhong Yanghong resolved KYLIN-3540.
-----------------------------------
    Resolution: Resolved

Improve Mandatory Cuboid Recommendation Algorithm
-------------------------------------------------

                Key: KYLIN-3540
                URL: https://issues.apache.org/jira/browse/KYLIN-3540
            Project: Kylin
         Issue Type: Improvement
           Reporter: Zhong Yanghong
           Assignee: Zhong Yanghong
           Priority: Major
            Fix For: v2.6.0

Previously, to add cuboids that are not prebuilt, the cube planner relied on mandatory cuboids, which were selected whenever their rollup row count exceeded some threshold. This approach has two shortcomings:
* The way the rollup row count is estimated is inaccurate
* It is hard to choose a threshold on the rollup row count for recommending mandatory cuboids

bq. {color:#f79232}The improved way to estimate the rollup row count is as follows:{color}

The current criterion for recommending mandatory cuboids is based on the average rollup count collected from query metrics. This has a disadvantage, illustrated by the following example:

Cuboid (A,B) has 1000 rows and is prebuilt; Cuboid (B) has 10 rows and is not prebuilt. The ground truth for the rollup count from Cuboid (A,B) to Cuboid (B) is
{code}
Cuboid (A,B) - Cuboid (B) = 1000 - 10 = 990
{code}
Suppose the values of B are evenly distributed over A. Then for each value of B, the row count in Cuboid (A,B) is 1000 / 10 = 100.

Now consider the sql
{code}
select B, count(*)
from T
where B = 'e1'
group by B
{code}
The rollup count estimated by the current algorithm is
{code}
Cuboid (A,{'e1'}) - return count = 100 - 1 = 99
{code}
which is much smaller than 990, because most of the parent rows are removed by the filter rather than by the rollup.

It's better to calculate the rollup rate first and then multiply it by the parent cuboid row count to estimate the rollup count.
The refined formula is as follows:
{code}
Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1'}) = 1000 - 1000*1/100 = 990
{code}
Another sql:
{code}
select count(*)
from T
where B in ('e1','e2')
{code}
The rollup count estimated by the current algorithm is
{code}
Cuboid (A,{'e1','e2'}) - return count = 100*2 - 1 = 199
{code}
while the rollup count estimated by the refined algorithm is
{code}
Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1','e2'}) = 1000 - 1000*1/(100*2) = 995
{code}
In summary, the refined algorithm is much less influenced by filters in the sql.

bq. {color:#f79232}Don't recommend mandatory cuboids & don't need the threshold{color}

Previously, the reason for recommending mandatory cuboids was that they are not prebuilt and their row count statistics are unknown, which made it impossible to apply the cube planner algorithm to them. With the improved way of estimating the rollup row count, we can now estimate the row count statistics of cuboids that are not prebuilt. The cost-based cube planner algorithm then decides which cuboids to build, and the threshold is no longer needed.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
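The two estimators in the worked example above can be sketched numerically. This is a minimal illustration, not Kylin's actual implementation; the function names and constants are assumptions taken from the numbers in the example.

```python
# Hypothetical sketch comparing the current and refined rollup estimators
# from the KYLIN-3540 example. Not Kylin's actual code.

def rollup_current(filtered_parent_rows: float, return_count: int) -> float:
    """Current estimator: filtered parent row count minus returned row count."""
    return filtered_parent_rows - return_count

def rollup_refined(parent_rows: float, filtered_parent_rows: float,
                   return_count: int) -> float:
    """Refined estimator: compute the rollup rate on the filtered slice,
    then scale it up to the full parent cuboid row count."""
    return parent_rows - parent_rows * return_count / filtered_parent_rows

PARENT_ROWS = 1000      # Cuboid (A,B), prebuilt
ROWS_PER_B_VALUE = 100  # 1000 rows / 10 distinct values of B

# Query 1: where B = 'e1' -> filtered parent has 100 rows, 1 row returned
print(rollup_current(ROWS_PER_B_VALUE, 1))               # 99
print(rollup_refined(PARENT_ROWS, ROWS_PER_B_VALUE, 1))  # 990.0

# Query 2: where B in ('e1','e2') -> filtered parent has 200 rows, 1 row returned
print(rollup_current(2 * ROWS_PER_B_VALUE, 1))               # 199
print(rollup_refined(PARENT_ROWS, 2 * ROWS_PER_B_VALUE, 1))  # 995.0
```

Both refined estimates land near the ground-truth rollup count of 990, while the current estimator is dragged down by the filter.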