[
https://issues.apache.org/jira/browse/KYLIN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824848#comment-16824848
]
zhao jintao commented on KYLIN-3975:
Hi Shaofeng:
Yes, your idea is a good way to optimize this, and I think it is a workable
plan to improve query performance. I have another optimization idea along the
same lines. Let's look at an example.
Assume there is a normal cube with 5 dimensions: dimA, dimB, dmonth, dweek
and dt. "dt" is the partition date dimension, and "dmonth" and "dweek" are the
month and week dimensions. The cube has 3 measures: count, sum and count
distinct.
The cube is built every day and merged with the default settings (7 days/28
days). The cube may then have 5 week segments and one 2-day segment:
segA: [20190331-20190407]
segB: [20190407-20190414]
segC: [20190414-20190421]
segD: [20190421-20190428]
segE: [20190428-20190505]
segF: [20190505-20190507]
Assume there is a natural month query: "select dimA, dimB, dmonth, sum,
count(distinct y) from fact_table where dt >= '2019-04-01' and dt <
'2019-05-01' group by dimA, dimB, dmonth". Kylin matches the cuboid
"dimA+dimB+dmonth+dt", reads data from 5 segments and aggregates 30 days of
values. The query is slow if the amount of data to aggregate is large.
Assume there is a natural week query: "select dimA, dimB, dweek, sum,
count(distinct y) from fact_table where dt >= '2019-03-31' and dt <
'2019-04-21' group by dimA, dimB, dweek". Kylin matches the cuboid
"dimA+dimB+dweek+dt", reads data from 3 segments and aggregates 21 days of
values. The query is slow if the amount of data to aggregate is large.
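The segment counts above can be checked with a small sketch. It is only an
illustration, not Kylin code: the segment ranges are copied from the example
(end dates treated as exclusive), the week query is taken as covering the three
weeks [2019-03-31, 2019-04-21), and the function name segments_to_scan is
hypothetical.

```python
from datetime import date

# Segment ranges copied from the example above; the end date is exclusive.
SEGMENTS = {
    "segA": (date(2019, 3, 31), date(2019, 4, 7)),
    "segB": (date(2019, 4, 7), date(2019, 4, 14)),
    "segC": (date(2019, 4, 14), date(2019, 4, 21)),
    "segD": (date(2019, 4, 21), date(2019, 4, 28)),
    "segE": (date(2019, 4, 28), date(2019, 5, 5)),
    "segF": (date(2019, 5, 5), date(2019, 5, 7)),
}


def segments_to_scan(start, end):
    """Return the names of segments overlapping the half-open range [start, end)."""
    return [name for name, (s, e) in SEGMENTS.items() if s < end and start < e]


# Natural month query: dt >= '2019-04-01' and dt < '2019-05-01'
month_query = segments_to_scan(date(2019, 4, 1), date(2019, 5, 1))
# Natural week query covering three weeks starting 2019-03-31
week_query = segments_to_scan(date(2019, 3, 31), date(2019, 4, 21))

print(month_query)  # ['segA', 'segB', 'segC', 'segD', 'segE']
print(week_query)   # ['segA', 'segB', 'segC']
```

Every returned segment still stores rows at "dt" granularity, so the query has
to aggregate one row per day for each dimension combination.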
For a natural week report or a natural month report, the partition date range
is exactly a natural week or a natural month. So I have an idea to optimize
Kylin query performance:
1. Kylin should keep the normal date cube, then add a week cube and a month
cube. The week cube depends on the date cube and adds a week dimension; it is
built every day simply by merging aggregated data from the date cube, and it
has exactly one segment per natural week. Likewise, the month cube depends on
the date cube and adds a dmonth dimension; it is built every day by merging
aggregated data from the date cube, and it has exactly one segment per natural
month.
2. Kylin changes the execution plan to use a cuboid that has "dmonth" or
"dweek" but no "dt". For natural month queries, Kylin can read data from the
month cube and match a cuboid with "dmonth" and without "dt". For natural week
queries, Kylin can read data from the week cube and match a cuboid with "dweek"
and without "dt".
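The routing decision in step 2 could be sketched roughly as follows. This is
not how Kylin's query planner works today; the cube names and the function
choose_cube are hypothetical, and the assumption that a natural week starts on
Sunday is inferred from the example segments (2019-03-31 is a Sunday).

```python
from datetime import date


def is_month_aligned(start, end):
    """True if [start, end) begins and ends on the first day of a month."""
    return start < end and start.day == 1 and end.day == 1


def is_week_aligned(start, end):
    """True if [start, end) begins and ends on a Sunday (weekday() == 6)."""
    return start < end and start.weekday() == 6 and end.weekday() == 6


def choose_cube(start, end):
    """Pick the coarsest cube whose segments exactly cover [start, end)."""
    if is_month_aligned(start, end):
        return "month_cube"  # hypothetical name
    if is_week_aligned(start, end):
        return "week_cube"   # hypothetical name
    return "date_cube"       # fall back to the normal cube with "dt"


print(choose_cube(date(2019, 3, 1), date(2019, 5, 1)))    # month_cube
print(choose_cube(date(2019, 3, 31), date(2019, 4, 14)))  # week_cube
print(choose_cube(date(2019, 4, 3), date(2019, 4, 20)))   # date_cube
```

A real implementation would also have to check that the query does not use
"dt" anywhere other than in the range filter.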
For example, assume a month cube has month segments:
monthSegA: [20190301-20190401]
monthSegB: [20190401-20190501]
Assume a natural month query is: "select dimA, dimB, dmonth, sum,
count(distinct y) from fact_table where dt >= '2019-03-01' and dt <
'2019-05-01' group by dimA, dimB, dmonth". Kylin reads data from the month cube
and matches the cuboid dimA+dimB+dmonth, whose size is much smaller than the
previous one, so the query should be faster than against the normal cube.
Assume a week cube has week segments:
weekSegA: [20190331-20190407]
weekSegB: [20190407-20190414]
The natural week query is: "select dimA, dimB, dweek, sum, count(distinct
y) from fact_table where dt >= '2019-03-31' and dt < '2019-04-14' group by
dimA, dimB, dweek". Kylin reads data from the week cube and matches the cuboid
dimA+dimB+dweek, whose size is much smaller than the previous one, so the query
should be faster than against the normal cube.
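To illustrate why a cuboid without "dt" is smaller, here is a minimal sketch of
rolling daily aggregates up to one row per month. The data is made up, the
function name roll_up_to_month is hypothetical, and plain Python sets stand in
for whatever structure the count distinct measure actually stores.

```python
from collections import defaultdict

# Made-up daily aggregates: (dimA, dimB, dt) -> (sum, set of distinct y values).
daily = {
    ("a1", "b1", "2019-04-01"): (10, {"u1", "u2"}),
    ("a1", "b1", "2019-04-02"): (5, {"u2", "u3"}),
    ("a1", "b1", "2019-04-03"): (7, {"u1"}),
}


def roll_up_to_month(rows):
    """Merge daily rows into one row per (dimA, dimB, month)."""
    monthly = defaultdict(lambda: [0, set()])
    for (dim_a, dim_b, dt), (total, distinct_values) in rows.items():
        key = (dim_a, dim_b, dt[:7])      # 'YYYY-MM' acts as dmonth
        monthly[key][0] += total          # sums can simply be added
        monthly[key][1] |= distinct_values  # distinct values must be unioned
    return {k: (v[0], len(v[1])) for k, v in monthly.items()}


monthly = roll_up_to_month(daily)
print(monthly)  # {('a1', 'b1', '2019-04'): (22, 3)}
```

Three daily rows collapse into one monthly row. The count distinct result is 3
rather than 2 + 2 + 1, which is why it cannot be obtained by adding daily
counts and has to be merged either at build time or at query time.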
This optimization is only useful for natural month or natural week queries.
But in most cases, we conduct big data analysis from the perspective of a
natural week or natural month. I think that with this optimization, Kylin query
performance can be much better than before for natural week and natural month
queries.
Is my idea feasible and effective? Thanks.
> Can kylin accelerate query speed for natural week or natural month report?
> ---
>
> Key: KYLIN-3975
> URL: https://issues.apache.org/jira/browse/KYLIN-3975
> Project: Kylin
> Issue Type: New Feature
> Components: Job Engine, Query Engine
> Reporter: zhao jintao
> Priority: Major
>
> Hi team:
> In a big data analytics platform, we often query data for a natural week or
> natural month.
> For example, in bank or accounting reports, the query period is often a
> natural week or natural month.
> In Kylin, we can build a cube to increase query speed. However, queries are
> slow if the amount of data is large and the query period is long,
> especially when using a count distinct measure.
> For example, we can add a month dimension to the cube, then merge the cube by
> natural month period; but if the query SQL filters on the date partition, it
> will still match the cuboid that has both the week dimension and the date
> dimension, so Kylin needs to read
> data from HBase and aggregate data in memory. It is also slow if the amount of
> data is