[jira] [Commented] (KYLIN-3975) Can kylin accelerate query speed for natural week or natural month report?

2019-04-24 Thread zhao jintao (JIRA)


[ 
https://issues.apache.org/jira/browse/KYLIN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824848#comment-16824848
 ] 

zhao jintao commented on KYLIN-3975:


Hi Shaofeng:

Yes, your idea is a good optimization, and I think it is a feasible way to 
improve query performance. I have another idea along similar lines.

Let's look at an example.
Assume there is a normal cube with 5 dimensions: dimA, dimB, dmonth, dweek 
and dt. "dt" is the partition date dimension, and "dmonth" and "dweek" are the 
month and week dimensions. The cube has 3 measures: count, sum and count 
distinct.

The cube builds every day and merges by the default settings (7 days/28 days). 
The cube may then have 5 week segments and one 2-day segment:

segA: [20190331-20190407] 
segB: [20190407-20190414]
segC: [20190414-20190421] 
segD: [20190421-20190428]
segE: [20190428-20190505]
segF: [20190505-20190507]

Assume there is a natural-month query: "select dimA, dimB, dmonth, sum(x), 
count(distinct y) from fact_table where dt >= '2019-04-01' and dt < 
'2019-05-01' group by dimA, dimB, dmonth". Kylin matches the cuboid 
"dimA+dimB+dmonth+dt", reads data from 5 segments and aggregates 30 days of 
values. The query is slow if the amount of data to aggregate is large.
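
The segment count for this month query can be checked with a small Python 
sketch. It is only an illustration, not Kylin code: it assumes the segment 
ranges listed above are half-open [start, end), and the names SEGMENTS and 
overlapping_segments are made up for this example.

```python
from datetime import date

# Segment ranges from the example, assumed to be half-open [start, end).
SEGMENTS = {
    "segA": (date(2019, 3, 31), date(2019, 4, 7)),
    "segB": (date(2019, 4, 7), date(2019, 4, 14)),
    "segC": (date(2019, 4, 14), date(2019, 4, 21)),
    "segD": (date(2019, 4, 21), date(2019, 4, 28)),
    "segE": (date(2019, 4, 28), date(2019, 5, 5)),
    "segF": (date(2019, 5, 5), date(2019, 5, 7)),
}


def overlapping_segments(segments, start, end):
    """Return the names of segments that intersect the filter start <= dt < end."""
    return [name for name, (s, e) in segments.items() if s < end and start < e]


# Natural month query: dt >= '2019-04-01' and dt < '2019-05-01'
hit = overlapping_segments(SEGMENTS, date(2019, 4, 1), date(2019, 5, 1))
print(hit)                                         # ['segA', 'segB', 'segC', 'segD', 'segE']
print((date(2019, 5, 1) - date(2019, 4, 1)).days)  # 30
```

Note that segA and segE are only partly inside April, so the "dt" dimension 
is still needed to cut them at the month boundary.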

Assume there is a natural-week query: "select dimA, dimB, dweek, sum(x), 
count(distinct y) from fact_table where dt >= '2019-03-31' and dt < 
'2019-04-21' group by dimA, dimB, dweek". Kylin matches the cuboid 
"dimA+dimB+dweek+dt", reads data from 3 segments and aggregates 21 days of 
values. The query is slow if the amount of data to aggregate is large.

For a natural-week or natural-month report, the partition date range is 
exactly a natural week or a natural month. So I have an idea for optimizing 
Kylin query performance:

1. Kylin keeps the normal date cube and adds a week cube and a month cube. 
The week cube depends on the date cube and adds a dweek dimension. It is built 
every day simply by merging aggregated data from the date cube, and it has 
only one segment per natural week. In the same way, the month cube depends on 
the date cube and adds a dmonth dimension. It is built every day by merging 
aggregated data from the date cube, and it has only one segment per natural 
month.

2. Kylin changes the execution plan to use a cuboid that has no "dt" but has 
"dmonth" or "dweek". For natural-month queries, Kylin can read data from the 
month cube and match a cuboid with "dmonth" and without "dt". For natural-week 
queries, Kylin can read data from the week cube and match a cuboid with 
"dweek" and without "dt".
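
To illustrate point 1, here is a rough Python sketch of rolling daily 
aggregates up into one row per natural week. It is not how Kylin builds or 
merges segments. It assumes weeks run from Sunday to Saturday (as in the 
segments above), and it keeps the distinct values as a plain Python set, which 
stands in for a mergeable count distinct state. The names week_start and 
roll_up_to_weeks are made up for this example.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(d):
    """Sunday that opens the natural week containing d (Sunday-to-Saturday weeks)."""
    return d - timedelta(days=(d.weekday() + 1) % 7)


def roll_up_to_weeks(daily):
    """Merge {(dimA, dimB, dt): (sum_x, set_of_y)} into {(dimA, dimB, dweek): (sum_x, distinct_y)}."""
    weekly = defaultdict(lambda: [0, set()])
    for (dim_a, dim_b, dt), (sum_x, ys) in daily.items():
        cell = weekly[(dim_a, dim_b, week_start(dt))]
        cell[0] += sum_x   # sum is additive across days
        cell[1] |= ys      # distinct values must be unioned, not added
    return {key: (total, len(ys)) for key, (total, ys) in weekly.items()}


daily = {
    ("a", "b", date(2019, 3, 31)): (10, {"u1", "u2"}),
    ("a", "b", date(2019, 4, 1)): (5, {"u2", "u3"}),
    ("a", "b", date(2019, 4, 7)): (7, {"u1"}),
}
weekly = roll_up_to_weeks(daily)
print(weekly[("a", "b", date(2019, 3, 31))])  # (15, 3)
print(weekly[("a", "b", date(2019, 4, 7))])   # (7, 1)
```

The first week has 3 distinct values, not 2 + 2 = 4, which is why the week and 
month cubes would have to merge the count distinct state from the date cube 
rather than add up the daily counts.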
 
For example, assume the month cube has these month segments:
monthSegA: [20190301-20190401] 
monthSegB: [20190401-20190501]
Assume a natural-month query: "select dimA, dimB, dmonth, sum(x), 
count(distinct y) from fact_table where dt >= '2019-03-01' and dt < 
'2019-05-01' group by dimA, dimB, dmonth". Kylin reads data from the month 
cube and matches the cuboid dimA+dimB+dmonth, whose size is much smaller than 
that of the previous cuboid. It should be faster than the normal cube.

Assume the week cube has these week segments:
weekSegA: [20190331-20190407] 
weekSegB: [20190407-20190414]
Assume a natural-week query: "select dimA, dimB, dweek, sum(x), 
count(distinct y) from fact_table where dt >= '2019-03-31' and dt < 
'2019-04-14' group by dimA, dimB, dweek". Kylin reads data from the week cube 
and matches the cuboid dimA+dimB+dweek, whose size is much smaller than that 
of the previous cuboid. It should be faster than the normal cube.
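
The routing rule in point 2 could look roughly like the Python sketch below. 
This is only my illustration of the idea, not an existing Kylin API: 
choose_cube and the cube names are hypothetical, the filter is assumed to be 
start <= dt < end, and weeks are assumed to start on Sunday as in the segments 
above.

```python
from datetime import date

SUNDAY = 6  # date.weekday() numbers Monday as 0 and Sunday as 6


def choose_cube(start, end, dt_in_group_by=False):
    """Pick a cube for the filter start <= dt < end (hypothetical routing rule)."""
    if dt_in_group_by or start >= end:
        return "date_cube"   # per-day results still need the dt dimension
    if start.day == 1 and end.day == 1:
        return "month_cube"  # the range is a whole number of natural months
    if start.weekday() == SUNDAY and end.weekday() == SUNDAY:
        return "week_cube"   # the range is a whole number of natural weeks
    return "date_cube"       # anything else falls back to the normal cube


print(choose_cube(date(2019, 3, 1), date(2019, 5, 1)))    # month_cube
print(choose_cube(date(2019, 3, 31), date(2019, 4, 14)))  # week_cube
print(choose_cube(date(2019, 4, 3), date(2019, 4, 20)))   # date_cube
```

Any query whose date range does not line up with natural week or month 
boundaries would still be answered from the normal date cube.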


This optimization only helps natural-month and natural-week queries. But in 
most cases we analyze big data from the perspective of natural weeks or 
natural months. I think that with this optimization, Kylin queries over 
natural weeks or natural months can be much more efficient than before.

 

Is my idea feasible and effective? Thanks.


[jira] [Commented] (KYLIN-3975) Can kylin accelerate query speed for natural week or natural month report?

2019-04-23 Thread Shaofeng SHI (JIRA)


[ 
https://issues.apache.org/jira/browse/KYLIN-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824714#comment-16824714
 ] 

Shaofeng SHI commented on KYLIN-3975:

Hi Jintao,

 

Let me try to understand: say a cube has three segments:

seg1: [2019-01-20 to 2019-02-01)

seg2:  [2019-02-01 to 2019-03-01)

seg3: [2019-03-01 to 2019-03-10)

 

Assume there is a query:

select dim1, dim2, sum(x), count(distinct y) from fact_table where dt > 
'2019-01-25' and dt < '2019-03-10' group by dim1, dim2

For this query, Kylin will scan all three segments after checking the 
partition date's max/min values, and the selected cuboid will have "dt". If 
'month'-'week'-'dt' is defined as a hierarchy, the cuboid will have all of 
them. In this case, the cuboid will be "dim1+dim2+month+week+dt".

 

Here we can see that although the data in these segments has been merged to 
month level, the month-level aggregates won't be used because the query 
condition is on "dt". So the performance is not as good as we expected.

One potential optimization: if a segment falls entirely within the queried 
partition date range (in this case, seg2 and seg3), and the partition date is 
used only as a filter condition (not in group by), Kylin can change the 
execution plan to use a cuboid that has no "dt". In this case it would be 
optimized to "dim1+dim2", whose size is much smaller than that of the previous 
cuboid, and the query can be much more efficient than before because the 
aggregation has already been done in the cube.
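
As a rough Python sketch of that check (not actual Kylin planner code): 
plan_cuboids is a hypothetical helper, the segment ranges are the half-open 
ones listed above, and "dt" is assumed to have day granularity so that 
dt > '2019-01-25' is the same as dt >= '2019-01-26'.

```python
from datetime import date

# The three segments from the example, as half-open [start, end) ranges.
SEGMENTS = {
    "seg1": (date(2019, 1, 20), date(2019, 2, 1)),
    "seg2": (date(2019, 2, 1), date(2019, 3, 1)),
    "seg3": (date(2019, 3, 1), date(2019, 3, 10)),
}


def plan_cuboids(segments, start, end, dt_in_group_by=False):
    """Choose a cuboid per segment for the filter start <= dt < end."""
    plan = {}
    for name, (s, e) in segments.items():
        if e <= start or end <= s:
            continue  # the segment does not overlap the filter at all
        fully_covered = start <= s and e <= end
        if fully_covered and not dt_in_group_by:
            plan[name] = "dim1+dim2"                # dt is not needed here
        else:
            plan[name] = "dim1+dim2+month+week+dt"  # dt is needed to filter rows
    return plan


# dt > '2019-01-25' and dt < '2019-03-10' becomes [2019-01-26, 2019-03-10)
plan = plan_cuboids(SEGMENTS, date(2019, 1, 26), date(2019, 3, 10))
print(plan)
# {'seg1': 'dim1+dim2+month+week+dt', 'seg2': 'dim1+dim2', 'seg3': 'dim1+dim2'}
```

Only seg1 is cut by the filter, so only seg1 would still need the cuboid with 
"dt".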

 

Is this what you want to discuss, or do you have a better idea? Thanks.

 

> Can kylin accelerate query speed for natural week or natural month report?
> ---
>
> Key: KYLIN-3975
> URL: https://issues.apache.org/jira/browse/KYLIN-3975
> Project: Kylin
> Issue Type: New Feature
> Components: Job Engine, Query Engine
> Reporter: zhao jintao
> Priority: Major
>
> Hi team:
> In a big data analytics platform, we often query data for a natural week or 
> a natural month.
> For example, in bank or accounting reports, the query period is often a 
> natural week or a natural month.
> In Kylin, we can build a cube to speed up queries. However, queries are 
> slow if the amount of data is large and the query period is long, 
> especially when using a count distinct measure.
> For example, we can add a month dimension to the cube and then merge the 
> cube by natural month; but if the query SQL filters on the partition date, 
> it will still match a cuboid that has both the week dimension and the date 
> dimension, and Kylin needs to read data from HBase and aggregate it in 
> memory. This is also slow if the amount of data is large.
> Does anyone face the same problem? Does anyone have a better way to solve 
> the problem of natural week or natural month queries?
>  
> Best regards
> Thank you.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)