Hi dear Kylin users and development team,
There are a few things I would like to discuss with the community.
As a representative MOLAP engine, Kylin uses pre-aggregation to provide 
high-concurrency analysis with second-level response times, but it gives up 
some flexibility in exchange.
The current limitation, that existing segments must be purged before an 
additional measure can be added, causes a lot of duplicated computation and 
unnecessary disk I/O. Such waste should be avoided, especially in a MOLAP 
engine.
For example, suppose cubeA has one measure, m1, and segments covering time 
range tr1. Now a user adds a second measure, m2, but does not want to purge 
the segments over tr1. The values of m2 will exist only in tr2, the segments 
built afterwards. Naturally, tr1 does not contain values of m2, which is easy 
to understand even for a user who knows little about MOLAP. A query over tr1 
and tr2 is valid for both m1 and m2, but the result of m2 over tr1 will be 
null. It would be better to remind the user that the measure is missing 
there. Moreover, refreshing the segments over tr1 will fill in m2 for them.
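To make the example concrete, here is a minimal Python sketch of the proposed 
query behaviour. It is not Kylin code: Segment, query and the sample values 
are hypothetical names and numbers chosen only for illustration, and a 
segment is reduced to a single pre-aggregated value per measure.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for a cube segment over one time range."""
    name: str
    measures: dict  # measure name -> pre-aggregated value


def query(segments, measure):
    """Sum `measure` over the segments that have it.

    Returns (total, missing): total is None when no segment holds the
    measure, and missing lists the segments the user should be warned about.
    """
    total = None
    missing = []
    for seg in segments:
        if measure not in seg.measures:
            missing.append(seg.name)
            continue
        value = seg.measures[measure]
        total = value if total is None else total + value
    return total, missing


tr1 = Segment("tr1", {"m1": 10})           # built before m2 was added
tr2 = Segment("tr2", {"m1": 5, "m2": 7})   # built after m2 was added

m1_result = query([tr1, tr2], "m1")
m2_over_tr1 = query([tr1], "m2")
m2_result = query([tr1, tr2], "m2")

# Refreshing tr1 supplies m2, so the warning disappears (3 is a made-up value).
tr1.measures["m2"] = 3
m2_after_refresh = query([tr1, tr2], "m2")
```

The second element of each result is what the reminder to the user could be 
built from: it names the segments where the measure has not been built yet.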
Currently, Kylin's storage engine is HBase. Measures are values aggregated 
over combinations of dimension members and stored in a column of a column 
family in HBase. For the same cube, adding a new measure would add a column 
to the HBase table mapping and take effect from the next build. For existing 
HTables (segments), the new column is allowed to be missing. Refreshing an 
old segment would add a new column to its HTable to store the new measure. 
The value of the new measure is aggregated according to the combination of 
dimension members in the rowkey, without recalculating the existing measures.
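The refresh step described above can be sketched roughly as follows. This is 
an illustration under strong assumptions, not Kylin's build logic: the HTable 
is modelled as a plain dict from rowkey to columns, only a single cuboid is 
considered, the new measure is a SUM, and refresh_with_new_measure and the 
sample rows are hypothetical.

```python
from collections import defaultdict


def refresh_with_new_measure(htable, source_rows, dims, source_field, column):
    """Add `column` to every row of `htable` without touching other columns.

    htable:       dict mapping rowkey (tuple of dimension values) -> {column: value}
    source_rows:  raw fact rows as dicts
    dims:         dimension names that make up the rowkey, in order
    source_field: fact field that the new SUM measure aggregates
    """
    sums = defaultdict(int)
    for row in source_rows:
        rowkey = tuple(row[d] for d in dims)
        sums[rowkey] += row[source_field]
    for rowkey, value in sums.items():
        # Existing columns such as m1 are left exactly as they were.
        htable.setdefault(rowkey, {})[column] = value
    return htable


# Old segment: only m1 is stored.
htable = {("cn", "2019"): {"m1": 30}, ("us", "2019"): {"m1": 12}}

facts = [
    {"country": "cn", "year": "2019", "qty": 2},
    {"country": "cn", "year": "2019", "qty": 3},
    {"country": "us", "year": "2019", "qty": 4},
]

refresh_with_new_measure(htable, facts, ["country", "year"], "qty", "m2")
```

Only the new column is computed from the source data; the stored values of 
m1 are read by nobody and rewritten by nobody during the refresh.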
At present, Kylin's solution for additional measures, and even additional 
dimensions, is Hybrid, but we found the following shortcomings in practice:
1. Management costs: similar cubes have to be maintained repeatedly, and 
most of them share many dimensions and measures. If you want to perform an 
optimization such as pruning, you need to configure it on every one of these 
cubes.
2. A large number of cubes: in the early stage of analyzing a business the 
requirements are not stable, and analysts often need to add measures. Cubes 
are added to the Hybrid group continuously, which produces a lot of cubes.
3. Repeated calculation: if you want to drop the old cube from the Hybrid 
group, you need to build the latest cube over the historical data so that it 
covers the old cube.
All of this results in a lot of waste.
In addition, while using Kylin I felt that the metadata about measures could 
be improved.
1. Measures are among the most important concerns of analysts. If the 
measures of the analysis system could be decoupled from the materialized 
view (cube) and have their own management system, it might be more flexible.
2. Once the dimensions have been chosen during cube design, its cuboids are 
determined regardless of the number of measures. It can be confusing to 
maintain cubes that have different measures but the same cuboids. Cubes with 
different cuboids should be considered different cubes, which is the 
definition of a cube, isn't it?
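As a small illustration of point 2, the sketch below enumerates the cuboids 
of a cube purely from its dimension list. It assumes the full cube of 2^n 
cuboids with no pruning or aggregation groups, and the cube descriptions are 
hypothetical dicts, not Kylin's cube descriptor format.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every subset of the dimensions, i.e. the full set of cuboids."""
    dims = sorted(dimensions)
    return {
        frozenset(combo)
        for size in range(len(dims) + 1)
        for combo in combinations(dims, size)
    }


cube_a = {"dimensions": ["country", "year", "category"], "measures": ["m1"]}
cube_b = {"dimensions": ["country", "year", "category"], "measures": ["m1", "m2"]}

cuboids_a = cuboids(cube_a["dimensions"])
cuboids_b = cuboids(cube_b["dimensions"])
same_cuboids = cuboids_a == cuboids_b
```

The measures never enter the computation, so two cubes that differ only in 
their measures end up with identical cuboid sets.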
These are just some thoughts about MOLAP from my time using Kylin. What do 
you think? I look forward to your reply.
There may be some mistakes or misunderstandings here; please feel free to 
correct me or discuss further if you find any.
Best regards
yuzhang