RE: Few Questions about Kylin Ability

Santoshakhilesh Sun, 10 Jul 2016 23:32:42 -0700

Hi Yang,
For my case, I think the stability of KYLIN-1726 is important. I will study the
code to see if I can do something about it.
Regards,
Santosh Akhilesh

From: Li Yang [mailto:[email protected]]
Sent: 08 July 2016 12:24
To: [email protected]
Cc: Santoshakhilesh
Subject: Re: Few Questions about Kylin Ability

Hi Santosh
Kylin supports most of your requirements, with the following limitations/notes.
- Streaming cubing is still at an experimental stage.
KYLIN-1726<https://issues.apache.org/jira/browse/KYLIN-1726> will improve
real-time analysis significantly, but it may take a few months to complete.
- Kylin does not store raw data by default. There is a RAW
measure<http://kylin.apache.org/blog/2016/05/29/raw-measure-in-kylin/> that can
store raw data to some extent, but it also has volume limitations.
- Query speed will depend largely on your dimension cardinality (not the
data volume) and on whether the cube is well defined and optimized for your
queries.
- Finally, the capacity of your cluster always plays an important part.
Kylin uses micro batches to build a streaming cube. For example, a small job is
kicked off every 5 minutes and builds the cube incrementally from the input of
the last 5 minutes. The job currently runs on a single node, which is not very
scalable. KYLIN-1726 will solve this problem.
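As a rough illustration of the micro-batch idea only (this is not Kylin code;
the function names, the event layout and the count/sum measures are all
assumptions made for the sketch), incoming events can be grouped into 5-minute
windows, and each window can be aggregated into its own small segment without
touching earlier ones:

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # batch interval; 5 minutes, as in the example above


def window_start(ts):
    """Align an epoch timestamp (in seconds) to the start of its window."""
    return ts - ts % WINDOW_SECONDS


def build_segment(events):
    """Aggregate one micro batch of (dimension, value) pairs into a segment."""
    segment = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for dim, value in events:
        cell = segment[dim]
        cell["count"] += 1
        cell["sum"] += value
    return dict(segment)


def micro_batch_build(stream):
    """Group (ts, dim, value) events by window and build one segment per window.

    Each segment is built only from its own window's input, so earlier
    segments are never recomputed: the result grows incrementally.
    """
    batches = defaultdict(list)
    for ts, dim, value in stream:
        batches[window_start(ts)].append((dim, value))
    return {start: build_segment(batch) for start, batch in sorted(batches.items())}


# Hypothetical events: (epoch seconds, area, KPI value)
events = [
    (0, "Area1", 10.0),
    (120, "Area1", 5.0),
    (299, "Area2", 1.0),
    (300, "Area1", 7.0),
]
cube = micro_batch_build(events)
```

Here the first three events land in the window starting at 0 and the last one
in the window starting at 300. Because everything runs in one process, the
sketch also shows why a single-node build does not scale with input volume.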
Kylin supports all kinds of SQL queries as long as they are within the defined
data model.
Cheers
Yang

On Sat, Jul 2, 2016 at 5:55 PM, Santosh Akhilesh 
<[email protected]<mailto:[email protected]>> wrote:
Hi All,
Last year I did a PoC for one of our products using Kylin. Our distributed
architecture journey was on hold for some time, but now we are back to
rearchitecting our system as a distributed one. I am writing this mail to
understand whether and how Kylin can fit our requirements.
Let me give some background on our requirements.
Ours is a network performance management solution which needs to handle the
following scenarios.

1. Collect data from network elements at a granularity of between 30 seconds and
5 minutes. Every period we collect around 150 million KPIs, which are
distributed across different service types. The service types are model driven
and can change over time.
2. The data we collect needs to be available for ad hoc and OLAP-type queries
ASAP. For example, data collected between 10:00 and 10:05 for a 5-minute period
should be available for reports to query by 10:06. Queries will involve joining
performance data with inventory data and will also have filters, such as
Area = Area1. We also need to sort by KPI or by a property of the inventory
with an ORDER BY clause.
3. We also need OLAP-type queries, such as group by area, province, country,
etc., applying sum, max, min and avg aggregators. We also need to generate a
Top talkers report, which means we need a Top N function.
4. There will be background machine learning jobs which need to scan raw and 
aggregated data.
5. We would be generating around 5-10 TB of data every day, and possibly more in
the future. We need to retain data for several days to months, depending on the
aggregation period.
6. Ad hoc and OLAP queries from reports should take < 2 seconds.
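
The query shapes described in items 2 and 3 can be sketched roughly as follows.
This uses Python's built-in sqlite3 module purely as a stand-in to show the
SQL; it says nothing about how Kylin would run or model these queries. The
table names, column names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (element_id INTEGER PRIMARY KEY, area TEXT, country TEXT);
    CREATE TABLE perf (element_id INTEGER, period_start TEXT, kpi_value REAL);
    INSERT INTO inventory VALUES
        (1, 'Area1', 'C1'), (2, 'Area1', 'C1'), (3, 'Area2', 'C1');
    INSERT INTO perf VALUES
        (1, '10:00', 40.0), (2, '10:00', 60.0), (3, '10:00', 90.0);
""")

# Item 2: join performance data with inventory, filter by area, sort by KPI.
adhoc = conn.execute("""
    SELECT p.element_id, p.kpi_value
    FROM perf p JOIN inventory i ON i.element_id = p.element_id
    WHERE i.area = 'Area1'
    ORDER BY p.kpi_value DESC
""").fetchall()

# Item 3: group by an inventory property and apply sum, max, min and avg.
by_area = conn.execute("""
    SELECT i.area, SUM(p.kpi_value), MAX(p.kpi_value),
           MIN(p.kpi_value), AVG(p.kpi_value)
    FROM perf p JOIN inventory i ON i.element_id = p.element_id
    GROUP BY i.area
    ORDER BY i.area
""").fetchall()

# Item 3: "top talkers", expressed here as ORDER BY ... LIMIT N.
top_n = conn.execute("""
    SELECT element_id, SUM(kpi_value) AS total
    FROM perf
    GROUP BY element_id
    ORDER BY total DESC
    LIMIT 2
""").fetchall()
```

The same three shapes (filtered join with ORDER BY, grouped aggregation, and
Top N) are the ones the questions below refer to.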
So my questions are:

1. Which of these use cases can Kylin support?
2. How long does cube building take, and how does it handle data that is
appended every 30 seconds or 5 minutes?
3. Can Kylin support both ad hoc queries and OLAP queries?

I have several other questions but I would like to initiate the discussion with 
these.
We plan to start a test with Kylin next week; I am just setting up a cluster
now. We don't plan to use the Cloudera or Hortonworks sandbox, as our company
has its own sandbox.

I would appreciate a response from the Kylin experts.

Regards
Santosh

Sent from my iPhone
