[DISCUSS] The future of Apache Kylin

Li Yang Mon, 10 Jan 2022 21:59:39 -0800

Hi All

Apache Kylin has been stable for quite a while, and it may be a good time to
think about its future. Below are thoughts from my team and myself. We
would love to hear yours as well. Ideas and comments are very welcome.  :-)

*APACHE KYLIN TODAY*

Currently, the latest release of Apache Kylin is 4.0.1. Apache Kylin 4.0 is
a major version update after Kylin 3.x (HBase storage). Kylin 4.0 replaces
HBase with Parquet as the storage engine to improve file scanning
performance. It also reimplements the build engine and query engine on
Spark, which makes it possible to separate compute from storage and better
fits the cloud native trend. With these changes, Kylin 4.0 can be deployed
without a Hadoop dependency, which reduces deployment complexity. However,
Kylin still has a lot to improve: the business semantic layer needs to be
strengthened, and modifying a model/cube is not flexible. With these in
mind, we are thinking of a few things to do:

- Multi-dimensional query ability friendly to non-technical personnel.
The multi-dimensional model is what distinguishes Kylin from general
OLAP engines. A model built on dimensions and measures is friendlier to
non-technical personnel and closer to the goal of the citizen analyst.
Multi-dimensional query capability that non-technical personnel can use
should be the new focus of Kylin technology.

- Native Engine. The query engine of Kylin still has much room for
improvement in vectorization and CPU instruction-level optimization.
The Spark community, which Kylin relies on, also has a strong demand for a
native engine. We are optimistic that a native engine can improve the
performance of Kylin by at least three times, which makes it worth the
investment.

- More cloud native capabilities. Kylin 4.0 has completed only the
initial cloud deployment work, delivering rapid deployment and dynamic
resource scaling on the cloud. Many cloud native capabilities remain to
be developed.

More details follow.

*KYLIN AS A MULTI-DIMENSIONAL DATABASE*

The core of Kylin is a multi-dimensional database, which is a special kind
of OLAP engine. Kylin has had the abilities of a relational database since
its birth and is often compared with other relational OLAP engines, but
what really makes Kylin different is its multi-dimensional model and
multi-dimensional database ability. Considering the essence of Kylin and
its wide range of future business uses (not only technical uses),
positioning Kylin as a multi-dimensional database makes perfect sense. With
business semantics and precomputation technology, Apache Kylin helps
non-technical people understand and afford big data, and so realizes data
democratization.

*THE SEMANTIC LAYER*

The key difference between the multi-dimensional database and the
relational database is the ability to express business concepts. Although
SQL is highly expressive and is a basic skill of data analysts, SQL and the
RDB are still too difficult for non-technical personnel if we aim at
"everyone is a data analyst". From the perspective of non-technical
personnel, the data lake and data warehouse are like a dark room. They know
that there is a lot of data, but they cannot see it clearly, understand it,
or use it, because they do not understand database theory and SQL.

How can we make the data lake (and data warehouse) clear to non-technical
personnel? This requires introducing a data model that is friendlier to
them: the multi-dimensional data model. While the relational model
describes the technical form of data, the multi-dimensional model describes
the business form of data. In an MDB, a measure corresponds to a business
indicator that everyone understands, and a dimension is a perspective for
comparing and observing these business indicators. Comparing a KPI with
last month's value, or comparing performance between peer business units,
are concepts that non-technical people readily understand. Mapping the
relational model to the multi-dimensional model essentially adds business
semantics on top of the technical data. This forms a business semantic
layer that helps non-technical personnel understand, explore and use the
data. To enhance Kylin's ability as the semantic layer, supporting
multi-dimensional query languages such as MDX and DAX is a key item on the
Kylin roadmap. MDX can express the data model in Kylin in a
business-friendly language, endow data with business value, and facilitate
Kylin's multi-dimensional analysis with BI tools such as Excel and Tableau.
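
To make the mapping from the relational model to the multi-dimensional
model a bit more concrete, here is a minimal sketch in Python. It is only
an illustration and not Kylin code: the SALES rows, the MODEL dictionary
and the analyze() function are all hypothetical names made up for this
example. The idea is that a user asks for a business measure by a business
dimension, and the semantic layer translates that into columns and
aggregations.

```python
from collections import defaultdict

# Hypothetical relational rows (a tiny fact table); not taken from Kylin.
SALES = [
    {"month": "2021-11", "unit": "East", "amount": 120.0},
    {"month": "2021-11", "unit": "West", "amount": 80.0},
    {"month": "2021-12", "unit": "East", "amount": 150.0},
    {"month": "2021-12", "unit": "West", "amount": 90.0},
]

# Hypothetical semantic layer: business names mapped to technical columns.
MODEL = {
    "dimensions": {"Month": "month", "Business Unit": "unit"},
    "measures": {"Revenue": ("amount", sum)},
}


def analyze(measure, by):
    """Aggregate a business measure along one business dimension."""
    column, aggregate = MODEL["measures"][measure]
    dim_column = MODEL["dimensions"][by]
    groups = defaultdict(list)
    for row in SALES:
        groups[row[dim_column]].append(row[column])
    return {key: aggregate(values) for key, values in groups.items()}


# Compare the KPI month over month, then between business units.
revenue_by_month = analyze("Revenue", by="Month")
revenue_by_unit = analyze("Revenue", by="Business Unit")
```

The caller never sees column names or SQL, only "Revenue", "Month" and
"Business Unit". A real semantic layer would of course also cover
hierarchies, calculated measures and a query language such as MDX.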

*PRECOMPUTATION AND MODEL FLEXIBILITY*

It is Kylin's unchanging mission to keep reducing the cost of a single
query through precomputation technology so that ordinary people can afford
big data. If the multi-dimensional model lets non-technical personnel
understand data, then precomputation lets ordinary people afford data. Both
are necessary conditions for data democratization. By computing once and
using the result many times, the data cost is shared by multiple users,
which achieves a scale effect: the more users, the cheaper each query.
Precomputation is Kylin's traditional strength, but changing a
precomputation model is not flexible enough today. To make model changes
more flexible and leave more room for optimization, the Kylin community
expects to propose a new metadata format in the future, so that
precomputation can cope with table formats or business requirements that
may change at any time.
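
The "compute once, use many times" idea can be sketched in a few lines of
Python. This is a toy illustration under my own assumptions, not how Kylin
builds or stores cubes: ROWS, DIMENSIONS, build_cube() and total_by() are
hypothetical names. The sketch aggregates the data once for every
combination of dimensions, after which each query is just a lookup.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (month, unit, amount). Not real Kylin data.
ROWS = [
    ("2021-11", "East", 120.0),
    ("2021-11", "West", 80.0),
    ("2021-12", "East", 150.0),
    ("2021-12", "West", 90.0),
]
DIMENSIONS = ("month", "unit")


def build_cube(rows):
    """Precompute SUM(amount) for every combination of dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), size):
            cuboid = defaultdict(float)
            for row in rows:
                key = tuple(row[i] for i in dims)
                cuboid[key] += row[-1]  # the last field is the amount
            cube[tuple(DIMENSIONS[i] for i in dims)] = dict(cuboid)
    return cube


# The expensive scan happens once, at build time.
CUBE = build_cube(ROWS)


def total_by(*dims):
    """Answer a query by lookup; dims must follow the DIMENSIONS order."""
    return CUBE[dims]


by_month = total_by("month")
grand_total = total_by()
```

Every later call to total_by() reuses the same precomputed result, so the
build cost is spread over all queries from all users. The sketch also
shows the flexibility problem: adding a dimension or changing the table
layout means rebuilding the cube.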

*SUMMARY*

To sum up, we would like to propose positioning Kylin as a
multi-dimensional database. Through the multi-dimensional model and
precomputation technology, ordinary people can understand and afford big
data, finally realizing the vision of data democratization. Meanwhile, for
today's users who use Kylin as a SQL acceleration layer, Kylin will
continue to enhance its SQL engine, so that precomputation technology can
serve both the relational model and the multi-dimensional model. In the
figure below, we picture the future of Kylin. The newly added and modified
parts are roughly marked in blue and orange.

*FURTHER READING*

- https://en.wikipedia.org/wiki/Data_model
- https://en.wikipedia.org/wiki/Semantic_layer
- https://en.wikipedia.org/wiki/Multidimensional_analysis
- https://en.wikipedia.org/wiki/MultiDimensional_eXpressions
- https://en.wikipedia.org/wiki/XML_for_Analysis
- https://en.wikipedia.org/wiki/SIMD
- https://en.wikipedia.org/wiki/Cloud_native_computing
- https://blogs.gartner.com/carlie-idoine/2018/05/13/citizen-data-scientists-and-why-they-matter/

Please share your ideas and comments. :-)

Cheers
Yang
