Hi All,

Apache Kylin has been stable for quite a while, and it may be a good time to think about its future. Below are thoughts from my team and myself. We would love to hear yours as well. Ideas and comments are very welcome. :-)

*APACHE KYLIN TODAY*

Currently, the latest release of Apache Kylin is 4.0.1. Apache Kylin 4.0 is a major version update after Kylin 3.x (HBase storage). Kylin 4.0 replaces HBase with Parquet as the storage engine to improve file scanning performance. At the same time, Kylin 4.0 reimplements the build engine and query engine on Spark, which makes it possible to separate compute from storage and to better follow the cloud-native trend. Kylin 4.0 comprehensively updated the build and query engines and introduced a deployment mode without a Hadoop dependency, which reduces deployment complexity.

However, Kylin still has a lot to improve. For example, its business semantic layer needs to be strengthened, and modifying a model/cube is not flexible. With these in mind, we are thinking of a few things to do:

- Multi-dimensional query ability that is friendly to non-technical personnel. The multi-dimensional model is what distinguishes Kylin from general OLAP engines. A model built on dimensions and measures is friendlier to non-technical personnel and closer to the goal of the citizen analyst. Multi-dimensional query capability that non-technical personnel can use should be the new focus of Kylin technology.
- Native engine. Kylin's query engine still has much room for improvement in vectorized execution and CPU instruction-level optimization. The Spark community that Kylin relies on also has a strong demand for a native engine. We are optimistic that a native engine could improve Kylin's performance by at least three times, which makes it worth the investment.
- More cloud-native capabilities. Kylin 4.0 has only completed the initial cloud deployment, with rapid deployment and dynamic resource scaling on the cloud, but there are still many cloud-native capabilities to be developed.

More explanation follows.

*KYLIN AS A MULTI-DIMENSIONAL DATABASE*

The core of Kylin is a multi-dimensional database, which is a special kind of OLAP engine. Kylin has had relational database abilities since its birth and is often compared with other relational OLAP engines, but what really makes Kylin different is its multi-dimensional model and multi-dimensional database ability. Considering the essence of Kylin and its wide range of future business uses (not only technical uses), positioning Kylin as a multi-dimensional database makes perfect sense. With business semantics and precomputation technology, Apache Kylin helps non-technical people understand and afford big data, and so realizes data democratization.

*THE SEMANTIC LAYER*

The key difference between a multi-dimensional database and a relational database is the ability to express business concepts. Although SQL is highly expressive and is a basic skill for data analysts, SQL and the relational database are still too difficult for non-technical personnel if we aim at "everyone is a data analyst". From the perspective of non-technical personnel, the data lake and the data warehouse are like a dark room. They know that there is a lot of data, but they cannot see it clearly, understand it, or use it, because they do not understand database theory and SQL.

How can we make the data lake (and the data warehouse) clear to non-technical personnel? This requires a data model that is friendlier to them: the multi-dimensional data model. While the relational model describes the technical form of data, the multi-dimensional model describes the business form of data. In a multi-dimensional database (MDB), a measure corresponds to a business indicator that everyone understands, and a dimension is a perspective for comparing and observing these indicators. Comparing a KPI with last month, or comparing performance between parallel business units, are concepts that every non-technical person understands.

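To make the measure/dimension idea concrete, here is a minimal Python sketch. It is not Kylin code and does not use any Kylin API; the rows, the column names, the `MODEL` dictionary and the `slice_measure` helper are all made up for illustration. It only shows how business names can sit on top of relational columns so that "Revenue by Month" becomes a question someone can ask without writing SQL.

```python
from collections import defaultdict

# Hypothetical relational rows (a flat fact table); names and numbers are invented.
SALES_ROWS = [
    {"month": "2022-01", "business_unit": "East", "amount": 100.0},
    {"month": "2022-01", "business_unit": "West", "amount": 80.0},
    {"month": "2022-02", "business_unit": "East", "amount": 120.0},
    {"month": "2022-02", "business_unit": "West", "amount": 90.0},
]

# A toy semantic layer: business names mapped onto technical columns.
MODEL = {
    "dimensions": {"Month": "month", "Business Unit": "business_unit"},
    "measures": {"Revenue": ("amount", sum)},
}


def slice_measure(rows, model, measure, by):
    """Aggregate one measure along one dimension, using business names only."""
    dim_col = model["dimensions"][by]
    value_col, aggregate = model["measures"][measure]
    groups = defaultdict(list)
    for row in rows:
        groups[row[dim_col]].append(row[value_col])
    return {member: aggregate(values) for member, values in groups.items()}


# "Compare the KPI with last month"
revenue_by_month = slice_measure(SALES_ROWS, MODEL, "Revenue", by="Month")
month_over_month = revenue_by_month["2022-02"] - revenue_by_month["2022-01"]

# "Compare performance between parallel business units"
revenue_by_unit = slice_measure(SALES_ROWS, MODEL, "Revenue", by="Business Unit")
```

The caller only ever names a measure ("Revenue") and a dimension ("Month" or "Business Unit"); the mapping to `amount`, `month` and `business_unit` stays inside the model definition.
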
Mapping the relational model to the multi-dimensional model is, in essence, a way to add business semantics on top of technical data. The result is a business semantic layer that helps non-technical personnel understand, explore and use the data. To strengthen Kylin's ability as a semantic layer, supporting multi-dimensional query languages such as MDX and DAX is a key item on the Kylin roadmap. MDX can express the data model in Kylin in business-friendly terms, give the data business value, and make it easier to do multi-dimensional analysis on Kylin with BI tools such as Excel and Tableau.

*PRECOMPUTATION AND MODEL FLEXIBILITY*

It is Kylin's unchanging mission to keep reducing the cost of a single query through precomputation, so that ordinary people can afford big data. If the multi-dimensional model solves the problem of non-technical personnel understanding data, then precomputation solves the problem of ordinary people affording data. Both are necessary conditions for data democratization. By computing once and using the result many times, the data cost is shared among many users, which gives a scale effect: the more users, the cheaper each query.

Precomputation is Kylin's traditional strength, but changing a precomputation model is not flexible enough today. To make model changes in Kylin more flexible and to leave more room for optimization, the Kylin community expects to propose a new metadata format in the future. The goal is to make precomputation more flexible, so that it can cope with table formats or business requirements that may change at any time.

*SUMMARY*

To sum up, we would like to position Kylin as a multi-dimensional database. Through the multi-dimensional model and precomputation technology, ordinary people can understand and afford big data, which finally realizes the vision of data democratization.

Meanwhile, for today's users who use Kylin as a SQL acceleration layer, Kylin will continue to enhance its SQL engine, to ensure that precomputation can be used by both the relational model and the multi-dimensional model.

In the figure below, we picture the future of Kylin. The newly added and modified parts are roughly marked in blue and orange.

*FURTHER READING*

- https://en.wikipedia.org/wiki/Data_model
- https://en.wikipedia.org/wiki/Semantic_layer
- https://en.wikipedia.org/wiki/Multidimensional_analysis
- https://en.wikipedia.org/wiki/MultiDimensional_eXpressions
- https://en.wikipedia.org/wiki/XML_for_Analysis
- https://en.wikipedia.org/wiki/SIMD
- https://en.wikipedia.org/wiki/Cloud_native_computing
- https://blogs.gartner.com/carlie-idoine/2018/05/13/citizen-data-scientists-and-why-they-matter/

Please share your ideas and comments. :-)

Cheers
Yang
