Looking forward to it; I believe it will bring a great performance improvement.
weibin0516
[email protected]
Best wishes!

On 01/19/2020 22:22, George Ni <[email protected]> wrote:

Hi Kylin users & developers,

By-layer Spark Cubing was introduced in Apache Kylin 2.0 to achieve better performance, and it does run much faster than the MapReduce engine. HBase has also been Kylin's trusted storage engine since Kylin was born, and it has proven successful at serving highly concurrent queries over extremely large data sets with low latency. But HBase has limitations as well: filtering is not flexible because we can only filter by RowKey, and measures are usually packed together, which causes more data to be scanned than requested.

So, in order to optimize Kylin in both the building strategy and the storage engine, the development team of Kyligence is introducing a new cube building engine which uses Spark SQL to construct cuboids with a new strategy and stores the cube results in Parquet files. The building strategy allows Kylin to build cuboids in a smarter way by choosing and building from the optimal cuboid source. Parquet, a columnar storage format available to any project in the Hadoop ecosystem, will power the filtering ability with its page-level column index and reduce I/O by storing measures in separate columns. Also, by storing cuboids in Parquet instead of HBase, we can use Kylin in a cloud-native way. More information on the design and technical details will come soon.

Below is the comparison of building duration and result size between By-layer Spark Cubing and the new cubing strategy.

Environment
- 4-node Hadoop cluster; YARN has 400 GB RAM and 128 cores in total
- CDH 5.1, Apache Kylin 3.0

Spark
- Spark 2.4.1-kylin-r17

Test Data
- SSB data
- Cube: 15 dimensions, 3 measures (SUM)

Test Scenarios
- Build the cube at different source sizes: 30 million and 60 million source rows
- Compare the build time of Spark (by layer) + HBase with that of Spark SQL + Parquet

Besides, we attempt to resolve several drawbacks in the current query engine, which relies heavily on Apache Calcite, such as the performance bottleneck of aggregating large query results, which currently can only be done by a single worker. By embracing Spark SQL, this kind of expensive computation can be distributed across the cluster. Combined with the Parquet format, plenty of filtering optimizations can be applied, which will boost Kylin's query performance significantly.

These features will be open-sourced, along with the technical details, in the near future.

- https://issues.apache.org/jira/browse/KYLIN-4188

--
---------------------
Best regards,
Ni Chunen / George
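
To make the described approach a bit more concrete, here is a minimal, hypothetical Spark (Scala) sketch of the general idea: build a child cuboid with Spark SQL by grouping a parent cuboid on a subset of its dimensions, then write the result as Parquet. This is not the actual Kylin implementation; the dimension names (d1, d2, d3), the measure name (m1), and the HDFS paths are made up for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    object CuboidBuildSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cuboid-build-sketch")
          .getOrCreate()

        // Hypothetical parent cuboid with dimensions d1, d2, d3 and a SUM measure m1,
        // already written as Parquet by an earlier build step.
        val parent = spark.read.parquet("hdfs:///kylin/cube1/cuboid_d1_d2_d3")

        // Build the child cuboid (d1, d2) from the smaller parent instead of
        // re-scanning the source table: group on the remaining dimensions and
        // re-aggregate the measure.
        val child = parent
          .groupBy("d1", "d2")
          .agg(sum("m1").as("m1"))

        // Store the cuboid as Parquet. Columnar layout means a query reads only
        // the columns it needs, and Parquet statistics/column indexes support
        // filter pushdown on any dimension, not just a RowKey prefix.
        child.write.mode("overwrite").parquet("hdfs:///kylin/cube1/cuboid_d1_d2")

        spark.stop()
      }
    }

At query time, a predicate such as WHERE d1 = 'x' can be pushed down to the Parquet reader, so only the matching row groups and pages are scanned; this is the general Parquet capability the message refers to, rather than a Kylin-specific API.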
