Great news! I can foresee that Kylin could become more cloud-native once the Parquet storage matures. I hope the development team will share more details about its design.
--
Best wishes to you!
From: Xiaoxiang Yu

At 2020-01-19 22:22:30, "George Ni" <[email protected]> wrote:
>Hi Kylin users & developers,
>
>By-layer Spark Cubing has been part of Apache Kylin since v2.0 to
>achieve better performance, and it does run much faster than the MR
>engine. HBase has also been Kylin’s trusted storage engine since Kylin was
>born, and it has proved successful at handling high-concurrency queries
>over extremely large data sets with low latency. But HBase also has
>limitations: filtering is not flexible because we can only filter by
>RowKey, and measures are usually combined together, which causes more data
>to be scanned than requested.
>
>So, in order to optimize both Kylin’s building strategy and its storage
>engine, the Kyligence development team is introducing a new cube building
>engine that uses Spark SQL to construct cuboids with a new strategy and
>stores cube results in Parquet files. The building strategy allows Kylin to
>build cuboids in a smarter way by choosing and building on the optimal
>cuboid source. And Parquet, a columnar storage format available to any
>project in the Hadoop ecosystem, will power the filtering ability with its
>page-level column index and reduce I/O by saving measures in different
>columns. Also, by storing cuboids in Parquet instead of HBase, we can use
>Kylin in a cloud-native way. More information on design and technical
>details will come soon.
>
>Below is the comparison of build duration and result size between
>By-layer Spark Cubing and the new cubing strategy.
>
>Environment
>
>4-node Hadoop cluster
>YARN has 400GB RAM and 128 cores in total;
>CDH 5.1, Apache Kylin 3.0.
>
>Spark
>
>Spark 2.4.1-kylin-r17
>
>Test Data
>
>SSB data
>Cube: 15 dimensions, 3 measures (SUM)
>
>Test Scenarios
>
>Build the cube at different source sizes: 30 million and 60 million
>source rows; compare the build time of Spark (by layer) + HBase with that
>of Spark SQL + Parquet.
>
>Besides, we are attempting to resolve many drawbacks in the current query
>engine, which relies heavily on Apache Calcite, such as the performance
>bottleneck in aggregating large query results, which currently can only be
>done by a single worker. By embracing Spark SQL, this kind of expensive
>computation can be distributed. Combined with the Parquet format, plenty of
>filtering optimizations can also be applied, which will boost Kylin’s query
>performance significantly. The features will be open-sourced along with
>technical details in the near future.
>
> - https://issues.apache.org/jira/browse/KYLIN-4188
>
>--
>---------------------
>Best regards,
>
>Ni Chunen / George

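As a rough illustration of the point in the quoted message about measures being combined together versus saved in different columns, the following minimal sketch compares how many bytes a SUM over a single measure has to touch in each layout. It is a toy model in plain Python, not Kylin, HBase, or Parquet code; the rows, the measure names (`revenue`, `supply_cost`, `quantity`), and the fixed 8-byte integer encoding are all assumptions made up for the example.

```python
import struct

# Hypothetical cuboid rows with three SUM measures each (names are made up).
MEASURES = ("revenue", "supply_cost", "quantity")
rows = [(100 * i, 60 * i, i) for i in range(1, 1001)]

# Layout A: all measures packed into one value per row, a toy stand-in for
# measures that are combined together in a single cell.
combined = [struct.pack("<3q", *r) for r in rows]

# Layout B: one contiguous buffer per measure, a toy stand-in for a
# columnar file where each measure lives in its own column.
columnar = {
    name: struct.pack(f"<{len(rows)}q", *(r[i] for r in rows))
    for i, name in enumerate(MEASURES)
}


def sum_combined(measure):
    """Sum one measure; every packed cell must be read in full."""
    idx = MEASURES.index(measure)
    total = 0
    scanned = 0
    for cell in combined:
        scanned += len(cell)
        total += struct.unpack("<3q", cell)[idx]
    return total, scanned


def sum_columnar(measure):
    """Sum one measure; only that measure's buffer is read."""
    buf = columnar[measure]
    values = struct.unpack(f"<{len(buf) // 8}q", buf)
    return sum(values), len(buf)


total_a, bytes_a = sum_combined("quantity")
total_b, bytes_b = sum_columnar("quantity")
print(total_a == total_b, bytes_a, bytes_b)  # True 24000 8000
```

In this toy model the columnar layout reads one third of the bytes for a query that needs one of three measures. Real Parquet files also apply encoding and compression, so actual savings will differ from this simple count.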