Chun en, Thanks for the info. I think we need to discuss more in the community, for example:
1) When the Parquet storage is released (say, in Kylin 4.0), will the HBase storage be kept alongside it, or be replaced entirely?
2) Will there be a migration tool for moving HBase cubes to the new storage?

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: [email protected]

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]

nichunen <[email protected]> wrote on Mon, Jan 20, 2020 at 9:38 PM:

> Hi Shaofeng,
>
> Below is our plan for this project; any suggestions are very welcome.
>
> 1. In mid-February 2020, open source the prototype code of this feature
> on the branch "kylin-on-parquet-v2". Cubes can then be built with the new
> build engine and stored in Parquet format.
>
> 2. In late April 2020, the query module for the new storage type is
> scheduled to be ready. A happy path for cube creation, building, and
> query will be available then.
>
> 3. In May or June 2020, a Beta version (Kylin 4.0?) will be released.
>
> Best regards,
>
> Ni Chunen / George
>
> On 01/20/2020 16:00, ShaoFeng Shi <[email protected]> wrote:
> Hi, Chun en,
>
> Thanks for the information. What's the detailed release plan of this
> feature to the community?
>
> Best regards,
>
> Shaofeng Shi 史少锋
>
> Xiaoxiang Yu <[email protected]> wrote on Mon, Jan 20, 2020 at 1:59 PM:
>
> Great news!
> I can foresee Kylin becoming more cloud-native once the Parquet storage
> matures. I hope the development team will share more details of its
> design.
>
> --
> Best wishes to you!
> From: Xiaoxiang Yu
>
> At 2020-01-19 22:22:30, "George Ni" <[email protected]> wrote:
> Hi Kylin users & developers,
>
> By-layer Spark Cubing was introduced in Apache Kylin v2.0 to achieve
> better performance, and it does run much faster than the MR engine.
> HBase has been Kylin's trusted storage engine since Kylin was born, and
> it has proved successful at serving high-concurrency queries over
> extremely large data sets with low latency. But HBase also has
> limitations: filtering is inflexible because we can only filter by
> RowKey, and measures are usually stored together, which causes more data
> to be scanned than requested.
>
> So, to optimize both the building strategy and the storage engine, the
> Kyligence development team is introducing a new cube building engine that
> uses Spark SQL to construct cuboids with a new strategy and stores cube
> results in Parquet files. The building strategy allows Kylin to build
> cuboids in a smarter way by choosing the optimal source cuboid and
> building on it. Parquet, a columnar storage format available to any
> project in the Hadoop ecosystem, will strengthen filtering through its
> page-level column index and reduce I/O by saving measures in separate
> columns. Also, by storing cuboids in Parquet instead of HBase, we can run
> Kylin in a cloud-native way. More information on the design and technical
> details will come soon.
>
> Below is the comparison in building duration and size of results between
> By-layer Spark Cubing and the new cubing strategy.
>
> Environment
> 4-node Hadoop cluster
> YARN has 400GB RAM and 128 cores in total;
> CDH 5.1, Apache Kylin 3.0.
>
> Spark
> Spark 2.4.1-kylin-r17
>
> Test Data
> SSB data
> Cube: 15 dimensions, 3 measures (SUM)
>
> Test Scenarios
> Build the cube at different source size levels: 30 million and 60 million
> source rows. Compare the build time of Spark (by layer) + HBase with that
> of Spark SQL + Parquet.
>
> Besides, we are attempting to resolve several drawbacks of the current
> query engine, which relies heavily on Apache Calcite, such as the
> performance bottleneck when aggregating large query results, which can
> currently be handled only by a single worker. By embracing Spark SQL,
> this kind of expensive computation can be distributed. Combined with the
> Parquet format, many filtering optimizations can also be applied, which
> will boost Kylin's query performance significantly. These features will
> be open sourced, along with technical details, in the near future.
>
> - https://issues.apache.org/jira/browse/KYLIN-4188
>
> --
> ---------------------
> Best regards,
>
> Ni Chunen / George
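
To make the I/O argument in the original announcement concrete (measures stored together force a scan of more data than requested, while one column per measure does not), here is a toy sketch in plain Python. It is only an illustration under stated assumptions: the rows, the measure names m1/m2/m3, and the byte layout are all hypothetical, and this is not how HBase or Parquet actually encode data in Kylin.

```python
import struct

# Hypothetical cuboid rows: (dimension key, three SUM measures).
ROWS = [("2020-01-%02d" % d, d * 10, d * 100, d * 1000) for d in range(1, 6)]
MEASURES = ("m1", "m2", "m3")  # illustrative names, not Kylin identifiers


def pack_combined(rows):
    """Toy combined layout: all measures of a row share one packed value."""
    return {key: struct.pack("<3q", m1, m2, m3) for key, m1, m2, m3 in rows}


def pack_columnar(rows):
    """Toy columnar layout: each measure lives in its own column chunk."""
    return {
        name: struct.pack("<%dq" % len(rows), *(row[i + 1] for row in rows))
        for i, name in enumerate(MEASURES)
    }


def sum_combined(store, measure):
    """Sum one measure; every packed value must be read in full."""
    index = MEASURES.index(measure)
    total = scanned = 0
    for value in store.values():
        scanned += len(value)  # all three measures are read together
        total += struct.unpack("<3q", value)[index]
    return total, scanned


def sum_columnar(store, measure):
    """Sum one measure; only that measure's column chunk is read."""
    chunk = store[measure]
    values = struct.unpack("<%dq" % (len(chunk) // 8), chunk)
    return sum(values), len(chunk)


if __name__ == "__main__":
    print("combined:", sum_combined(pack_combined(ROWS), "m2"))
    print("columnar:", sum_columnar(pack_columnar(ROWS), "m2"))
```

Both functions return the same sum, but in this toy model the combined layout touches the bytes of all three measures while the columnar layout touches only the requested one. The real saving in Kylin would depend on encoding, compression, and the page-level column index, none of which are modeled here.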
