Re: Kylin Building Engine With SparkSql & Parquet

ShaoFeng Shi Tue, 21 Jan 2020 01:11:06 -0800

Chun en,

Thanks for the info. I think we need to discuss more in the community, for
example:


1) When the Parquet storage is released (say in Kylin 4.0), will the HBase
storage be kept alongside it, or be replaced entirely?
2) Will there be a migration tool for moving HBase cubes to the new storage?

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: [email protected]

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]




nichunen <[email protected]> wrote on Monday, January 20, 2020 at 9:38 PM:

> Hi Shaofeng,
>
>
> Below is our plan for this project, any suggestion will be very welcome.
>
>
> 1. In mid-February 2020, open-source the prototype code of this feature
> on the branch "kylin-on-parquet-v2". Cubes can then be built with the new
> build engine and stored in Parquet format.
>
>
> 2. In late April 2020, the query module for the new storage type is
> scheduled to be ready. A happy path for cube creation, building, and
> querying will be available then.
>
>
> 3. In May or June of 2020, a Beta version (Kylin 4.0?) will be released.
>
>
>
> Best regards,
>
>
>
> Ni Chunen / George
>
>
>
> On 01/20/2020 16:00, ShaoFeng Shi <[email protected]> wrote:
> Hi, Chun en,
>
> Thanks for the information. What's the detailed release plan of this
> feature to the community?
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
>
>
>
>
> Xiaoxiang Yu <[email protected]> wrote on Monday, January 20, 2020 at 1:59 PM:
>
> Great news!
> I can foresee Kylin becoming more cloud-native once the Parquet storage
> matures. I hope the development team will share more details about its
> design.
>
>
>
>
> --
>
> Best wishes to you !
> From: Xiaoxiang Yu
>
>
>
> At 2020-01-19 22:22:30, "George Ni" <[email protected]> wrote:
> Hi Kylin users & developers,
>
> By-layer Spark Cubing has been part of Apache Kylin since v2.0 to achieve
> better performance, and it does run much faster than the MR engine. HBase
> has been Kylin's trusted storage engine since Kylin was born, and it has
> proved successful at handling highly concurrent queries over extremely
> large data sets with low latency. But HBase also has limitations: filtering
> is inflexible because we can only filter by RowKey, and measures are
> usually stored together, which causes more data to be scanned than
> requested.
>
>
>
> So, to optimize both Kylin's building strategy and its storage engine, the
> Kyligence development team is introducing a new cube building engine that
> uses Spark SQL to construct cuboids with a new strategy and stores the cube
> results in Parquet files. The building strategy lets Kylin build cuboids in
> a smarter way by choosing the optimal source cuboid to build each one from.
> Parquet, a columnar storage format available to any project in the Hadoop
> ecosystem, strengthens filtering with its page-level column index and
> reduces I/O by storing measures in separate columns. Storing cuboids in
> Parquet instead of HBase also lets us run Kylin in a cloud-native way. More
> information on the design and technical details will come soon.
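
To make "choosing the optimal source cuboid" concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not the engine's actual logic: the function name pick_source_cuboid, the use of row count as the cost, and the sample dimensions are all hypothetical. The idea shown is that a cuboid can be aggregated from any already-built cuboid containing all of its dimensions, so a reasonable heuristic is to pick the smallest such parent rather than always the one from the previous layer.

```python
from typing import Dict, FrozenSet, Optional

Cuboid = FrozenSet[str]  # a cuboid is identified by its set of dimensions


def pick_source_cuboid(target: Cuboid, built: Dict[Cuboid, int]) -> Optional[Cuboid]:
    """Pick the cheapest already-built cuboid that can produce `target`.

    `built` maps each finished cuboid to its row count (an assumed cost).
    A cuboid qualifies as a source if it contains every dimension of
    `target`; among the qualifying ones, the one with the fewest rows wins.
    """
    candidates = [c for c in built if target <= c and c != target]
    if not candidates:
        return None  # nothing suitable; the caller would fall back to the source table
    # Tie-break on sorted dimension names so the choice is deterministic.
    return min(candidates, key=lambda c: (built[c], sorted(c)))


if __name__ == "__main__":
    # Hypothetical row counts for three cuboids that are already built.
    built = {
        frozenset({"year", "city", "brand"}): 1_000_000,
        frozenset({"year", "city"}): 50_000,
        frozenset({"year", "brand"}): 8_000,
    }
    print(pick_source_cuboid(frozenset({"year"}), built))
```

In this example the "year" cuboid would be aggregated from the 8,000-row {year, brand} cuboid instead of the 1,000,000-row base cuboid.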
>
>
>
> Below is the comparison in building duration and size of results between
> By-layer Spark Cubing and the new cubing strategy.
>
>
>
> Environment
>
> 4-node Hadoop cluster
>
> YARN has 400GB RAM and 128 cores in total;
>
> CDH 5.1, Apache Kylin 3.0.
>
>
>
> Spark
>
> Spark 2.4.1-kylin-r17
>
>
>
> Test Data
>
> SSB data
>
> Cube: 15 dimensions, 3 measures (SUM)
>
>
>
> Test Scenarios
>
> Build the cube at different source sizes: 30 million and 60 million
> source rows. Compare the build time of Spark (by layer) + HBase with that
> of Spark SQL + Parquet.
>
>
> Besides, we are trying to resolve several drawbacks of the current query
> engine, which relies heavily on Apache Calcite, such as the performance
> bottleneck in aggregating large query results, which currently can only be
> handled by a single worker. With Spark SQL, this kind of expensive
> computation can be distributed. Combined with the Parquet format, many
> filtering optimizations can also be applied, which will boost Kylin's query
> performance significantly. These features will be open-sourced along with
> the technical details in the near future.
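
The difference between single-worker and distributed aggregation can be sketched in plain Python. This is an illustration only and does not use the Spark API: the function names partial_sum and merge_sums and the sample rows are hypothetical, and the partitions are processed sequentially here where a cluster would process them in parallel. Each partition is first reduced to a small partial result, so the final step merges a few small dictionaries instead of aggregating every row on one worker.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (group-by key, measure value)


def partial_sum(partition: Iterable[Row]) -> Dict[str, int]:
    """Aggregate one partition locally, as each worker would."""
    acc: Dict[str, int] = defaultdict(int)
    for key, value in partition:
        acc[key] += value
    return dict(acc)


def merge_sums(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Combine the per-partition results into the final SUM per key."""
    total: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


# Hypothetical query result split across three partitions.
partitions: List[List[Row]] = [
    [("2019", 10), ("2020", 5)],
    [("2019", 7)],
    [("2020", 1), ("2021", 4)],
]

if __name__ == "__main__":
    print(merge_sums(partial_sum(p) for p in partitions))
```

This two-step shape works for SUM because partial sums can be added in any order; measures that are not decomposable this way would need a different merge step.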
>
>
>
> - https://issues.apache.org/jira/browse/KYLIN-4188
>
>
> --
>
> ---------------------
>
> Best regards,
>
>
>
> Ni Chunen / George
>
>
