Re: Kylin Building Engine With SparkSql & Parquet

Luke Han Tue, 04 Feb 2020 22:05:13 -0800

I agree, one storage for next-gen Kylin is good enough.
But I would like to keep the storage interface in line with today's best
practices, so that people can easily extend it to other storage options.


Best Regards!
---------------------

Luke Han


On Sat, Feb 1, 2020 at 9:13 PM ShaoFeng Shi <[email protected]> wrote:

> In my opinion, it is very hard to maintain HBase storage and Parquet
> storage together. Once Parquet storage is stable enough, Kylin 4.0 no
> longer needs to depend on HBase.
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Email: [email protected]
>
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: [email protected]
> Join Kylin dev mail group: [email protected]
>
>
>
>
> On Thu, Jan 30, 2020 at 11:04 PM nichunen <[email protected]> wrote:
>
> > Hi Shaofeng,
> >
> >
> > For your questions:
> >
> >
> > 1) When the Parquet storage is released (say in Kylin 4.0), will the
> > HBase storage still be kept (co-exist), or totally be replaced?
> > I think we will keep an active branch with releases for HBase storage; it
> > won’t be totally replaced in the near future.
> >
> > 2) Is there a migration tool for migrating HBase cubes to the new
> > storage?
> >
> > The tool is in the development plan. What’s more, the metadata will be
> > compatible.
> >
> >
> >
> > Best regards,
> >
> >
> >
> > Ni Chunen / George
> >
> >
> > On 2020/1/21, 4:10 AM, "ShaoFeng Shi" <[email protected]> wrote:
> >
> > Chun en,
> >
> > Thanks for the info. I think we need to discuss more in the community,
> > for example:
> >
> > 1) When the Parquet storage is released (say in Kylin 4.0), will the
> > HBase storage still be kept (co-exist), or totally be replaced?
> > 2) Is there a migration tool for migrating HBase cubes to the new
> > storage?
> >
> > Best regards,
> >
> > Shaofeng Shi 史少锋
> > Apache Kylin PMC
> > Email: [email protected]
> >
> > Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> > Join Kylin user mail group: [email protected]
> > Join Kylin dev mail group: [email protected]
> >
> >
> >
> >
> > On Mon, Jan 20, 2020 at 9:38 PM nichunen <[email protected]> wrote:
> >
> > Hi Shaofeng,
> >
> >
> > Below is our plan for this project; any suggestions are very welcome.
> >
> >
> > 1. In mid-February 2020, open-source the prototype code of this feature
> > to the branch "kylin-on-parquet-v2"; cubes can then be built with the new
> > build engine and stored in Parquet format.
> >
> >
> > 2. In late April 2020, the query module for the new storage type is
> > scheduled to be ready; a happy path for cube creation, building, and
> > querying will be available then.
> >
> >
> > 3. In May or June of 2020, a Beta version (Kylin 4.0?) will be released.
> >
> >
> >
> > Best regards,
> >
> >
> >
> > Ni Chunen / George
> >
> >
> >
> > On 01/20/2020 16:00, ShaoFeng Shi <[email protected]> wrote:
> > Hi, Chun en,
> >
> > Thanks for the information. What's the detailed release plan of this
> > feature to the community?
> >
> > Best regards,
> >
> > Shaofeng Shi 史少锋
> > Apache Kylin PMC
> > Email: [email protected]
> >
> > Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> > Join Kylin user mail group: [email protected]
> > Join Kylin dev mail group: [email protected]
> >
> >
> >
> >
> > On Mon, Jan 20, 2020 at 1:59 PM Xiaoxiang Yu <[email protected]> wrote:
> >
> > Great news!
> > I can foresee Kylin becoming more cloud-native once the Parquet storage
> > matures. I hope the development team will share more details about
> > its design.
> >
> >
> >
> >
> > --
> >
> > Best wishes to you !
> > From: Xiaoxiang Yu
> >
> >
> >
> > At 2020-01-19 22:22:30, "George Ni" <[email protected]> wrote:
> > Hi Kylin users & developers,
> >
> > By-layer Spark Cubing was introduced into Apache Kylin in v2.0 to
> > achieve better performance, and it does run much faster than the MR
> > engine. HBase has been Kylin’s trusted storage engine since Kylin was
> > born, and it has proved successful at handling high-concurrency queries
> > over extremely large data sets with low latency. But HBase also has
> > limitations: filtering is not flexible, because we can only filter by
> > RowKey, and measures are usually stored together, which causes more data
> > to be scanned than requested.
> >
> >
> >
> > So, in order to optimize Kylin in both building strategy and storage
> > engine, the Kyligence development team is introducing a new cube building
> > engine, which uses Spark SQL to construct cuboids with a new strategy and
> > stores cube results in Parquet files. The building strategy allows Kylin
> > to build cuboids in a smarter way by choosing the optimal source cuboid
> > and building on it. Parquet, a columnar storage format available to any
> > project in the Hadoop ecosystem, will power filtering with its page-level
> > column index and reduce I/O by saving measures in separate columns. Also,
> > by storing cuboids in Parquet instead of HBase, we can run Kylin in a
> > cloud-native way. More information on the design and technical details
> > will come soon.
> >
> >
> >
> > Below is the comparison of build duration and result size between
> > By-layer Spark Cubing and the new cubing strategy.
> >
> >
> >
> > Environment
> >
> > 4-node Hadoop cluster
> >
> > YARN has 400 GB RAM and 128 cores in total;
> >
> > CDH 5.1, Apache Kylin 3.0.
> >
> >
> >
> > Spark
> >
> > Spark 2.4.1-kylin-r17
> >
> >
> >
> > Test Data
> >
> > SSB data
> >
> > Cube: 15 dimensions, 3 measures (SUM)
> >
> >
> >
> > Test Scenarios
> >
> > Build the cube at different source sizes (30 million and 60 million
> > source rows) and compare the build time of Spark (by layer) + HBase with
> > that of Spark SQL + Parquet.
> >
> >
> > Besides, we are attempting to resolve many drawbacks of the current query
> > engine, which relies heavily on Apache Calcite, such as the performance
> > bottleneck in aggregating large query results, which currently can only
> > be done by a single worker. By embracing Spark SQL, this kind of expensive
> > computation can be distributed. Combined with the Parquet format, plenty
> > of filtering optimizations can also be applied, which will boost Kylin’s
> > query performance significantly. These features will be open-sourced,
> > along with technical details, in the near future.
> >
> >
> >
> > - https://issues.apache.org/jira/browse/KYLIN-4188
> >
> >
> > --
> >
> > ---------------------
> >
> > Best regards,
> >
> >
> >
> > Ni Chunen / George
> >
> >
> >
> >
>
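
To make the "optimal cuboid source" idea from the original announcement concrete, here is a minimal sketch. It is not Kylin's actual code. It assumes that "optimal" means the already-built cuboid with the fewest rows whose dimensions cover the target cuboid; the function name pick_source, the dimension names, and the row counts are all hypothetical.

```python
from typing import Dict, FrozenSet, Optional

# A cuboid is identified here by the set of dimensions it keeps (assumption).
Cuboid = FrozenSet[str]


def pick_source(target: Cuboid, built: Dict[Cuboid, int]) -> Optional[Cuboid]:
    """Return the cheapest already-built cuboid that can produce `target`.

    `built` maps each already-built cuboid to its row count. A cuboid can
    serve as a source only if it contains every dimension of the target.
    """
    candidates = [c for c in built if target <= c]
    if not candidates:
        return None
    # Fewest rows wins; the sorted dimension list is a deterministic tie-break.
    return min(candidates, key=lambda c: (built[c], sorted(c)))


# Hypothetical row counts for three cuboids that have already been built.
built = {
    frozenset({"customer", "part", "supplier"}): 1_000_000,
    frozenset({"customer", "part"}): 200_000,
    frozenset({"part", "supplier"}): 150_000,
}

source = pick_source(frozenset({"part"}), built)
print(sorted(source))  # ['part', 'supplier']
```

Building from the smallest covering cuboid, rather than always from the parent in the layer above, is what lets a strategy like this aggregate fewer rows per cuboid.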

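The announcement also credits Parquet's page-level column index with better filtering. The column index records minimum and maximum values per page, so a reader can skip pages that cannot match a predicate. The sketch below only illustrates that pruning idea in plain Python; the PageStats class and the pages_to_read function are hypothetical names, not part of Parquet, Spark, or Kylin.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Min/max of one column within one page (illustrative stand-in)."""
    min_value: int
    max_value: int


def pages_to_read(pages: List[PageStats], low: int, high: int) -> List[int]:
    """Return indices of pages that may hold values in [low, high].

    A page is skipped when its [min, max] range does not overlap the
    requested range, so its data never has to be read or decoded.
    """
    return [
        i for i, p in enumerate(pages)
        if p.max_value >= low and p.min_value <= high
    ]


# Three pages of a sorted column, with made-up statistics.
pages = [PageStats(1, 10), PageStats(11, 20), PageStats(21, 30)]
print(pages_to_read(pages, 12, 15))  # [1]
```

With RowKey-only filtering, a predicate on a non-leading dimension cannot narrow the scan this way, which is the limitation of HBase that the announcement describes.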