Re: Kylin Building Engine With SparkSql & Parquet

Liu ehter Fri, 24 Jan 2020 08:51:04 -0800

Sounds exciting. All great features!



On 2020/1/21, 4:10 AM, "ShaoFeng Shi" <[email protected]> wrote:

    Chun en,
    
    Thanks for the info. I think we need to discuss more in the community, for
    example:
    
    1) When the Parquet storage is released (say in Kylin 4.0), will the HBase
    storage still be kept (co-exist), or totally be replaced?
    2) Is there a migration tool for migrating HBase cubes to the new storage?
    
    Best regards,
    
    Shaofeng Shi 史少锋
    Apache Kylin PMC
    Email: [email protected]
    
    Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
    Join Kylin user mail group: [email protected]
    Join Kylin dev mail group: [email protected]
    
    
    
    
    nichunen <[email protected]> wrote on Mon, Jan 20, 2020 at 9:38 PM:
    
    > Hi Shaofeng,
    >
    >
    > Below is our plan for this project; any suggestions are very welcome.
    >
    >
    > 1. In mid-February of 2020, open source the prototype code of this feature
    > on the branch "kylin-on-parquet-v2"; cubes can be built with the new building
    > engine and stored in Parquet format.
    >
    >
    > 2. In late April of 2020, the query module for the new storage type is
    > scheduled to be ready; a happy path for cube creation, building and query
    > will be available then.
    >
    >
    > 3. In May or June of 2020, a Beta version (Kylin 4.0?) will be released.
    >
    >
    >
    > Best regards,
    >
    >
    >
    > Ni Chunen / George
    >
    >
    >
    > On 01/20/2020 16:00, ShaoFeng Shi <[email protected]> wrote:
    > Hi, Chun en,
    >
    > Thanks for the information. What's the detailed release plan of this
    > feature to the community?
    >
    > Best regards,
    >
    > Shaofeng Shi 史少锋
    > Apache Kylin PMC
    > Email: [email protected]
    >
    > Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
    > Join Kylin user mail group: [email protected]
    > Join Kylin dev mail group: [email protected]
    >
    >
    >
    >
    > Xiaoxiang Yu <[email protected]> wrote on Mon, Jan 20, 2020 at 1:59 PM:
    >
    > Great news!
    > I can foresee Kylin becoming more cloud-native once the Parquet storage
    > matures. I hope the development team will share more details of its
    > design.
    >
    >
    >
    >
    > --
    >
    > Best wishes to you!
    > From: Xiaoxiang Yu
    >
    >
    >
    > At 2020-01-19 22:22:30, "George Ni" <[email protected]> wrote:
    > Hi Kylin users & developers,
    >
    > By-layer Spark Cubing was introduced into Apache Kylin in v2.0 to
    > achieve better performance, and it does run much faster than the MR
    > engine. HBase has been Kylin’s trusted storage engine since Kylin was
    > born, and it has proven successful at handling highly concurrent
    > queries on extremely large data sets with low latency. But HBase also
    > has limitations: filtering is not flexible, because we can only filter
    > by RowKey, and measures are usually combined together, which causes
    > more data to be scanned than requested.
    >
    >
    >
    > So, in order to optimize Kylin in both building strategy and storage
    > engine, the Kyligence development team is introducing a new cube building
    > engine that uses Spark SQL to construct cuboids with a new strategy and
    > stores cube results in Parquet files. The building strategy allows Kylin to
    > build cuboids in a smarter way by choosing and building on the optimal
    > cuboid source. Parquet, a columnar storage format available to any project
    > in the Hadoop ecosystem, will power the filtering ability with its
    > page-level column index and reduce I/O by saving measures in different
    > columns. Also, by storing cuboids in Parquet instead of HBase, we can run
    > Kylin in a cloud-native way. More information on the design and technical
    > details will come soon.
    >
    >
    >
    > Below is the comparison in building duration and size of results between
    > By-layer Spark Cubing and the new cubing strategy.
    >
    >
    >
    > Environment
    >
    > 4-node Hadoop cluster
    >
    > YARN has 400GB RAM and 128 cores in total;
    >
    > CDH 5.1, Apache Kylin 3.0.
    >
    >
    >
    > Spark
    >
    > Spark 2.4.1-kylin-r17
    >
    >
    >
    > Test Data
    >
    > SSB data
    >
    > Cube: 15 dimensions, 3 measures (SUM)
    >
    >
    >
    > Test Scenarios
    >
    > Build the cube at different source size levels: 30 million and 60 million
    > source rows. Compare the build time of Spark (by layer) + HBase with that
    > of SparkSql + Parquet.
    >
    >
    > Besides, we are attempting to resolve many drawbacks of the current query
    > engine, which relies heavily on Apache Calcite, such as the performance
    > bottleneck in aggregating large query results, which currently can only be
    > done by a single worker. By embracing SparkSql, this kind of expensive
    > computation can be distributed. Combined with the Parquet format, plenty of
    > filtering optimizations can also be applied, which will boost Kylin’s query
    > performance significantly. The features will be open sourced along with
    > technical details in the near future.
    >
    >
    >
    > - https://issues.apache.org/jira/browse/KYLIN-4188
    >
    >
    > --
    >
    > ---------------------
    >
    > Best regards,
    >
    >
    >
    > Ni Chunen / George
    >
    >
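
The announcement quoted above says the new engine builds cuboids "by choosing and building on the optimal cuboid source" but does not spell out the selection rule. The following is only a minimal sketch of one plausible reading, assuming "optimal" means the already-built cuboid with the fewest rows whose dimensions cover the target. It is not Kylin's code: the names `pick_source`, `build_cuboid` and `build_all` and the toy data are hypothetical, and only a SUM measure is handled, as in the test cube described in the thread.

```python
from collections import defaultdict
from itertools import combinations

# Toy base cuboid: key = (year, region, brand), value = SUM measure.
# Hypothetical data, for illustration only.
BASE_DIMS = ("year", "region", "brand")
BASE_ROWS = {
    ("2019", "EU", "A"): 10,
    ("2019", "EU", "B"): 5,
    ("2019", "US", "A"): 7,
    ("2020", "US", "A"): 4,
    ("2020", "US", "B"): 6,
}


def build_cuboid(rows, source_dims, target_dims):
    """Roll a source cuboid up to target_dims by summing the measure."""
    idx = [source_dims.index(d) for d in target_dims]
    out = defaultdict(int)
    for key, total in rows.items():
        out[tuple(key[i] for i in idx)] += total
    return dict(out)


def pick_source(built, target_dims):
    """Assumed rule: among built cuboids covering target_dims, take the smallest."""
    candidates = [dims for dims in built if set(target_dims) <= set(dims)]
    return min(candidates, key=lambda dims: len(built[dims]))


def build_all(base_rows, base_dims):
    """Build every non-empty cuboid, widest first, each from its cheapest source."""
    built = {base_dims: base_rows}
    sources = {}
    for size in range(len(base_dims) - 1, 0, -1):
        for target in combinations(base_dims, size):
            src = pick_source(built, target)
            built[target] = build_cuboid(built[src], src, target)
            sources[target] = src
    return built, sources


built, sources = build_all(BASE_ROWS, BASE_DIMS)
print(sources[("year",)])  # ('year', 'region'): 3 rows, smaller than the 5-row base
print(built[("year",)])
```

In this toy run the `(year,)` cuboid is aggregated from the 3-row `(year, region)` cuboid rather than from the 5-row base cuboid, which is the kind of saving the announcement hints at. SUM can be re-aggregated from any covering cuboid; measures that cannot be re-aggregated this way would need different handling.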
