Re: [DISCUSS] Columnar storage engine for Apache Kylin

ShaoFeng Shi Mon, 01 Oct 2018 20:20:43 -0700

Hi Billy,

Yes, cloud storage should be considered. The traditional file layouts
on HDFS may not work well on cloud storage, so Kylin needs to allow extension
here. I will add this to the requirements.


On Sat, Sep 29, 2018 at 3:22 PM, Billy Liu <billy...@apache.org> wrote:

> Hi Shaofeng,
>
> I'd like to add one more characteristic: cloud-native storage support.
> Quite a few users are using S3 on AWS, or Azure Data Lake Storage on
> Azure. If the new storage engine is more cloud friendly, more users
> could benefit from it.
>
> With Warm regards
>
> Billy Liu
> On Fri, Sep 28, 2018 at 2:15 PM, ShaoFeng Shi <shaofeng...@apache.org> wrote:
> >
> > Hi Kylin developers.
> >
> > HBase has been Kylin’s storage engine since the first day; Kylin on HBase
> > has proven successful at supporting low-latency, high-concurrency
> > queries at very large data scale. Thanks to HBase, most Kylin users
> > see an average query response time of less than 1 second.
> >
> > But we also see some limitations when putting Cubes into HBase; I shared
> > some of them at HBaseConf Asia 2018[1] this August. The typical
> > limitations include:
> >
> >    - Rowkey is the primary index, no secondary index so far;
> >
> > Filtering by a row key’s prefix and by its suffix can give very different
> > performance, so the user needs to design the row key carefully; otherwise,
> > queries will be slow. This is sometimes difficult because the user might
> > not be able to predict the filtering patterns ahead of cube design.
> >
> >    - HBase is a key-value store rather than a columnar store
> >
> > Kylin combines multiple measures (columns) into fewer column families to
> > reduce data size (the row key size is significant). As a result, HBase
> > often needs to read more data than was requested.
> >
> >    - HBase cannot run on YARN
> >
> > This makes deployment and auto-scaling somewhat complicated, especially
> > in the cloud.
> >
> > In short, HBase is complicated to use as Kylin’s storage. Maintenance and
> > debugging are also hard for ordinary developers. We’re now planning to look
> > for a simple, lightweight, read-only storage engine for Kylin. The new
> > solution should have the following characteristics:
> >
> >    - Columnar layout with compression for efficient I/O;
> >    - Index by each column for quick filtering and seeking;
> >    - MapReduce / Spark API for parallel processing;
> >    - HDFS compliant for scalability and availability;
> >    - Mature, stable and extensible;
> >
> > With the plugin architecture[2] introduced in Kylin 1.5, adding multiple
> > storage engines to Kylin is possible. Some companies, such as Kyligence Inc
> > and Meituan.com, have developed customized storage engines for Kylin in
> > their products or platforms. In their experience, columnar storage is a
> > good supplement to the HBase engine. Kaisen Kang from Meituan.com shared
> > their KOD (Kylin on Druid) solution[3] at this August’s Kylin meetup in
> > Beijing.
> >
> > We plan to do a PoC with Apache Parquet + Apache Spark in the next phase.
> > Parquet is a standard columnar file format and is widely supported by
> > many projects such as Hive, Impala, and Drill. Parquet is adding a
> > page-level column index to support fine-grained filtering. Apache Spark can
> > provide parallel computing over Parquet and can be deployed on
> > YARN, Mesos, and Kubernetes. With this combination, data persistence and
> > computation are separated, which makes scaling in and out much easier than
> > before. Thanks to Spark's flexibility, we can also push down more
> > computation from Kylin to the Hadoop cluster. Besides Parquet, Apache
> > ORC is also a candidate.
> >
> > I am raising this discussion to get your ideas about Kylin’s next-generation
> > storage engine. If you have good ideas or any related data, you are welcome
> > to discuss them in the community.
> >
> > Thank you!
> >
> > [1] Apache Kylin on HBase
> > https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
> > [2] Apache Kylin Plugin Architecture
> > https://kylin.apache.org/development/plugin_arch.html
> > [3] Kylin storage engine practice based on Druid (基于Druid的Kylin存储引擎实践)
> > https://blog.bcmeng.com/post/kylin-on-druid.html
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi 史少锋
>
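
To make the row key limitation in the quoted proposal concrete, here is a toy
sketch in plain Python. It is not HBase code: the two-dimension key layout, the
data sizes, and the function names are all assumptions for illustration. It
only shows why a filter on the leading part of a sorted key can seek to a
narrow range, while a filter on the trailing part has to examine every row.

```python
from bisect import bisect_left
from itertools import product

# Hypothetical composite row keys (dim_a, dim_b), kept sorted the way a
# row-key-ordered table would store them. 100 x 100 = 10,000 rows.
rows = sorted(product(range(100), range(100)))

def scan_by_prefix(sorted_rows, a):
    """Filter on the leading dimension: binary-search to the range, read only it."""
    start = bisect_left(sorted_rows, (a, -1))
    end = bisect_left(sorted_rows, (a + 1, -1))
    return sorted_rows[start:end], end - start  # (matches, rows touched)

def scan_by_suffix(sorted_rows, b):
    """Filter on the trailing dimension: ordering does not help, so scan everything."""
    matches = [r for r in sorted_rows if r[1] == b]
    return matches, len(sorted_rows)  # (matches, rows touched)

prefix_hits, prefix_touched = scan_by_prefix(rows, 42)
suffix_hits, suffix_touched = scan_by_suffix(rows, 42)
print(len(prefix_hits), prefix_touched)
print(len(suffix_hits), suffix_touched)
```

Both filters return the same number of matching rows, but the suffix filter
touches the whole table. A per-column index, as listed in the requirements,
is one way to avoid depending on the key order.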


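The "read more data than requested" point can be sketched in a similar way.
This is only a rough model under stated assumptions: three 8-byte measures per
row, one packed cell per row standing in for a shared column family, and one
buffer per measure standing in for a columnar layout. The names and sizes are
hypothetical and do not reflect Kylin's or HBase's actual encoding.

```python
import struct

# Hypothetical cube rows, each with three 8-byte integer measures.
rows = [(i, i * 2, i * 3) for i in range(1000)]

# Key-value style: all measures of a row are packed into one cell value.
packed_cells = [struct.pack("<3q", *r) for r in rows]

# Columnar style: one contiguous buffer per measure.
columns = [
    struct.pack(f"<{len(rows)}q", *(r[c] for r in rows)) for c in range(3)
]

def sum_measure_packed(cells, idx):
    """Sum one measure; every packed cell must be read in full."""
    total, bytes_read = 0, 0
    for cell in cells:
        bytes_read += len(cell)
        total += struct.unpack("<3q", cell)[idx]
    return total, bytes_read

def sum_measure_columnar(cols, idx):
    """Sum one measure; only the requested column buffer is read."""
    buf = cols[idx]
    values = struct.unpack(f"<{len(buf) // 8}q", buf)
    return sum(values), len(buf)

packed_total, packed_bytes = sum_measure_packed(packed_cells, 1)
columnar_total, columnar_bytes = sum_measure_columnar(columns, 1)
print(packed_total, packed_bytes)
print(columnar_total, columnar_bytes)
```

In this model the packed layout reads three times as many bytes to answer a
query on a single measure, which is the gap a columnar format such as Parquet
or ORC is meant to close.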
-- 
Best regards,

Shaofeng Shi 史少锋
