Re: [DISCUSS] Columnar storage engine for Apache Kylin

Luke Han Sun, 07 Oct 2018 04:44:42 -0700

It makes sense to bring a better storage option to Kylin.

The option should be open, so that people can create an adaptor for the
underlying storage in different ways.
Since the large base of Kylin adopters today all run on Hadoop/HDFS, I
prefer Parquet, ORC, or another HDFS-compatible option at this time. That
will make it easy for people to upgrade to the next generation and keep
consistency.


Looking forward to this feature being rolled out soon.

Thanks.



Best Regards!
---------------------

Luke Han


On Wed, Oct 3, 2018 at 2:37 PM Li Yang <[email protected]> wrote:

> Love this discussion. I'd like to highlight the 3 major roles HBase
> currently plays, so we don't miss any of them when looking for a replacement.
>
> 1) Storage: A high speed big data storage
> 2) Cache: A distributed storage cache layer (was BlockCache)
> 3) MPP: A distributed computation framework (was Coprocessor)
>
> The "Storage" role seems to be at the center of the discussion. Be it
> Parquet, ORC, or a new file format, to me the standard interface is the most
> important part. As long as we have consensus on the access interface, like
> MapReduce / Spark Dataset, the rest of the debate can be easily resolved by
> a fair benchmark. It also allows people with different preferences to keep
> their own implementation under the standard interface without impacting the
> rest of Kylin.
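
To make the "standard interface" idea concrete, here is a minimal sketch in Python. The names `CubeStorage`, `InMemoryStorage`, `write`, and `scan` are hypothetical and chosen only for illustration; they are not Kylin APIs, and the real access interface under discussion would be MapReduce / Spark Dataset based.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


class CubeStorage(ABC):
    """Hypothetical access interface; not an actual Kylin API."""

    @abstractmethod
    def write(self, rows: Iterable[Row]) -> None:
        """Persist rows in whatever format the implementation prefers."""

    @abstractmethod
    def scan(self, columns: List[str], filters: Dict[str, object]) -> Iterator[Row]:
        """Yield the requested columns of rows matching all equality filters."""


class InMemoryStorage(CubeStorage):
    """Toy implementation standing in for a Parquet- or ORC-backed one."""

    def __init__(self) -> None:
        self._rows: List[Row] = []

    def write(self, rows: Iterable[Row]) -> None:
        self._rows.extend(dict(row) for row in rows)

    def scan(self, columns: List[str], filters: Dict[str, object]) -> Iterator[Row]:
        for row in self._rows:
            if all(row.get(name) == value for name, value in filters.items()):
                yield {name: row[name] for name in columns}


store = InMemoryStorage()
store.write([{"region": "cn", "sales": 10}, {"region": "eu", "sales": 7}])
result = list(store.scan(["sales"], {"region": "cn"}))
```

Under this assumption, a benchmark only needs to call `write` and `scan`, so competing implementations can be compared without touching the rest of the system.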
>
> The "Cache" and "MPP" roles were more or less overlooked. I suggest we pay
> more attention to them. Apart from Spark and Alluxio, are there any other
> alternatives? Actually, Druid is a well-rounded choice because, like HBase,
> it covers all 3 roles pretty well.
>
> In general, I prefer to choose from the state of the art instead of
> reinventing it. Indeed, Kylin is not a storage project, and a new storage
> format is not Kylin's mission. Any storage innovations we come across here
> would be more beneficial if contributed to the Parquet or ORC community.
>
> Regards
> Yang
>
>
>
> On Tue, Oct 2, 2018 at 11:20 AM ShaoFeng Shi <[email protected]>
> wrote:
>
> > Hi Billy,
> >
> > Yes, cloud storage should be considered. The traditional file layouts
> > on HDFS may not work well on cloud storage. Kylin needs to allow extension
> > here. I will add this to the requirements.
> >
> > On Sat, Sep 29, 2018 at 3:22 PM Billy Liu <[email protected]> wrote:
> >
> > > Hi Shaofeng,
> > >
> > > I'd like to add one more characteristic: cloud-native storage support.
> > > Quite a few users are using S3 on AWS, or Azure Data Lake Storage on
> > > Azure. If the new storage engine is more cloud-friendly, more users
> > > could benefit from it.
> > >
> > > With Warm regards
> > >
> > > Billy Liu
> > > On Fri, Sep 28, 2018 at 2:15 PM ShaoFeng Shi <[email protected]> wrote:
> > > >
> > > > Hi Kylin developers.
> > > >
> > > > HBase has been Kylin’s storage engine since the first day; Kylin on
> > > > HBase has proven successful, supporting low-latency, high-concurrency
> > > > queries on very large data sets. Thanks to HBase, most Kylin users
> > > > get query responses in under 1 second on average.
> > > >
> > > > But we also see some limitations when putting Cubes into HBase; I
> > > > shared some of them at HBaseCon Asia 2018[1] this August. The typical
> > > > limitations include:
> > > >
> > > >    - Rowkey is the primary index, no secondary index so far;
> > > >
> > > > Filtering by the row key’s prefix versus its suffix can give very
> > > > different performance. So the user needs to design the row key well;
> > > > otherwise, queries will be slow. This is sometimes difficult because
> > > > the user might not be able to predict the filtering patterns ahead of
> > > > cube design.
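
The prefix/suffix gap described above can be illustrated with a small Python sketch. It is a toy model of a key-ordered store, not HBase code: the row key layout `<region>|<day>` and the function names are assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical row keys: two dimensions concatenated as "<region>|<day>",
# kept in sorted order the way a key-ordered store keeps them.
ROWS = sorted(
    f"{region}|{day:02d}"
    for region in ("cn", "eu", "us")
    for day in range(1, 31)
)


def scan_by_prefix(rows, prefix):
    """Seek to the first candidate key, then read until the prefix stops matching."""
    start = bisect_left(rows, prefix)
    touched, hits = 0, []
    for key in rows[start:]:
        touched += 1
        if not key.startswith(prefix):
            break
        hits.append(key)
    return hits, touched


def scan_by_suffix(rows, suffix):
    """Key ordering does not help here, so every key has to be examined."""
    hits = [key for key in rows if key.endswith(suffix)]
    return hits, len(rows)


prefix_hits, prefix_touched = scan_by_prefix(ROWS, "eu|")
suffix_hits, suffix_touched = scan_by_suffix(ROWS, "|15")
```

In this model the prefix filter touches only the matching range plus one key, while the suffix filter examines all 90 keys to return 3, which is why the order of dimensions in the row key matters.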
> > > >
> > > >    - HBase is a key-value store, not a columnar one
> > > >
> > > > Kylin combines multiple measures (columns) into fewer column families
> > > > to reduce data size (the row key size is significant). This often
> > > > causes HBase to read more data than requested.
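
A rough accounting sketch in Python shows why packing measures together inflates reads. The measure names, byte sizes, and row count are invented for illustration, and the model ignores row key overhead and compression.

```python
# Hypothetical per-row sizes in bytes for three measures of one cube.
MEASURE_BYTES = {"sum_price": 8, "row_count": 8, "distinct_users": 512}
NUM_ROWS = 1000


def bytes_read_packed(requested):
    """All measures share one cell, so any request reads the full cell per row."""
    if not requested:
        return 0
    return NUM_ROWS * sum(MEASURE_BYTES.values())


def bytes_read_columnar(requested):
    """Each measure is stored separately, so only requested columns are read."""
    return NUM_ROWS * sum(MEASURE_BYTES[name] for name in requested)


packed = bytes_read_packed(["sum_price"])
columnar = bytes_read_columnar(["sum_price"])
```

With these assumed sizes, a query that needs only `sum_price` reads 528,000 bytes from the packed layout and 8,000 bytes from the columnar one.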
> > > >
> > > >    - HBase can't run on YARN
> > > >
> > > > This makes deployment and auto-scaling a little complicated,
> > > > especially in the cloud.
> > > >
> > > > In short, HBase is a complicated storage for Kylin. Maintenance and
> > > > debugging are also hard for ordinary developers. We are now planning
> > > > to seek a simple, lightweight, read-only storage engine for Kylin. The
> > > > new solution should have the following characteristics:
> > > >
> > > >    - Columnar layout with compression for efficient I/O;
> > > >    - Index by each column for quick filtering and seeking;
> > > >    - MapReduce / Spark API for parallel processing;
> > > >    - HDFS compliant for scalability and availability;
> > > >    - Mature, stable and extensible;
> > > >
> > > > With the plugin architecture[2] introduced in Kylin 1.5, adding
> > > > multiple storage engines to Kylin is possible. Some companies, like
> > > > Kyligence Inc and Meituan.com, have developed their own customized
> > > > storage engines for Kylin in their products or platforms. In their
> > > > experience, columnar storage is a good supplement to the HBase engine.
> > > > Kaisen Kang from Meituan.com shared their KOD (Kylin on Druid)
> > > > solution[3] at this August’s Kylin meetup in Beijing.
> > > >
> > > > We plan to do a PoC with Apache Parquet + Apache Spark in the next
> > > > phase. Parquet is a standard columnar file format and is widely
> > > > supported by many projects like Hive, Impala, Drill, etc. Parquet is
> > > > adding a page-level column index to support fine-grained filtering.
> > > > Apache Spark can provide parallel computing over Parquet and can be
> > > > deployed on YARN, Mesos, and Kubernetes. With this combination, data
> > > > persistence and computation are separated, which makes scaling in/out
> > > > much easier than before. Benefiting from Spark's flexibility, we can
> > > > also push down more computation from Kylin to the Hadoop cluster.
> > > > Besides Parquet, Apache ORC is also a candidate.
> > > >
> > > > I am raising this discussion to get your ideas about Kylin’s
> > > > next-generation storage engine. If you have good ideas or any related
> > > > data, you are welcome to discuss them in the community.
> > > >
> > > > Thank you!
> > > >
> > > > [1] Apache Kylin on HBase
> > > > https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
> > > > [2] Apache Kylin Plugin Architecture
> > > > https://kylin.apache.org/development/plugin_arch.html
> > > > [3] Practice of a Druid-based Kylin storage engine
> > > > https://blog.bcmeng.com/post/kylin-on-druid.html
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > > Shaofeng Shi 史少锋
> > >
> >
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi 史少锋
> >
>
