You are welcome, ShaoFeng! Storage and the query engine are inseparable
and should be designed together so that each can take full advantage of
the other. I'm very excited about the upcoming columnar storage and
query engine!


-- 


Regards!

Aron Tao


ShaoFeng Shi <shaofeng...@apache.org> wrote on Fri, Oct 26, 2018 at 10:28 PM:

> Exactly; thank you, JiaTao, for the comments!
>
> JiaTao Tao <taojia...@gmail.com> wrote on Thu, Oct 25, 2018 at 6:12 PM:
>
> > As far as I'm concerned, using Parquet as Kylin's storage format is
> > quite appropriate. From the perspective of integrating Spark, Spark
> > has made many optimizations for Parquet; for example, we can benefit
> > from Spark's vectorized reading and lazy dictionary decoding.
> >
> >
> > And here are my thoughts about integrating Spark with our query
> > engine. As Shaofeng mentioned, each cuboid is a Parquet file; you can
> > think of it as a small table that we can read as a DataFrame and
> > query directly with Spark, a bit like this:
> >
> > ss.read.parquet("path/to/CuboidFile").filter("xxx").agg("xxx").select("xxx")
> >
> > (We need to implement some of Kylin's advanced aggregations; for
> > Kylin's basic aggregations like sum/min/max, we can use Spark's
> > directly.)
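> >
> > A minimal sketch of the idea in Spark/Scala (the path and the column
> > names dt/seller_id/price are hypothetical, purely for illustration):
> >
> >   import org.apache.spark.sql.SparkSession
> >   import org.apache.spark.sql.functions._
> >
> >   val ss = SparkSession.builder().appName("cuboid-query").getOrCreate()
> >
> >   // Read one cuboid (one Parquet file/directory) as a DataFrame.
> >   val cuboid = ss.read.parquet("/kylin/cube1/cuboid_1111")
> >
> >   // Then filter, aggregate and project it like any small table.
> >   val result = cuboid
> >     .filter(col("dt") === "2018-10-01")
> >     .groupBy("seller_id")
> >     .agg(sum("price").as("total_price"), max("price").as("max_price"))
> >     .select("seller_id", "total_price", "max_price")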
> >
> >
> >
> > *Compared to our old query engine, the advantages are as follows:*
> >
> >
> >
> > 1. It is distributed! Our old query engine pulls all the data into a
> > single query node and then calculates; that node is a single point of
> > failure and often runs out of memory (OOM) on large amounts of data.
> >
> >
> >
> > 2. It is simple and easy to debug (every step is clear and
> > transparent): you can collect the data after every single phase
> > (filter/aggregation/projection, etc.), so you can easily check which
> > operation/phase went wrong. Our old query engine uses Calcite for
> > post-calculation, which makes pinpointing problems difficult,
> > especially when code generation is involved, and you cannot insert
> > your own logic during the computation.
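> >
> > For example (a hypothetical debugging session, reusing `ss`, `col`
> > and the `cuboid` DataFrame from the sketch above):
> >
> >   // Materialize each phase separately to see where things go wrong.
> >   val filtered = cuboid.filter(col("dt") === "2018-10-01")
> >   filtered.show(10)        // inspect the rows after the filter phase
> >
> >   val aggregated = filtered.groupBy("seller_id").agg(sum("price"))
> >   aggregated.explain(true) // inspect the logical and physical plans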
> >
> >
> >
> > 3. We can fully enjoy all the efforts Spark has made to optimize
> > performance, e.g. Catalyst and Tungsten.
> >
> >
> >
> > 4. It is easy to unit test: you can test every step separately, which
> > reduces the testing granularity of Kylin's query engine.
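> >
> > For example (a hypothetical test against a tiny in-memory DataFrame,
> > exercising only the aggregation phase; assumes the imports and the
> > `ss` session from the sketch above):
> >
> >   import ss.implicits._
> >
> >   val input = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("seller_id", "price")
> >   val result = input.groupBy("seller_id").agg(sum("price").as("total"))
> >
> >   // The sum for seller "a" should be 3.
> >   assert(result.filter(col("seller_id") === "a").head.getLong(1) == 3L)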
> >
> >
> >
> > 5. Thanks to Spark's DataSource API, we can swap Parquet for other
> > data formats easily.
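> >
> > For instance (an illustration only; "orc" is just one of Spark's
> > built-in formats, and the path is hypothetical):
> >
> >   // Only the format string changes; the rest of the pipeline stays.
> >   val cuboidOrc = ss.read.format("orc").load("/kylin/cube1/cuboid_1111")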
> >
> >
> >
> > 6. Many tools built on top of Spark, such as machine-learning
> > libraries, can be integrated with us directly.
> >
> >
> >
> > =====================================================================
> >
> > Hi Kylin developers,
> >
> > HBase has been Kylin's storage engine since the first day; Kylin on
> > HBase has proven a success, supporting low-latency and
> > high-concurrency queries at a very large data scale. Thanks to HBase,
> > most Kylin users get query responses of less than one second on
> > average.
> >
> >
> >
> > But we also see some limitations when putting Cubes into HBase; I
> > shared some of them at HBaseConf Asia 2018[1] this August. The
> > typical limitations include:
> >
> >    - The row key is the primary index, and there is no secondary
> >      index so far.
> >
> >      Filtering by the row key's prefix versus its suffix can give
> >      very different performance, so the user needs to design the row
> >      key well; otherwise, queries will be slow. This is sometimes
> >      difficult because the user may not be able to predict the
> >      filtering patterns ahead of the cube design.
> >
> >
> >
> >    - HBase is a key-value store instead of columnar storage.
> >
> >      Kylin combines multiple measures (columns) into fewer column
> >      families to reduce the data size (the row key overhead is
> >      considerable). This often causes HBase to read more data than a
> >      query actually requests.
> >
> >
> >
> >    - HBase cannot run on YARN.
> >
> >      This makes deployment and auto-scaling somewhat complicated,
> >      especially in the cloud.
> >
> >
> >
> > In a word, HBase is complicated as Kylin's storage; maintenance and
> > debugging are also hard for ordinary developers. Now we're planning
> > to seek a simple, lightweight, read-only storage engine for Kylin.
> > The new solution should have the following characteristics:
> >
> >    - Columnar layout with compression for efficient I/O;
> >    - An index on each column for quick filtering and seeking;
> >    - MapReduce / Spark APIs for parallel processing;
> >    - HDFS compliance for scalability and availability;
> >    - Mature, stable and extensible.
> >
> >
> >
> > With the plugin architecture[2] introduced in Kylin 1.5, adding
> > multiple storages to Kylin is possible. Some companies, such as
> > Kyligence Inc. and Meituan.com, have developed customized storage
> > engines for Kylin in their products or platforms. In their
> > experience, columnar storage is a good supplement to the HBase
> > engine. Kaisen Kang from Meituan.com shared their KOD (Kylin on
> > Druid) solution[3] at this August's Kylin meetup in Beijing.
> >
> >
> >
> > We plan to do a PoC with Apache Parquet + Apache Spark in the next
> > phase. Parquet is a standard columnar file format and is widely
> > supported by many projects such as Hive, Impala, and Drill; it is
> > also adding a page-level column index to support fine-grained
> > filtering. Apache Spark can provide parallel computing over Parquet
> > and can be deployed on YARN/Mesos and Kubernetes. With this
> > combination, data persistence and computation are separated, which
> > makes scaling in and out much easier than before. Benefiting from
> > Spark's flexibility, we can also push more computation down from
> > Kylin to the Hadoop cluster. Besides Parquet, Apache ORC is also a
> > candidate.
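> >
> > To illustrate what the columnar format buys us (a hedged Spark/Scala
> > sketch; the path and the column names dt/price are hypothetical):
> > Spark prunes unused columns and pushes filters down to the Parquet
> > reader, which can be verified in the query plan:
> >
> >   import org.apache.spark.sql.SparkSession
> >   import org.apache.spark.sql.functions.col
> >
> >   val ss = SparkSession.builder().getOrCreate()
> >   val df = ss.read.parquet("/kylin/cube1/cuboid_1111")
> >   df.filter(col("dt") > "2018-10-01").select("price").explain()
> >   // The plan should show PushedFilters for "dt" and a ReadSchema
> >   // restricted to the "dt" and "price" columns.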
> >
> >
> >
> > Now I raise this discussion to get your ideas about Kylin's
> > next-generation storage engine. If you have good ideas or any
> > related data, you are welcome to discuss them in the community.
> >
> > Thank you!
> >
> >
> >
> > [1] Apache Kylin on HBase
> >     https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
> > [2] Apache Kylin Plugin Architecture
> >     https://kylin.apache.org/development/plugin_arch.html
> > [3] Kylin storage engine practice based on Druid (基于Druid的Kylin存储引擎实践)
> >     https://blog.bcmeng.com/post/kylin-on-druid.html
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi 史少锋
> >
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
