> As far as I'm concerned, Parquet is a good fit as Kylin's storage format.
> From the Spark integration side, Spark has made a lot of optimizations for
> Parquet, e.g. we can benefit from Spark's vectorized reading and lazy
> dictionary decoding.
> Here are my thoughts on integrating Spark with our query engine. As
> Shaofeng mentioned, a cuboid is a Parquet file. You can think of it as a
> small table, so we can read the cuboid directly as a DataFrame and query it
> with Spark, a bit like this:
> ss.read.parquet("path/to/CuboidFile").filter("xxx").agg("xxx").select("xxx")
> (We need to implement some of Kylin's advanced aggregations ourselves; for
> Kylin's basic aggregations such as sum/min/max, we can use Spark's directly.)
> *Compared to our old query engine, the advantages are as follows:*
> 1. It is distributed! Our old query engine pulls all data into a single
> query node and computes there, so that node is a single point of failure
> and often runs out of memory (OOM) on large amounts of data.
> 2. It is simple and easy to debug (every step is clear and transparent).
> You can collect data after every single phase (filter, aggregation,
> projection, etc.), so you can easily check which operation or phase went
> wrong. Our old query engine uses Calcite for post-calculation, which makes
> it difficult to pinpoint problems, especially those related to code
> generation, and you cannot insert your own logic during computation.
> 3. We can fully benefit from all the work Spark has done on performance
> optimization, e.g. Catalyst and Tungsten.
> 4. It is easy to unit test: every step can be tested separately, which
> makes the testing granularity of Kylin's query engine finer.
> 5. Thanks to Spark's DataSource API, we can swap Parquet for another data
> format easily.
> 6. Many upstream tools for Spark, such as machine learning tools, can be
> integrated with us directly.
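
To make points 2 and 4 above a bit more concrete, here is a minimal plain-Python sketch (no Spark involved) of the filter / aggregation / projection phases written as separate functions, so that each intermediate result can be collected and unit tested on its own. The rows, column names and function names are all made up for illustration; in the proposed engine these steps would presumably be DataFrame operations on the cuboid file rather than list comprehensions.

```python
from collections import defaultdict

# Hypothetical cuboid rows: two dimensions (city, year) and one measure (price).
CUBOID = [
    {"city": "Beijing", "year": 2017, "price": 10},
    {"city": "Beijing", "year": 2018, "price": 20},
    {"city": "Shanghai", "year": 2018, "price": 5},
    {"city": "Shanghai", "year": 2018, "price": 7},
]


def filter_phase(rows, year):
    """Keep only the rows for the requested year."""
    return [r for r in rows if r["year"] == year]


def aggregate_phase(rows):
    """Sum the price measure per city."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["city"]] += r["price"]
    return [{"city": c, "sum_price": s} for c, s in sorted(totals.items())]


def project_phase(rows, columns):
    """Keep only the requested output columns."""
    return [{c: r[c] for c in columns} for r in rows]


# Each intermediate result is an ordinary list that can be inspected or asserted on.
filtered = filter_phase(CUBOID, 2018)
aggregated = aggregate_phase(filtered)
result = project_phase(aggregated, ["sum_price"])
```

Because every phase takes rows in and returns rows out, a wrong result can be traced by looking at `filtered` and `aggregated` separately instead of debugging one generated blob of code.
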
>  Hi Kylin developers.
>     HBase has been Kylin’s storage engine since the first day; Kylin on
>     HBase has been verified as a success which can support low latency &
>     high concurrency queries on a very large data scale. Thanks to HBase,
>     most Kylin users can get on average less than 1-second query response.
>     But we also see some limitations when putting Cubes into HBase; I
>     shared some of them at HBaseConf Asia 2018[1] this August. The typical
>     limitations include:
>        - Rowkey is the primary index, with no secondary index so far.
>     Filtering by the row key’s prefix and by its suffix can give very
>     different performance, so the user needs to design the row key
>     carefully; otherwise, queries will be slow. This is sometimes difficult
>     because the user might not be able to predict the filtering patterns
>     before designing the cube.
>        - HBase is a key-value store, not a columnar store.
>     Kylin combines multiple measures (columns) into fewer column families
>     to reduce data size (the row key size is significant). As a result,
>     HBase often needs to read more data than requested.
>        - HBase can't run on YARN.
>     This makes deployment and auto-scaling a little complicated, especially
>     in the cloud.
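
The prefix versus suffix point in the first limitation can be sketched in a few lines of plain Python. This is only a toy model of a key-ordered store, not HBase's actual scan API; the row key layout `<city>|<year>` and the function names are hypothetical. A prefix filter can seek straight to the matching range in the sorted keys, while a suffix filter gets no help from the sort order and has to examine every key.

```python
from bisect import bisect_left

# Hypothetical row keys of the form "<city>|<year>", kept sorted as in a key-ordered store.
ROW_KEYS = sorted(
    f"{city}|{year}"
    for city in ("Beijing", "Hangzhou", "Shanghai", "Shenzhen")
    for year in range(2010, 2020)
)


def scan_by_prefix(keys, prefix):
    """Seek to the first key with the prefix, then read until the prefix stops matching."""
    start = bisect_left(keys, prefix)
    matches, examined = [], 0
    for key in keys[start:]:
        examined += 1
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches, examined


def scan_by_suffix(keys, suffix):
    """The sort order does not help here, so every key has to be examined."""
    matches = [key for key in keys if key.endswith(suffix)]
    return matches, len(keys)


by_city, examined_prefix = scan_by_prefix(ROW_KEYS, "Shanghai|")
by_year, examined_suffix = scan_by_suffix(ROW_KEYS, "|2018")
```

With these 40 keys, the prefix scan touches only the keys for one city plus the first non-matching key, whereas the suffix scan touches all 40 keys to find 4 matches. That gap is why the order of dimensions in the row key matters so much.
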
>     In short, HBase is a complicated choice for Kylin’s storage, and
>     maintenance and debugging are hard for ordinary developers. We’re now
>     planning to look for a simple, lightweight, read-only storage engine
>     for Kylin. The new solution should have the following characteristics:
>        - Columnar layout with compression for efficient I/O;
>        - Index by each column for quick filtering and seeking;
>        - MapReduce / Spark API for parallel processing;
>        - HDFS compliant for scalability and availability;
>        - Mature, stable and extensible;
>     With the plugin architecture[2] introduced in Kylin 1.5, adding
>     multiple storage engines to Kylin is possible. Some companies, such as
>     Kyligence Inc and Meituan.com, have developed customized storage
>     engines for Kylin in their products or platforms. In their experience,
>     columnar storage is a good supplement to the HBase engine. Kaisen Kang
>     from Meituan.com shared their KOD (Kylin on Druid) solution[3] at this
>     August’s Kylin meetup in Beijing.
>     We plan to do a PoC with Apache Parquet + Apache Spark in the next
>     phase. Parquet is a standard columnar file format that is widely
>     supported by many projects such as Hive, Impala, and Drill. Parquet is
>     adding a page-level column index to support fine-grained filtering.
>     Apache Spark provides parallel computing over Parquet and can be
>     deployed on YARN, Mesos, and Kubernetes. With this combination, data
>     persistence and computation are separated, which makes scaling in and
>     out much easier than before. Thanks to Spark's flexibility, we can also
>     push down more computation from Kylin to the Hadoop cluster. Besides
>     Parquet, Apache ORC is also a candidate.
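
As a rough illustration of what a page-level index buys for fine-grained filtering, here is a simplified plain-Python model of min/max page skipping. It is not Parquet's on-disk layout or reader API; the page contents, `COLUMN_INDEX` and `read_range` are hypothetical names chosen for this sketch. The idea is that a small per-page (min, max) entry lets the reader skip pages that cannot contain a match.

```python
# Hypothetical column chunk split into pages. The values are sorted across pages,
# which is the case where min/max statistics are most selective.
PAGES = [
    [1, 3, 5, 8],
    [9, 12, 15, 18],
    [20, 22, 27, 30],
    [31, 35, 38, 40],
]

# The "column index" of this sketch: one (min, max) entry per page, built at write time.
COLUMN_INDEX = [(min(page), max(page)) for page in PAGES]


def read_range(pages, index, low, high):
    """Return values in [low, high], reading only pages whose min/max range overlaps."""
    values, pages_read = [], 0
    for page, (page_min, page_max) in zip(pages, index):
        if page_max < low or page_min > high:
            continue  # the page cannot contain a match, so skip it
        pages_read += 1
        values.extend(v for v in page if low <= v <= high)
    return values, pages_read


values, pages_read = read_range(PAGES, COLUMN_INDEX, 10, 21)
```

In this example only two of the four pages are read for the range 10 to 21. On unsorted data the per-page ranges overlap more and fewer pages can be skipped, so how the cuboid rows are sorted would likely matter for this to pay off.
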
>     I'm raising this discussion to get your ideas about Kylin’s
>     next-generation storage engine. If you have good ideas or any related
>     data, you are welcome to discuss them in the community.
>     Thank you!
>     [1] Apache Kylin on HBase
>     https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
>     [2] Apache Kylin Plugin Architecture
>     https://kylin.apache.org/development/plugin_arch.html
>     [3] 基于Druid的Kylin存储引擎实践 (A Druid-based Kylin storage engine in practice)
>     https://blog.bcmeng.com/post/kylin-on-druid.html
