Exactly; thank you JiaTao for the comments!
JiaTao Tao <[email protected]> wrote on Thu, Oct 25, 2018 at 6:12 PM:
> As far as I'm concerned, using Parquet as Kylin's storage format is quite
> appropriate. From the perspective of Spark integration, Spark has made many
> optimizations for Parquet; for example, we can benefit from Spark's
> vectorized reading, lazy dictionary decoding, etc.
>
>
> And here are my thoughts about integrating Spark with our query engine. As
> Shaofeng mentioned, a cuboid is a Parquet file; you can think of it as a
> small table, and we can read the cuboid as a DataFrame directly, which can
> then be queried by Spark, a bit like this:
>
> ss.read.parquet("path/to/CuboidFile").filter("xxx").agg("xxx").select("xxx")
>
> (We need to implement some of Kylin's advanced aggregations; for Kylin's
> basic aggregations like sum/min/max, we can use Spark's directly.)
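>
> To make this concrete, here is a minimal Spark Scala sketch of the idea;
> the path and the column names (dt, city, price) are made up for
> illustration, not Kylin's real cuboid schema:
>
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.functions._
>
>   // Local master just for illustration.
>   val ss = SparkSession.builder()
>     .appName("cuboid-query").master("local[*]").getOrCreate()
>
>   // One cuboid = one Parquet file, readable as a small table.
>   val cuboid = ss.read.parquet("path/to/CuboidFile")
>
>   // Filter, aggregate, and project, all executed by Spark in a
>   // distributed fashion.
>   cuboid
>     .filter(col("dt") === "2018-10-25")        // pushed down to Parquet
>     .groupBy(col("city"))
>     .agg(sum(col("price")).as("total_price"))  // Spark's built-in sum
>     .select("city", "total_price")
>     .show()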
>
>
>
> *Compared to our old query engine, the advantages are as follows:*
>
>
>
> 1. It is distributed! Our old query engine pulls all data into a single
> query node and then calculates; that node is a single point of failure and
> often runs OOM on large amounts of data.
>
>
>
> 2. It is simple and easy to debug (every step is clear and transparent):
> you can collect data after every single phase, e.g.
> filter/aggregation/projection, so you can easily identify which
> operation/phase went wrong (see the first sketch after this list). Our old
> query engine uses Calcite for post-calculation, which makes pinpointing
> problems difficult, especially when code generation is involved, and you
> cannot insert your own logic during computation.
>
>
>
> 3. We can fully enjoy all the efforts Spark has made to optimize
> performance, e.g. Catalyst, Tungsten, etc.
>
>
>
> 4. It is easy to unit test: you can test every step separately (also shown
> in the first sketch below), which reduces the testing granularity of
> Kylin's query engine.
>
>
>
> 5. Thanks to Spark's DataSource API, we can switch from Parquet to other
> data formats easily (see the second sketch below).
>
>
>
> 6. Many tools built on top of Spark, such as machine learning tools, can
> be integrated with us directly.
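>
> To illustrate points 2 and 4, a small sketch of per-phase inspection and
> per-step testing; the data and column names are made up, and the local
> session is just for demonstration:
>
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.functions._
>
>   val ss = SparkSession.builder().master("local[2]").getOrCreate()
>   import ss.implicits._
>
>   val scanned = Seq(("2018-10-25", "beijing", 1L),
>                     ("2018-10-24", "shanghai", 2L)).toDF("dt", "city", "price")
>
>   // Every phase is its own DataFrame, so each can be inspected (point 2)...
>   val filtered   = scanned.filter(col("dt") === "2018-10-25")
>   val aggregated = filtered.groupBy("city").agg(sum("price").as("total"))
>   filtered.show()
>   aggregated.show()
>
>   // ...and unit-tested in isolation (point 4), with no cube or cluster.
>   assert(filtered.count() == 1)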
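>
> And for point 5, switching storage formats via the DataSource API is a
> one-line change (reusing ss from the sketch above; the ORC path is
> hypothetical):
>
>   // Only the format string differs; the query code stays identical.
>   val parquetCuboid = ss.read.format("parquet").load("path/to/CuboidFile")
>   val orcCuboid     = ss.read.format("orc").load("path/to/OrcCuboid")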
>
>
>
> ======================================================================
>
> Hi Kylin developers.
>
>
>
> HBase has been Kylin's storage engine since the first day; Kylin on HBase
> has been verified as a success that can support low-latency and
> high-concurrency queries at a very large data scale. Thanks to HBase, most
> Kylin users get less than 1-second query response on average.
>
>
>
> But we also see some limitations when putting Cubes into HBase; I shared
> some of them at HBaseCon Asia 2018 [1] this August. The typical limitations
> include:
>
>
>
> - The row key is the primary index; there is no secondary index so far;
>
>
>
> Filtering by the row key's prefix versus its suffix can give very different
> performance results, so the user needs to design the row key well;
> otherwise, queries will be slow. This is sometimes difficult because the
> user might not be able to predict the filtering patterns ahead of cube
> design.
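>
> To make the difference concrete, a sketch against the HBase 2.x client API
> (the row key values here are hypothetical):
>
>   import org.apache.hadoop.hbase.CompareOperator
>   import org.apache.hadoop.hbase.client.Scan
>   import org.apache.hadoop.hbase.filter.{RegexStringComparator, RowFilter}
>   import org.apache.hadoop.hbase.util.Bytes
>
>   // Prefix filter: becomes a bounded scan over a contiguous row key
>   // range, so only that slice of the table is read.
>   val prefixScan = new Scan()
>     .withStartRow(Bytes.toBytes("20181025"))
>     .withStopRow(Bytes.toBytes("20181026"))
>
>   // Suffix filter: no range can be derived from the key order, so HBase
>   // must scan the whole table and filter rows one by one.
>   val suffixScan = new Scan().setFilter(
>     new RowFilter(CompareOperator.EQUAL, new RegexStringComparator("beijing$")))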
>
>
>
> - HBase is a key-value store instead of a columnar storage;
>
>
>
> Kylin combines multiple measures (columns) into fewer column families for
> smaller data size (the row key size is remarkable). This often causes HBase
> to read more data than requested.
>
>
>
> - HBase cannot run on YARN;
>
>
>
> This makes the deployment and auto-scaling a little complicated, especially
> in the cloud.
>
>
>
> In a word, HBase is a complicated storage engine for Kylin; maintenance and
> debugging are also hard for normal developers. Now we're planning to seek a
> simple, lightweight, read-only storage engine for Kylin. The new solution
> should have the following characteristics:
>
>
>
> - Columnar layout with compression for efficient I/O;
> - Index on each column for quick filtering and seeking;
> - MapReduce / Spark API for parallel processing;
> - HDFS compliant for scalability and availability;
> - Mature, stable, and extensible.
>
>
>
> With the plugin architecture [2] introduced in Kylin 1.5, adding multiple
> storages to Kylin is possible. Some companies, like Kyligence Inc and
> Meituan.com, have developed customized storage engines for Kylin in their
> products or platforms. In their experience, columnar storage is a good
> supplement to the HBase engine. Kaisen Kang from Meituan.com shared their
> KOD (Kylin on Druid) solution [3] at this August's Kylin meetup in Beijing.
>
>
>
> We plan to do a PoC with Apache Parquet + Apache Spark in the next phase.
> Parquet is a standard columnar file format that is widely supported by many
> projects like Hive, Impala, Drill, etc., and it is adding page-level column
> indexes to support fine-grained filtering. Apache Spark can provide
> parallel computing over Parquet and can be deployed on YARN, Mesos, and
> Kubernetes. With this combination, data persistence and computation are
> separated, which makes scaling in and out much easier than before.
> Benefiting from Spark's flexibility, we can also push down more computation
> from Kylin to the Hadoop cluster. Besides Parquet, Apache ORC is also a
> candidate.
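>
> As a small illustration of the pushdown this relies on (reusing the
> hypothetical cuboid path and dt column from the snippets above), Spark
> reports in the physical plan which predicates were pushed to the Parquet
> reader, so row groups (and, once page-level column indexes arrive, pages)
> that cannot match are skipped at read time:
>
>   val df = ss.read.parquet("path/to/CuboidFile")
>     .filter(col("dt") === "2018-10-25")
>
>   // Look for the predicate under "PushedFilters" in the plan output.
>   df.explain()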
>
>
>
> Now I raise this discussion to get your ideas about Kylin's
> next-generation storage engine. If you have good ideas or any related data,
> you are welcome to discuss them in the community.
>
>
>
> Thank you!
>
>
>
> [1] Apache Kylin on HBase:
> https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
>
> [2] Apache Kylin Plugin Architecture:
> https://kylin.apache.org/development/plugin_arch.html
>
> [3] Kylin on Druid storage engine practice (基于Druid的Kylin存储引擎实践):
> https://blog.bcmeng.com/post/kylin-on-druid.html
>
> --
>
> Best regards,
> Shaofeng Shi 史少锋
>
--
Best regards,
Shaofeng Shi 史少锋