Hi Billy,

Yes, cloud storage should be considered. The traditional file layouts
designed for HDFS may not work well on cloud storage, so Kylin needs to
allow extension here. I will add this to the requirements.

On Sat, Sep 29, 2018 at 3:22 PM, Billy Liu <[email protected]> wrote:

> Hi Shaofeng,
>
> I'd like to add one more characteristic: cloud-native storage support.
> Quite a few users are using S3 on AWS, or Azure Data Lake Storage on
> Azure. If the new storage engine could be more cloud friendly, more users
> could benefit from it.
>
> With warm regards
>
> Billy Liu
>
> On Fri, Sep 28, 2018 at 2:15 PM, ShaoFeng Shi <[email protected]> wrote:
> >
> > Hi Kylin developers,
> >
> > HBase has been Kylin's storage engine since the first day; Kylin on HBase
> > has been verified as a success which can support low latency & high
> > concurrency queries on a very large data scale. Thanks to HBase, most
> > Kylin users get, on average, query responses in less than 1 second.
> >
> > But we also see some limitations when putting Cubes into HBase; I shared
> > some of them at HBaseConf Asia 2018[1] this August. The typical
> > limitations include:
> >
> > - Rowkey is the primary index, no secondary index so far;
> >
> > Filtering by a row key's prefix versus its suffix can give very
> > different performance, so the user needs to design the row key
> > carefully; otherwise, queries will be slow. This is sometimes difficult
> > because the user might not be able to predict the filtering patterns
> > ahead of cube design.
> >
> > - HBase is a key-value store rather than a columnar store
> >
> > Kylin combines multiple measures (columns) into fewer column families for
> > smaller data size (the row key size is significant). As a result, HBase
> > often needs to read more data than requested.
> >
> > - HBase cannot run on YARN
> >
> > This makes deployment and auto-scaling a little complicated, especially
> > in the cloud.
> >
> > In short, HBase is a complicated storage engine for Kylin. Maintenance
> > and debugging are also hard for ordinary developers. We are now planning
> > to look for a simple, lightweight, read-only storage engine for Kylin.
> > The new solution should have the following characteristics:
> >
> > - Columnar layout with compression for efficient I/O;
> > - Index by each column for quick filtering and seeking;
> > - MapReduce / Spark API for parallel processing;
> > - HDFS compliant for scalability and availability;
> > - Mature, stable and extensible;
> >
> > With the plugin architecture[2] introduced in Kylin 1.5, adding multiple
> > storage engines to Kylin is possible. Some companies, like Kyligence Inc
> > and Meituan.com, have developed their own customized storage engines for
> > Kylin in their products or platforms. In their experience, columnar
> > storage is a good supplement to the HBase engine. Kaisen Kang from
> > Meituan.com shared their KOD (Kylin on Druid) solution[3] at this
> > August's Kylin meetup in Beijing.
> >
> > We plan to do a PoC with Apache Parquet + Apache Spark in the next phase.
> > Parquet is a standard columnar file format and is widely supported by
> > many projects like Hive, Impala, Drill, etc. Parquet is adding a
> > page-level column index to support fine-grained filtering. Apache Spark
> > can provide parallel computing over Parquet and can be deployed on
> > YARN/Mesos and Kubernetes. With this combination, data persistence and
> > computation are separated, which makes scaling in/out much easier than
> > before. Benefiting from Spark's flexibility, we can push down more
> > computation from Kylin to the Hadoop cluster. Besides Parquet, Apache
> > ORC is also a candidate.
> >
> > I am raising this discussion to get your ideas about Kylin's
> > next-generation storage engine. If you have good ideas or any related
> > data, you are welcome to discuss them in the community.
> >
> > Thank you!
> >
> > [1] Apache Kylin on HBase
> > https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
> > [2] Apache Kylin Plugin Architecture
> > https://kylin.apache.org/development/plugin_arch.html
> > [3] Kylin Storage Engine Practice Based on Druid (基于Druid的Kylin存储引擎实践)
> > https://blog.bcmeng.com/post/kylin-on-druid.html
> >
> > --
> > Best regards,
> >
> > Shaofeng Shi 史少锋
>

--
Best regards,

Shaofeng Shi 史少锋
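
On the row key limitation mentioned in the original message: the gap between
prefix and suffix filtering comes from the keys being stored in sorted order.
Below is a minimal sketch in plain Python, not HBase or Kylin code. The
"<country>|<date>" key layout, the sample values and the function names are
all assumptions made up for illustration. A prefix filter can seek to the
first matching key and stop as soon as the prefix no longer matches, while a
suffix filter has to look at every key.

```python
from bisect import bisect_left

# Hypothetical row keys of the form "<country>|<date>", kept sorted the way a
# key-ordered store keeps its keys. The values are invented for illustration.
ROW_KEYS = sorted(
    f"{country}|2018-09-{day:02d}"
    for country in ("CN", "DE", "US")
    for day in range(1, 31)
)


def scan_by_prefix(keys, prefix):
    """Seek to the first key >= prefix, then read until the prefix stops matching."""
    start = bisect_left(keys, prefix)
    matched = []
    touched = 0  # number of keys actually examined
    for key in keys[start:]:
        touched += 1
        if not key.startswith(prefix):
            break
        matched.append(key)
    return matched, touched


def scan_by_suffix(keys, suffix):
    """Sort order does not help with a suffix, so every key is examined."""
    matched = [key for key in keys if key.endswith(suffix)]
    return matched, len(keys)


prefix_hits, prefix_touched = scan_by_prefix(ROW_KEYS, "DE|")
suffix_hits, suffix_touched = scan_by_suffix(ROW_KEYS, "|2018-09-15")
print("prefix filter:", len(prefix_hits), "hits,", prefix_touched, "keys examined")
print("suffix filter:", len(suffix_hits), "hits,", suffix_touched, "keys examined")
```

In this toy data set the suffix filter returns fewer rows than the prefix
filter yet examines every key, which is why the dimension order chosen for
the row key matters so much for query latency.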
