Hi Yanghong,

Thanks for your question. I don't think other engines are required to know how to read Kylin's storage, but it would be nice to have if possible. We can extend the file format if Parquet or ORC cannot meet Kylin's requirements, but it is not necessary to invent a new format.

Zhong, Yanghong <yangzh...@ebay.com.invalid> wrote on Sat, Sep 29, 2018 at 10:59 AM:

> I have one question about the characteristics of Kylin columnar storage
> files: should the format be a standard or common one? Since the data
> stored in the storage engine is Kylin-specific, is it necessary for other
> engines to know how to write data into and read data from the storage
> engine?
>
> In my opinion, it's not necessary, and Kylin columnar storage files should
> be Kylin-specific. We can borrow the strengths of other columnar file
> formats, such as data-skipping indexes, bloom filters, and dictionaries,
> and then create a new file format that covers Kylin-specific requirements,
> such as cuboid info.
>
> ------
> Best regards,
> Yanghong Zhong
>
>
> On 9/28/18, 2:15 PM, "ShaoFeng Shi" <shaofeng...@apache.org> wrote:
>
>     Hi Kylin developers.
>
>     HBase has been Kylin’s storage engine since the first day; Kylin on
>     HBase has proven successful, supporting low-latency, high-concurrency
>     queries at a very large data scale. Thanks to HBase, most Kylin users
>     get query responses in under 1 second on average.
>
>     But we also see some limitations when putting Cubes into HBase; I
>     shared some of them at HBaseConf Asia 2018[1] this August. The typical
>     limitations include:
>
>     - Row key is the primary index; there is no secondary index so far
>
>     Filtering by the row key’s prefix and by its suffix can yield very
>     different performance, so the user needs to design the row key well;
>     otherwise, queries will be slow. This is sometimes difficult because
>     the user may not be able to predict the filtering patterns ahead of
>     cube design.
>
>     - HBase is a key-value store rather than a columnar store
>
>     Kylin combines multiple measures (columns) into fewer column families
>     to reduce data size (the row key size is significant). As a result,
>     HBase often needs to read more data than requested.
>
>     - HBase cannot run on YARN
>
>     This makes deployment and auto-scaling somewhat complicated,
>     especially in the cloud.
>
>     In short, HBase is a complicated choice for Kylin’s storage.
>     Maintenance and debugging are also hard for ordinary developers. Now
>     we’re planning to seek a simple, lightweight, read-only storage engine
>     for Kylin. The new solution should have the following characteristics:
>
>     - Columnar layout with compression for efficient I/O;
>     - Index by each column for quick filtering and seeking;
>     - MapReduce / Spark API for parallel processing;
>     - HDFS compliant for scalability and availability;
>     - Mature, stable and extensible;
>
>     With the plugin architecture[2] introduced in Kylin 1.5, adding
>     multiple storages to Kylin is possible. Some companies, such as
>     Kyligence Inc and Meituan.com, have developed customized storage
>     engines for Kylin in their products or platforms. In their experience,
>     columnar storage is a good supplement to the HBase engine. Kaisen Kang
>     from Meituan.com shared their KOD (Kylin on Druid) solution[3] at this
>     August’s Kylin meetup in Beijing.
>
>     We plan to do a PoC with Apache Parquet + Apache Spark in the next
>     phase. Parquet is a standard columnar file format and is widely
>     supported by many projects such as Hive, Impala, and Drill. Parquet is
>     adding a page-level column index to support fine-grained filtering.
>     Apache Spark can provide parallel computing over Parquet and can be
>     deployed on YARN, Mesos, and Kubernetes. With this combination, data
>     persistence and computation are separated, which makes scaling in and
>     out much easier than before. Benefiting from Spark's flexibility, we
>     can also push down more computation from Kylin to the Hadoop cluster.
>     Besides Parquet, Apache ORC is also a candidate.
>
>     Now I raise this discussion to get your ideas about Kylin’s
>     next-generation storage engine.
>     If you have good ideas or any related data, you are welcome to discuss
>     them in the community.
>
>     Thank you!
>
>     [1] Apache Kylin on HBase
>     https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.slideshare.net%2FShiShaoFeng1%2Fapache-kylin-on-hbase-extreme-olap-engine-for-big-data&data=02%7C01%7Cyangzhong%40ebay.com%7C71e694ab5386420bb32908d62509c003%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C636737121143223312&sdata=TuIOe6FxdubqsoRVX8BQb%2FkvSFRrfI0ZvBRDB0euZWk%3D&reserved=0
>     [2] Apache Kylin Plugin Architecture
>     https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fkylin.apache.org%2Fdevelopment%2Fplugin_arch.html&data=02%7C01%7Cyangzhong%40ebay.com%7C71e694ab5386420bb32908d62509c003%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C636737121143223312&sdata=6WPLbX9Rat51rj3VCc1AuVDxTw5HO2ezPO0Cj8m231g%3D&reserved=0
>     [3] Kylin Storage Engine Practice Based on Druid (基于Druid的Kylin存储引擎实践)
>     https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fblog.bcmeng.com%2Fpost%2Fkylin-on-druid.html--&data=02%7C01%7Cyangzhong%40ebay.com%7C71e694ab5386420bb32908d62509c003%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C636737121143223312&sdata=A2j40L1%2BcoccgZSRGs4X%2F5TUDi2VQqjhdNoMThfJffA%3D&reserved=0
>
>     Best regards,
>
>     Shaofeng Shi 史少锋
>
>

--
Best regards,

Shaofeng Shi 史少锋
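
As a rough illustration of the row key limitation described in the quoted message (a filter on the key prefix can seek, a filter on the key suffix has to scan), here is a minimal Python sketch. It is not HBase or Kylin code: the (country, city) key layout, the sample rows, and the function names scan_by_prefix and scan_by_suffix are all hypothetical, and a sorted in-memory list stands in for an ordered key-value store.

```python
import bisect

# Hypothetical cube rows: the row key is (country, city), kept sorted the way
# an ordered key-value store would keep it. Values stand in for a measure.
rows = sorted(
    ((country, city), sales)
    for country, city, sales in [
        ("CN", "Beijing", 120),
        ("CN", "Shanghai", 95),
        ("DE", "Berlin", 40),
        ("US", "Austin", 30),
        ("US", "Berlin", 5),
        ("US", "Seattle", 60),
    ]
)
keys = [key for key, _ in rows]


def scan_by_prefix(country):
    """Filter on the leading key column: binary search, then a short range scan."""
    start = bisect.bisect_left(keys, (country,))
    touched, hits = 0, []
    for key, value in rows[start:]:
        touched += 1
        if key[0] != country:
            break  # keys are sorted, so the matching range has ended
        hits.append((key, value))
    return hits, touched


def scan_by_suffix(city):
    """Filter on the trailing key column: ordering does not help, so read every row."""
    touched, hits = 0, []
    for key, value in rows:
        touched += 1
        if key[1] == city:
            hits.append((key, value))
    return hits, touched


if __name__ == "__main__":
    print(scan_by_prefix("CN"))
    print(scan_by_suffix("Berlin"))
```

With this toy data, filtering on country touches only the matching key range plus at most one extra row, while filtering on city reads all six rows. That is the gap a per-column index in a columnar format is meant to close.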