I like Parquet; it is a very efficient format and is supported by various 
projects. But there are some questions if we use Parquet as the cube storage 
format:


1. Is it possible to locate a cuboid quickly in a Parquet file? Should we save 
the cuboid metadata in the Parquet FileMetaData, just in its key/value pairs?


2. I notice that there is a schema field in Parquet's FileMetaData, but in a 
cube, different cuboids have different schemas. Should we just save the base 
cuboid's schema in the schema field? Will this cause storage waste?


3. Can Parquet be extended easily to add indexes, such as a bitmap index or a 
B-tree index for each column?
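A bitmap index would not necessarily have to live inside the Parquet file itself; it could also be built as a sidecar structure keyed by row position. A minimal pure-Python sketch of the idea (all names and data are illustrative):

```python
# Sketch of a per-column bitmap index kept alongside a columnar file.
# Bitmaps are plain Python ints used as bitsets; rows are positions 0..n-1.

def build_bitmap_index(column):
    """Map each distinct value to a bitmap of the row positions holding it."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_matching(bitmap, n_rows):
    """Decode a bitmap back into a sorted list of row positions."""
    return [r for r in range(n_rows) if bitmap >> r & 1]

city = ["SH", "BJ", "SH", "SZ"]
year = [2017, 2018, 2018, 2018]
city_idx = build_bitmap_index(city)
year_idx = build_bitmap_index(year)

# A conjunctive filter (city = 'SH' AND year = 2018) is a bitwise AND:
hits = city_idx["SH"] & year_idx[2018]
print(rows_matching(hits, len(city)))  # -> [2]
```

The appeal is that multi-column filters reduce to cheap bitwise operations before any column data is read; the open question is where to persist such bitmaps relative to the Parquet file.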


4. Do we need to build an RPC server? If we just use YARN to schedule Spark 
tasks for each query, starting/stopping the JVM may take seconds, and then most 
queries will be slower than with HBase. Of course, it is more scalable, and 
some queries may be faster.


Besides using Parquet/ORC, I think there are two other options:


1. Use a customized columnar format. It is more flexible: we can add 
Kylin-specific concepts, like cuboids, into the storage, and it will be easy to 
add different types of indexes as needed. The disadvantage is that it takes 
more effort to define the format and develop it (we cannot leverage existing 
libraries to read/write, and we need to take care of compression ourselves); 
also, the cube data files cannot be used by other projects (do we have this 
need?).
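To make the effort behind option 1 concrete, here is a toy sketch of what "defining the format" means: compressed column chunks followed by a JSON footer, with the footer length in the trailing 4 bytes (a Parquet-style layout). Every layout choice here is illustrative, not a proposed Kylin format:

```python
# Toy custom columnar segment: column chunks, then a JSON footer,
# then 4 bytes holding the footer length. Illustrative only.
import json
import struct
import zlib

def write_segment(path, columns):
    footer = {"columns": {}}
    with open(path, "wb") as f:
        for name, values in columns.items():
            raw = zlib.compress(json.dumps(values).encode("utf-8"))
            footer["columns"][name] = {"offset": f.tell(), "length": len(raw)}
            f.write(raw)
        blob = json.dumps(footer).encode("utf-8")
        f.write(blob)
        f.write(struct.pack("<I", len(blob)))  # footer length, little-endian

def read_column(path, name):
    with open(path, "rb") as f:
        f.seek(-4, 2)                              # read footer length
        (flen,) = struct.unpack("<I", f.read(4))
        f.seek(-4 - flen, 2)                       # read footer
        meta = json.loads(f.read(flen))["columns"][name]
        f.seek(meta["offset"])                     # seek to the one column
        return json.loads(zlib.decompress(f.read(meta["length"])))

write_segment("toy.seg", {"dim": [1, 2, 2], "measure": [10, 20, 30]})
print(read_column("toy.seg", "measure"))  # -> [10, 20, 30]
```

Even this toy version has to handle encoding, compression, and footer layout by hand; a production format would additionally need type systems, statistics, versioning, and corruption handling, which is exactly the development cost mentioned above.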


2. Use local storage rather than HDFS, like Kudu/Druid/ClickHouse. The 
advantage of this solution is that query performance will be very good, and 
everything can be controlled by Kylin. The disadvantage is that it needs more 
development effort, especially for cluster management, failover, and 
scalability.


At 2018-09-29 10:53:35, "Zhong, Yanghong" <yangzh...@ebay.com.INVALID> wrote:
>I have one question about the characteristics of Kylin columnar storage files. 
>That is, whether it should be a standard or common one. Since the data stored 
>in the storage engine is Kylin-specific, is it necessary for other engines to 
>know how to build data into and read data from the storage engine? 
>
>In my opinion, it's not necessary, and Kylin columnar storage files should be 
>Kylin-specific. We can leverage the advantages of other columnar formats, like 
>data-skipping indexes, bloom filters, and dictionaries, then create a new file 
>format with Kylin-specific requirements, like cuboid info.
>
>------
>Best regards,
>Yanghong Zhong
>
>
>On 9/28/18, 2:15 PM, "ShaoFeng Shi" <shaofeng...@apache.org> wrote:
>
>    Hi Kylin developers.
>    
>    HBase has been Kylin’s storage engine since the first day; Kylin on HBase
>    has been verified as a success which can support low latency & high
>    concurrency queries on a very large data scale. Thanks to HBase, most Kylin
>    users can get on average less than 1-second query response.
>    
>    But we also see some limitations when putting Cubes into HBase; I shared
>    some of them in the HBaseConf Asia 2018[1] this August. The typical
>    limitations include:
>    
>       - Rowkey is the primary index, no secondary index so far;
>    
>    Filtering by a row key’s prefix versus its suffix can give very different
>    performance results, so the user needs to design the row key well;
>    otherwise, queries will be slow. This is sometimes difficult because the
>    user might not be able to predict the filtering patterns ahead of cube
>    design.
>    
>       - HBase is a key-value store instead of a columnar storage
>    
>    Kylin combines multiple measures (columns) into fewer column families to
>    reduce data size (the row key overhead is significant). This causes HBase
>    to often read more data than requested.
>    
>       - HBase cannot run on YARN
>    
>    This makes the deployment and auto-scaling a little complicated, especially
>    in the cloud.
>    
>    In short, HBase is complicated as Kylin’s storage; maintenance and
>    debugging are also hard for normal developers. Now we’re planning to seek
>    a simple, lightweight, read-only storage engine for Kylin. The new
>    solution should have the following characteristics:
>    
>       - Columnar layout with compression for efficient I/O;
>       - Index by each column for quick filtering and seeking;
>       - MapReduce / Spark API for parallel processing;
>       - HDFS compliant for scalability and availability;
>       - Mature, stable and extensible;
>    
>    With the plugin architecture[2] introduced in Kylin 1.5, adding multiple
>    storages to Kylin is possible. Some companies like Kyligence Inc and
>    Meituan.com, have developed their customized storage engine for Kylin in
>    their product or platform. In their experience, columnar storage is a good
>    supplement for the HBase engine. Kaisen Kang from Meituan.com has shared
>    their KOD (Kylin on Druid) solution[3] in this August’s Kylin meetup in
>    Beijing.
>    
>    We plan to do a PoC with Apache Parquet + Apache Spark in the next phase.
>    Parquet is a standard columnar file format and has been widely supported by
>    many projects like Hive, Impala, Drill, etc. Parquet is adding the page
>    level column index to support fine-grained filtering.  Apache Spark can
>    provide the parallel computing over Parquet and can be deployed on
>    YARN/Mesos and Kubernetes. With this combination, the data persistence and
>    computation are separated, which makes the scaling in/out much easier than
>    before. Benefiting from Spark's flexibility, we can also push down more
>    computation from Kylin to the Hadoop cluster. Besides Parquet, Apache ORC
>    is also a candidate.
>    
>    Now I raise this discussion to get your ideas about Kylin’s next-generation
>    storage engine. If you have good ideas or any related data, you are
>    welcome to discuss them in the community.
>    
>    Thank you!
>    
>    [1] Apache Kylin on HBase
>    
> https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
>    [2] Apache Kylin Plugin Architecture
>    
> https://kylin.apache.org/development/plugin_arch.html
>    [3] Practice of a Druid-based Kylin storage engine (基于Druid的Kylin存储引擎实践)
> https://blog.bcmeng.com/post/kylin-on-druid.html
>    Best regards,
>    
>    Shaofeng Shi 史少锋
>    
>
