[jira] [Commented] (KYLIN-5949) Support DeltaLake as Index storage

pengfei.zhan (Jira) Mon, 05 Aug 2024 03:26:04 -0700


    [ 
https://issues.apache.org/jira/browse/KYLIN-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871019#comment-17871019
 ]


pengfei.zhan commented on KYLIN-5949:
-------------------------------------

h1. Design goals


 # Segment logicalization: segments no longer manage index data; a segment is
   retained only as a logical concept.
 # Index storage as tables: each index type is stored as its own table type, so
   that the query engine's table-handling capabilities can be used more fully.
 # Extensible index storage type: the default storage changes from Parquet to
   Delta Lake, and Iceberg and Hudi can be swapped in quickly.
 # Dynamic tuning of build and query runtime parameters: execution engine
   parameters are adjusted at runtime (build and query) according to index
   characteristics.
 # Stable query performance: query performance should stay relatively
   consistent for both early and recent data.
 # Targeted index optimization: the index that serves a specific query can be
   optimized for that query, so that the query is accelerated as far as
   possible.
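
A minimal sketch of how the segment logicalization and extensible storage goals could fit together, under stated assumptions. Every name here (LogicalSegment, IndexTable, SUPPORTED_FORMATS) is hypothetical and is not Kylin's API, and the table path layout is an assumption, since the comment gives the V3 layout only as an image.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical names throughout; this illustrates the goals, not Kylin's API.
SUPPORTED_FORMATS = {"delta", "iceberg", "hudi"}


@dataclass(frozen=True)
class LogicalSegment:
    """A segment that owns no files; it only names a half-open date range."""
    start: date
    end: date

    def predicate(self, column: str) -> str:
        # The range becomes a filter on the index table, not a directory.
        return (f"{column} >= '{self.start.isoformat()}' "
                f"AND {column} < '{self.end.isoformat()}'")


@dataclass(frozen=True)
class IndexTable:
    """One table per index, independent of any segment."""
    model_id: str
    index_id: int
    storage_format: str = "delta"

    def __post_init__(self) -> None:
        if self.storage_format not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported storage format: {self.storage_format}")

    def location(self, root: str) -> str:
        # Assumed layout: no segment ID appears in the table path.
        return f"{root}/{self.model_id}/{self.index_id}"
```

In this sketch a segment contributes only a filter such as LogicalSegment(date(2024, 1, 1), date(2024, 2, 1)).predicate("dt"), while the storage format is a property of the index table that could be switched without touching segments.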
h1. Storage format changes


Original Segment + parquet storage


# V1 Cube result data file structure
parquet/
└── dc65dd61-dbe3-8f46-7d44-668b688b96c1 (model ID)
    └── 12d2c4c1-248f-b1f8-0bdb-88b0eb9c8580 (Segment ID)
        ├── 1 (aggregate index ID)
        │   └── part-00000-393b8b08-84fc-40c6-8c2e-d579485dcc57-c000.snappy.parquet (data)
        ├── 10001
        ├── 20001
        ├── 30001
        ├── 40001
        └── 20000000001 (detail index ID)
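
The V1 layout above nests each index directory under a model ID and a Segment ID. A small sketch of that path composition follows. The helper names and the root argument are hypothetical, and the threshold separating aggregate from detail index IDs is an assumption that fits the sample IDs but is not stated in this comment.

```python
from pathlib import PurePosixPath

# Assumption: detail index IDs start at 20,000,000,000. This is consistent
# with the sample ID 20000000001 in the tree but is not stated in the comment.
DETAIL_INDEX_START_ID = 20_000_000_000


def v1_index_dir(root: str, model_id: str, segment_id: str,
                 index_id: int) -> PurePosixPath:
    """Return <root>/parquet/<model ID>/<Segment ID>/<index ID>."""
    return PurePosixPath(root) / "parquet" / model_id / segment_id / str(index_id)


def index_kind(index_id: int) -> str:
    """Classify an index ID as 'detail' or 'aggregate' (assumed threshold)."""
    return "detail" if index_id >= DETAIL_INDEX_START_ID else "aggregate"
```

Because the Segment ID is part of every path, the data for one index is spread across as many directories as there are segments, which is the coupling that the segment logicalization goal removes.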


V3 file format: data is organized by Delta Lake and stored as Parquet

!image-2024-08-05-17-48-53-286.png|width=666,height=391!

> Support DeltaLake as Index storage
> ----------------------------------
>
>                 Key: KYLIN-5949
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5949
>             Project: Kylin
>          Issue Type: New Feature
>          Components: Job Engine, Query Engine
>    Affects Versions: 5.0.0
>            Reporter: pengfei.zhan
>            Assignee: Zhimin Wu
>            Priority: Major
>             Fix For: 5.0.0
>
>         Attachments: image-2024-08-05-17-48-53-286.png, image.png
>
>
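> h3. Design goals
>  # Segment logicalization: segments no longer manage index data; a segment is retained only as a logical concept
>  # Index storage as tables: each index type is stored as its own table type, so that the query engine's table-handling capabilities can be used more fully
>  # Extensible index storage type: the default storage changes from Parquet to Delta Lake, and Iceberg and Hudi can be swapped in quickly
>  # Dynamic tuning of build and query runtime parameters: execution engine parameters are adjusted at runtime (build and query) according to index characteristics
>  # Stable query performance: query performance should stay relatively consistent for both early and recent data
>  # Targeted index optimization: the index that serves a specific query can be optimized for that query, so that the query is accelerated as far as possible
> h3. Storage format changes
> h4. Original Segment + Parquet storage
> {code:java}
> # V1 Cube result data file structure
> parquet/
> └── dc65dd61-dbe3-8f46-7d44-668b688b96c1 (model ID)
>     └── 12d2c4c1-248f-b1f8-0bdb-88b0eb9c8580 (Segment ID)
>         ├── 1 (aggregate index ID)
>         │   └── part-00000-393b8b08-84fc-40c6-8c2e-d579485dcc57-c000.snappy.parquet (data)
>         ├── 10001
>         ├── 20001
>         ├── 30001
>         ├── 40001
>         └── 20000000001 (detail index ID){code}
> h4. V3 file format: data is organized by Delta Lake and stored as Parquet
> !image-2024-08-05-17-48-53-286.png|width=666,height=391!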



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
