[jira] [Updated] (KYLIN-5530) Build Performance Optimization

Yaguang Jia (Jira) Tue, 25 Apr 2023 21:13:04 -0700


     [ 
https://issues.apache.org/jira/browse/KYLIN-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yaguang Jia updated KYLIN-5530:
-------------------------------
    Attachment: (English) KYLIN-5530 Build Performance Optimization.pdf

> Build Performance Optimization
> ------------------------------
>
>                 Key: KYLIN-5530
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5530
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>    Affects Versions: 5.0-alpha
>            Reporter: Yaguang Jia
>            Assignee: Yaguang Jia
>            Priority: Major
>             Fix For: 5.0-beta
>
>         Attachments: (Chinese) KYLIN-5530 Build Performance Optimization.pdf, 
> (English) KYLIN-5530 Build Performance Optimization.pdf
>
>
> 1. Remove the repartitionWriter method used for building indexes
> Background: On the cloud, repartitioning is too expensive because of the 
> read and write IO cost of object storage, and this causes significant 
> problems.
> Index building currently writes index data to a temp directory first, then 
> reads it back and repartitions it into new data files for storage. This 
> approach wastes a lot of IO and should be removed. Instead, the data should 
> be repartitioned and written directly into the final index files, which 
> requires reworking Spark's repartition to achieve the following goals:
> - Handle data skew
> - Avoid producing a large number of small files
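
The two goals above (skew and small files) can be illustrated with a minimal sketch. This is not Kylin's implementation, and the issue does not describe the algorithm. It assumes that per-bucket sizes are known before the write, and it uses a swapped-in technique (first-fit-decreasing packing) to assign data to final files in one pass. The names plan_output_files, bucket_bytes, and target_bytes are hypothetical.

```python
from typing import Dict, List, Tuple

Piece = Tuple[str, int]  # (bucket name, bytes taken from that bucket)


def plan_output_files(bucket_bytes: Dict[str, int], target_bytes: int) -> List[List[Piece]]:
    """Assign bucket data to output files of at most target_bytes each.

    Hypothetical planner: a real build would derive sizes from statistics
    and turn this plan into a partitioning key before writing.
    """
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")

    pieces: List[Piece] = []
    for bucket, size in sorted(bucket_bytes.items()):
        # Skew: cut an oversized bucket into target-sized slices.
        while size > target_bytes:
            pieces.append((bucket, target_bytes))
            size -= target_bytes
        if size > 0:
            pieces.append((bucket, size))

    # Small files: pack pieces into shared files, largest first (first fit).
    pieces.sort(key=lambda piece: piece[1], reverse=True)
    files: List[List[Piece]] = []
    free: List[int] = []  # remaining capacity of each planned file
    for piece in pieces:
        for i, room in enumerate(free):
            if piece[1] <= room:
                files[i].append(piece)
                free[i] -= piece[1]
                break
        else:
            files.append([piece])
            free.append(target_bytes - piece[1])
    return files
```

With such a plan, each row could be tagged with its planned file and written once to the final location, so the temp directory round trip is not needed.
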
> 2. When building a Flat Table, read dimension tables directly from the 
> Snapshot files
> The reasons are as follows:
> - If the dimension table is a view, the view is computed once when building 
> the snapshot and again when building the flat table, so a single build 
> computes the view twice.
> - The data format of the source data, among other things, is uncertain.
> Optimization direction: When building a flat table, the dimension table is 
> not read from the source data; it is read directly from the Snapshot file data.
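
The effect of reading the snapshot instead of the source can be sketched as follows. This is a toy model, not Kylin code: ViewDimension, build_snapshot, build_flat_table, and the in-memory store are hypothetical stand-ins for the view, the snapshot step, the flat table step, and the snapshot files. The join is assumed to be an inner join for simplicity.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class ViewDimension:
    """A dimension table backed by a view; counts how often the view runs."""

    def __init__(self, name: str, query: Callable[[], List[Row]]):
        self.name = name
        self._query = query
        self.evaluations = 0

    def read_source(self) -> List[Row]:
        self.evaluations += 1  # every source read recomputes the view
        return self._query()


def build_snapshot(dim: ViewDimension, store: Dict[str, List[Row]]) -> None:
    # The only place where the view is evaluated.
    store[dim.name] = [dict(row) for row in dim.read_source()]


def build_flat_table(fact_rows: List[Row], dim_name: str, key: str,
                     store: Dict[str, List[Row]]) -> List[Row]:
    # Read the dimension rows from the snapshot, never from the source.
    lookup = {row[key]: row for row in store[dim_name]}
    flat: List[Row] = []
    for fact in fact_rows:
        match = lookup.get(fact[key])
        if match is not None:  # inner join (assumption)
            flat.append({**match, **fact})
    return flat
```

In this model the view runs once per build, and the flat table sees the snapshot's stable format rather than whatever format the source happens to have.
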



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
