[
https://issues.apache.org/jira/browse/KYLIN-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiaoxiang Yu resolved KYLIN-5530.
---------------------------------
Resolution: Fixed
> Build Performance Optimization
> ------------------------------
>
> Key: KYLIN-5530
> URL: https://issues.apache.org/jira/browse/KYLIN-5530
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Affects Versions: 5.0-alpha
> Reporter: Yaguang Jia
> Assignee: Yaguang Jia
> Priority: Major
> Fix For: 5.0-beta
>
> Attachments: (Chinese) KYLIN-5530 Build Performance Optimization.pdf,
> (English) KYLIN-5530 Build Performance Optimization.pdf
>
>
> 1. Remove the repartitionWriter method for building indexes
> Background: in cloud deployments, repartitioning is expensive because of the
> read and write IO characteristics of object storage, and that cost causes
> significant problems.
> Index building currently writes index data to a temp directory first, then
> reads it back and repartitions it into new data files for storage. This
> approach wastes a lot of IO and should be removed. Instead, the build should
> repartition and write directly into the final index files, adapting Spark's
> repartition to meet the following goals:
> - Handle skewed data
> - Avoid producing a large number of small files
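
To make the two goals above concrete, here is a minimal sketch of how output file counts might be planned before a single repartition-and-write pass. It is not Kylin's implementation: the function names, the size-based heuristic, and the `max_files` cap are all assumptions for illustration only.

```python
def plan_output_files(total_bytes, target_file_bytes, max_files=1000):
    """Pick an output file count so files land near the target size.

    Hypothetical helper: sizing by total bytes avoids many small files.
    """
    if total_bytes <= 0:
        return 1
    # Integer ceiling division, then clamp to [1, max_files].
    wanted = -(-total_bytes // target_file_bytes)
    return max(1, min(max_files, wanted))


def split_skewed_keys(key_bytes, target_file_bytes):
    """Map each key to the number of sub-partitions it is spread over.

    Hypothetical helper: a key larger than one target file is split
    across several sub-partitions, so no single task writes all of it.
    """
    return {
        key: max(1, -(-size // target_file_bytes))
        for key, size in key_bytes.items()
    }
```

Under these assumptions, a key holding three target files' worth of data is spread over three writers, while small keys keep a single writer each.
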
> 2. When building a Flat Table, read dimension tables directly from the
> Snapshot files
> The reasons are as follows:
> - If the dimension table is a view, the view is computed once when building
> the snapshot and again when building the flat table, so a single build
> computes the dimension table view twice.
> - The data format of the source data, among other things, is uncertain.
> Optimization direction: when building a flat table, dimension tables are not
> read from the source data; they are read directly from the Snapshot file data.
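
The optimization direction above can be sketched as a small lookup that prefers an existing snapshot over the source table. This is only an illustration: `resolve_dimension_source`, the `snapshots` mapping, and the fallback to the source table when no snapshot exists are assumptions, not behavior described in the issue.

```python
def resolve_dimension_source(table, snapshots):
    """Decide where a flat table build reads a dimension table from.

    `snapshots` is a hypothetical mapping of table name to snapshot path.
    Returns ("snapshot", path) when a snapshot exists, so a view-backed
    dimension table is not computed a second time. The fallback to
    ("source", table) is an assumption for this sketch.
    """
    path = snapshots.get(table)
    if path is not None:
        return ("snapshot", path)
    return ("source", table)
```

With a snapshot registered for a table, the build reads the snapshot files; the view behind that table is then evaluated only during snapshot building.
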
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)