[ https://issues.apache.org/jira/browse/KYLIN-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yaguang Jia updated KYLIN-5530:
-------------------------------
    Attachment: (English) KYLIN-5530 Build Performance Optimization.pdf

> Build Performance Optimization
> ------------------------------
>
>                 Key: KYLIN-5530
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5530
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>    Affects Versions: 5.0-alpha
>            Reporter: Yaguang Jia
>            Assignee: Yaguang Jia
>            Priority: Major
>             Fix For: 5.0-beta
>
>         Attachments: (Chinese) KYLIN-5530 Build Performance Optimization.pdf, (English) KYLIN-5530 Build Performance Optimization.pdf
>
>
> 1. Remove the repartitionWriter method for building indexes
> Background: on the cloud, repartitioning is too expensive because of the
> read and write IO cost of object storage, so the problems it causes are
> especially pronounced.
> Currently, index building first writes the index data to a temp directory,
> then reads it back and repartitions it into new data files for storage.
> This approach wastes a lot of IO and should be removed. Instead, the data
> should be repartitioned and written directly into the final index files.
> This requires reworking Spark's repartition to meet the following goals:
> - Handle skewed data
> - Avoid producing a large number of small files
>
> 2. When building a Flat Table, dimension tables read the Snapshot files
> directly
> The reasons are as follows:
> - If a dimension table is a view, the view is computed once when building
> the snapshot and once more when building the flat table, so a single build
> computes the dimension table view twice.
> - The data format of the source data, among other things, is uncertain.
> Optimization direction: when building a flat table, dimension tables are
> not read from the source data; they read the Snapshot file data directly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
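
The two goals listed under item 1 (skew and small files) could be approached by sizing the output partitions before the single direct write. The following is only a minimal sketch in plain Python, not Kylin or Spark code: the function names, the 128 MiB target file size, and the idea of spreading an oversized key over several sub-partitions are all assumptions made for illustration and do not come from the issue.

```python
import math

# Hypothetical target size for one output file; the issue does not specify one.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def plan_partitions(total_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Pick an output partition count so files land near the target size.

    Too many partitions would produce many small files; too few would
    produce oversized ones. At least one partition is always returned.
    """
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


def plan_salts(key_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """For each partition key, decide how many sub-partitions to split it into.

    A key whose data exceeds the target size is treated as skewed and is
    spread over several sub-partitions; small keys keep a single one.
    """
    return {
        key: max(1, math.ceil(size / target_file_bytes))
        for key, size in key_bytes.items()
    }


if __name__ == "__main__":
    mib = 1024 * 1024
    print(plan_partitions(1000 * mib))
    print(plan_salts({"small_key": 10 * mib, "hot_key": 300 * mib}))
```

In a real build, the per-key sizes would have to come from statistics gathered before the write, and the resulting counts would drive how the data is repartitioned before it goes to the final index files.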