[ 
https://issues.apache.org/jira/browse/KYLIN-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928302#comment-17928302
 ] 

Guoliang Sun commented on KYLIN-6025:
-------------------------------------

h3. Dev Design

Before writing to the internal table, sort the data by the partition column 
(`partition col`) so that, as far as possible, rows belonging to the same 
partition are distributed to the same task. The fewer tasks that write to a 
given partition, the fewer files that partition ends up with.  
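
To illustrate why sorting helps, here is a minimal sketch in plain Java, not the actual Kylin or Spark code. It assumes rows are handed to tasks in contiguous, equally sized chunks and that each task writes one file per partition value it sees. The class `PartitionSortSketch` and the method `countFiles` are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; the real write path is more involved.
class PartitionSortSketch {

    // Counts output files under the assumption that rows are split into
    // contiguous chunks (one chunk per task) and that every task writes
    // one file for each distinct partition value in its chunk.
    static int countFiles(List<String> partitionValues, int numTasks) {
        int chunk = (partitionValues.size() + numTasks - 1) / numTasks; // ceiling division
        Set<String> files = new HashSet<>();
        for (int i = 0; i < partitionValues.size(); i++) {
            int task = i / chunk;
            files.add(task + "/" + partitionValues.get(i));
        }
        return files.size();
    }

    public static void main(String[] args) {
        // 12 rows whose partition values arrive interleaved across 4 days.
        String[] days = {"2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"};
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            rows.add(days[i % days.length]);
        }

        // Same rows, sorted by the partition value before distribution.
        List<String> sorted = new ArrayList<>(rows);
        Collections.sort(sorted);

        System.out.println("files without sorting: " + countFiles(rows, 4));   // 12
        System.out.println("files with sorting:    " + countFiles(sorted, 4)); // 4
    }
}
```

In this toy input every task touches three different partitions when the rows are unsorted, whereas after sorting each task writes to a single partition.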

Add a configuration to control whether the sort is performed: 
`kylin.internal-table.sort-by-partition.enabled`, with a default value of 
`true`. This configuration can be set at both the system level and the 
project level.  

Additionally, provide a table-level configuration, `sortByPartition`, which 
has the highest priority and overrides the system-level and project-level 
settings. It can only be configured via the API, by specifying 
`tbl_properties` in the request when creating or updating an internal table.
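
The lookup order could be sketched as follows. This is a hedged illustration, not the Kylin implementation: the class `SortByPartitionConfig` and the method `resolve` are hypothetical, the three maps stand in for whatever configuration objects Kylin really uses, and the assumption that a project-level value overrides a system-level one is mine (the design only states that the table-level property has the highest priority).

```java
import java.util.Map;

// Hypothetical helper showing one possible precedence order.
class SortByPartitionConfig {

    static final String CONFIG_KEY = "kylin.internal-table.sort-by-partition.enabled";
    static final String TABLE_PROPERTY = "sortByPartition";
    static final boolean DEFAULT_ENABLED = true; // default value stated in the design

    // Resolution order: table property, then project config (assumed to
    // override system config), then system config, then the default.
    static boolean resolve(Map<String, String> systemConfig,
                           Map<String, String> projectConfig,
                           Map<String, String> tblProperties) {
        String tableValue = tblProperties.get(TABLE_PROPERTY);
        if (tableValue != null) {
            return Boolean.parseBoolean(tableValue);
        }
        String projectValue = projectConfig.get(CONFIG_KEY);
        if (projectValue != null) {
            return Boolean.parseBoolean(projectValue);
        }
        String systemValue = systemConfig.get(CONFIG_KEY);
        if (systemValue != null) {
            return Boolean.parseBoolean(systemValue);
        }
        return DEFAULT_ENABLED;
    }

    public static void main(String[] args) {
        Map<String, String> system = Map.of(CONFIG_KEY, "true");
        Map<String, String> project = Map.of(CONFIG_KEY, "true");
        Map<String, String> table = Map.of(TABLE_PROPERTY, "false");

        // The table-level property wins even though both other levels enable sorting.
        System.out.println(resolve(system, project, table)); // false
    }
}
```

With this order, a single table can opt out of sorting through `tbl_properties` without changing the project or system settings.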

> Support file merging within partitions for internal tables
> ----------------------------------------------------------
>
>                 Key: KYLIN-6025
>                 URL: https://issues.apache.org/jira/browse/KYLIN-6025
>             Project: Kylin
>          Issue Type: New Feature
>    Affects Versions: 5.0.0
>            Reporter: Guoliang Sun
>            Priority: Major
>
> When multiple tasks write to the same internal table partition during the 
> build phase, the data is written into multiple subdirectories, which can 
> easily lead to an excessive number of files and increase HDFS pressure. A 
> reasonable merging mechanism is needed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
