[ https://issues.apache.org/jira/browse/FLINK-27338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
luoyuxia updated FLINK-27338:
-----------------------------
    Summary: Improve splitting file for Hive table with orc format  (was: Improve splitting file for Hive source)

> Improve splitting file for Hive table with orc format
> -----------------------------------------------------
>
>                 Key: FLINK-27338
>                 URL: https://issues.apache.org/jira/browse/FLINK-27338
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Connectors / Hive
>            Reporter: luoyuxia
>            Assignee: luoyuxia
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.16.0
>
>
> Currently, the Hive source uses the HDFS block size, configured with the key
> dfs.block.size in hdfs-site.xml, as the maximum split size when splitting
> files. The default value is usually 128 MB or 256 MB, depending on the
> configuration.
> This splitting strategy is not ideal: the number of splits tends to be small,
> so the job cannot make good use of parallel computing.
> What's more, when parallelism inference is enabled for the Hive source, the
> source parallelism is set to the number of splits whenever that number does
> not exceed the maximum parallelism. This limits the source parallelism and
> can degrade performance.
> To solve this problem, the idea is to calculate a reasonable split size based
> on the files' total size, the block size, and the default parallelism or the
> parallelism configured by the user.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
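The proposed strategy can be sketched as follows. This is a hypothetical illustration, not Flink's actual implementation: the method name `computeSplitSize`, the `minSplitSize` parameter, and the ceiling-division heuristic are assumptions made for the example. The idea is to aim for roughly one split per parallel task, capped by the block size and floored by a configured minimum:

```java
// Hypothetical sketch (NOT Flink's actual code): derive a split size from the
// files' total size, the HDFS block size, and the target parallelism.
public class SplitSizeSketch {

    static long computeSplitSize(
            long totalSize, long blockSize, int parallelism, long minSplitSize) {
        // Aim for at least `parallelism` splits (ceiling division), but never
        // exceed the block size and never go below the configured minimum.
        long sizePerTask = (totalSize + parallelism - 1) / parallelism;
        return Math.max(minSplitSize, Math.min(blockSize, sizePerTask));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 10 GB of ORC files, 128 MB block size, parallelism 128, 16 MB minimum:
        // with the old strategy (split size = block size) this yields only
        // 10240 / 128 = 80 splits; here the split size shrinks to 80 MB,
        // producing ~128 splits that match the parallelism.
        long split = computeSplitSize(10L * 1024L * mb, 128L * mb, 128, 16L * mb);
        System.out.println(split / mb); // prints 80
    }
}
```

With a fixed block-size-based split size, a moderately sized table may produce fewer splits than the configured parallelism, leaving tasks idle; scaling the split size to the total size divided by the parallelism avoids that while the block-size cap keeps splits HDFS-friendly.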