luoyuxia created FLINK-27338:
--------------------------------
Summary: Improve file splitting for the Hive source
Key: FLINK-27338
URL: https://issues.apache.org/jira/browse/FLINK-27338
Project: Flink
Issue Type: Improvement
Components: Connectors / Hive
Reporter: luoyuxia
Currently, the Hive source uses the HDFS block size (configured via
dfs.block.size) as the maximum split size when splitting files. The default is
usually 128 MB or 256 MB, depending on the cluster configuration.
This splitting strategy is not ideal: each split tends to be large, so the job
cannot make good use of parallel computation.
What's more, when parallelism inference is enabled for the Hive source, the
source parallelism is set to the number of splits as long as that number does
not exceed the maximum parallelism. A small number of large splits therefore
limits the source parallelism and can degrade performance.
To solve this problem, the idea is to calculate a reasonable split size based
on the files' total size, the block size, and the default parallelism or the
parallelism configured by the user.
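A minimal sketch of what such a calculation could look like, assuming the goal
is roughly one split per parallel subtask. The class and method names, the
1 MiB floor, and the clamping policy are all illustrative assumptions, not the
actual Flink implementation:

```java
// Hypothetical sketch of the proposed split-size calculation.
// All names and defaults here are assumptions for illustration only.
public class SplitSizeCalculator {

    // Assumed lower bound to avoid producing many tiny splits.
    private static final long MIN_SPLIT_SIZE = 1L << 20; // 1 MiB

    /**
     * Derives a split size from the files' total size and the desired
     * parallelism, clamped to the HDFS block size as an upper bound
     * (to preserve data locality) and a small floor as a lower bound.
     */
    public static long calculateSplitSize(
            long totalFileSize, long blockSize, int parallelism) {
        // Aim for roughly one split per parallel subtask (ceiling division).
        long target = (totalFileSize + parallelism - 1) / parallelism;
        // Clamp: never exceed the block size, never go below the floor.
        return Math.max(MIN_SPLIT_SIZE, Math.min(target, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20; // 128 MiB
        // 1 GiB of input with parallelism 64 yields 16 MiB splits,
        // instead of the block-size-bound 128 MiB splits used today.
        System.out.println(
                calculateSplitSize(1L << 30, blockSize, 64)); // 16777216
    }
}
```

With today's strategy the same 1 GiB input would produce only 8 splits of
128 MiB each, capping the inferred source parallelism at 8 even when 64 slots
are available.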
--
This message was sent by Atlassian Jira
(v8.20.7#820007)