[ https://issues.apache.org/jira/browse/FLINK-27338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

luoyuxia updated FLINK-27338:
-----------------------------
    Description: 
Currently, the Hive source uses the HDFS block size (configured with the key 
dfs.block.size in hdfs-site.xml) as the maximum split size when splitting files. 
The default value is usually 128 MB or 256 MB, depending on the configuration.

This splitting strategy is not ideal: the number of splits tends to be small, 
so the job can't make good use of parallel computing.

What's more, when parallelism inference is enabled for the Hive source, the 
source parallelism is set to the number of splits whenever that number does not 
exceed the max parallelism. A small number of splits therefore limits the 
source parallelism and can degrade performance.

To solve this problem, the idea is to calculate a reasonable split size based 
on the files' total size, the block size, and the default parallelism or the 
parallelism configured by the user.

This Jira aims to improve the file-splitting logic for Hive tables in ORC 
format.
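The calculation described above could be sketched as follows. This is a minimal illustration, not Flink's actual implementation; the class and method names (SplitSizeSketch, computeSplitSize) and the 16 MB floor are hypothetical:

```java
// Hypothetical sketch: derive a split size from the files' total size,
// the HDFS block size, and the desired parallelism, so that there are
// enough splits to feed all parallel source tasks.
public class SplitSizeSketch {

    static long computeSplitSize(long totalFileSize, long blockSize, int parallelism) {
        // Bytes each task would read if the files were divided evenly (ceiling division).
        long bytesPerTask = (totalFileSize + parallelism - 1) / parallelism;
        // Cap at the block size so a split never spans HDFS blocks, and keep a
        // floor so splits don't become pathologically small (16 MB, illustrative).
        long minSplitSize = 16L * 1024 * 1024;
        return Math.max(minSplitSize, Math.min(blockSize, bytesPerTask));
    }

    public static void main(String[] args) {
        // 10 GB of files, 128 MB block size, 200 parallel tasks:
        // splitting by block size alone would yield only ~80 splits,
        // leaving 120 tasks idle; the derived size yields ~200 splits.
        long size = computeSplitSize(10L * 1024 * 1024 * 1024, 128L * 1024 * 1024, 200);
        System.out.println(size); // prints 53687092 (about 51 MB, below the block size)
    }
}
```

With only 10 parallel tasks, the same formula falls back to the 128 MB block size, so the existing behavior is preserved when the default splits already saturate the parallelism.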



> Improve splitting file for Hive table with orc format
> -----------------------------------------------------
>
>                 Key: FLINK-27338
>                 URL: https://issues.apache.org/jira/browse/FLINK-27338
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Connectors / Hive
>            Reporter: luoyuxia
>            Assignee: luoyuxia
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.16.0
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
