[jira] [Updated] (FLINK-27696) Add bin-pack strategy to split the whole bucket data files into several small splits for append-only table.

Jingsong Lee (Jira) Sun, 19 Jun 2022 20:20:07 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-27696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jingsong Lee updated FLINK-27696:
---------------------------------
    Description: 
We don't have to assign all the data files of a bucket to a single task. 
Instead, we can use an algorithm (such as bin packing) to split a bucket's 
data files into multiple splits and improve job parallelism.

For a merge tree table:
Suppose there are files with these key ranges: [1, 2] [3, 4] [5, 180] 
[5, 190] [200, 600] [210, 700]
Files whose key ranges do not intersect are unrelated, so we do not need to 
put all of them into one split. We can slice them into multiple splits, and 
reading those splits in parallel is faster. We should not slice too finely 
either: each split should be as large as possible, up to 128 MB. So we use 
BinPack to slice, and the final result will be:
 * split1: [1, 2] [3, 4]
 * split2: [5, 180] [5, 190]
 * split3: [200, 600] [210, 700]
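
The slicing described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code from the pull request: the names `DataFile`, `group_overlapping` and `pack_sections` are hypothetical, and the per-file sizes are assumed values chosen so that the result matches the three splits listed above (the issue gives only key ranges, not sizes). Files with intersecting key ranges are first grouped into sections that must stay together, then the sections are packed in key order into splits of at most 128 MB.

```python
from dataclasses import dataclass
from typing import List

TARGET_SPLIT_SIZE_MB = 128  # target split size taken from the description


@dataclass(frozen=True)
class DataFile:
    min_key: int
    max_key: int
    size_mb: int  # assumed value; the issue does not state file sizes


def group_overlapping(files: List[DataFile]) -> List[List[DataFile]]:
    """Group files whose key ranges intersect; each group must stay in one split."""
    sections: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_max = None
    for f in sorted(files, key=lambda f: (f.min_key, f.max_key)):
        # A file that starts after everything seen so far opens a new section.
        if current and f.min_key > current_max:
            sections.append(current)
            current, current_max = [], None
        current.append(f)
        current_max = f.max_key if current_max is None else max(current_max, f.max_key)
    if current:
        sections.append(current)
    return sections


def pack_sections(sections: List[List[DataFile]],
                  target_mb: int = TARGET_SPLIT_SIZE_MB) -> List[List[DataFile]]:
    """Pack sections in order, closing a split when the next section would exceed the target."""
    splits: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_size = 0
    for section in sections:
        section_size = sum(f.size_mb for f in section)
        if current and current_size + section_size > target_mb:
            splits.append(current)
            current, current_size = [], 0
        # A section larger than the target still becomes a split of its own.
        current.extend(section)
        current_size += section_size
    if current:
        splits.append(current)
    return splits


# Key ranges from the description; sizes (MB) are assumptions for illustration.
files = [
    DataFile(1, 2, 10), DataFile(3, 4, 20),
    DataFile(5, 180, 60), DataFile(5, 190, 60),
    DataFile(200, 600, 64), DataFile(210, 700, 64),
]
splits = pack_sections(group_overlapping(files))
for i, split in enumerate(splits, start=1):
    print(f"split{i}:", [(f.min_key, f.max_key) for f in split])
# split1: [(1, 2), (3, 4)]
# split2: [(5, 180), (5, 190)]
# split3: [(200, 600), (210, 700)]
```

Packing in key order keeps each split a contiguous key range. With other file sizes the same input could produce a different number of splits.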

  was: For append-only table, we don't have to assign each task with a whole 
bucket data files. Instead, we can use some algorithm (such as bin-packing) 
to split the whole bucket data files into multiple fragments to improve the 
job parallelism.


> Add bin-pack strategy to split the whole bucket data files into several small 
> splits for append-only table.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-27696
>                 URL: https://issues.apache.org/jira/browse/FLINK-27696
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: Zheng Hu
>            Assignee: Jingsong Lee
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: table-store-0.2.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
