[ 
https://issues.apache.org/jira/browse/FLINK-12002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoWang updated FLINK-12002:
---------------------------
    Description: 
In Flink, the parallelism of a job is a pre-specified parameter. It is usually 
an empirical value and thus may not be optimal for either performance or 
resource usage, depending on the amount of data processed in each task.

Furthermore, a fixed parallelism cannot adapt to the varying data sizes that 
are common in production clusters, where configurations are rarely changed.

We propose to determine the job parallelism adaptively, based on the actual 
total input data size and an ideal data size to be processed by each task. The 
ideal size is pre-specified according to the properties of the operator, such 
as its preparation overhead relative to its data processing time.
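
To make the calculation concrete, a minimal sketch in Java follows (the class 
and method names are illustrative only, not existing Flink API): the 
parallelism of a Reduce vertex is roughly ceil(totalInputBytes / 
idealBytesPerTask), clamped to a configured maximum such as the number of 
Map-side buckets.

{code:java}
/**
 * Hypothetical sketch only (not existing Flink code): derive the parallelism
 * of a downstream (Reduce) vertex from the actual size of its finished
 * upstream (Map) output and a pre-configured ideal data size per task.
 */
public final class AdaptiveParallelismSketch {

    private AdaptiveParallelismSketch() {}

    /**
     * @param totalInputBytes   bytes actually produced by the upstream stage
     * @param idealBytesPerTask ideal data size handled by one Reduce task
     * @param maxParallelism    upper bound, e.g. the number of Map output buckets
     */
    public static int decideParallelism(
            long totalInputBytes, long idealBytesPerTask, int maxParallelism) {
        if (idealBytesPerTask <= 0 || maxParallelism <= 0) {
            throw new IllegalArgumentException(
                    "idealBytesPerTask and maxParallelism must be positive");
        }
        // One task per "ideal" chunk of data, rounded up, clamped to [1, maxParallelism].
        long tasks = (totalInputBytes + idealBytesPerTask - 1) / idealBytesPerTask;
        return (int) Math.min(Math.max(tasks, 1L), maxParallelism);
    }
}
{code}

For example, 10 GB of Map output with an ideal size of 128 MB per task would 
yield 80 Reduce tasks instead of a fixed, pre-configured number.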

Our basic idea of "split and merge" is to dispatch data evenly across Reducers 
by splitting and/or merging the data buckets produced by Map; the data density 
(key) skew problem is not covered here. This kind of parallelism adjustment 
raises no data correctness issues, since it does not break the condition that 
data with the same key is processed by a single task. We determine the proper 
parallelism of Reduce during scheduling, before it actually runs and after its 
input (though not necessarily its total input data) is ready; in this context, 
adaptive parallelism is a better name. We think this scheduling improvement can 
benefit both batch and streaming jobs, as long as we can obtain some clues 
about the input data.
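
A minimal sketch of the merge side of this idea, assuming the Map stage can 
report the size of each bucket it produced (again, illustrative only, not 
existing Flink code): consecutive buckets are merged into ranges of roughly the 
ideal size, and each range is consumed by exactly one Reduce task, so no bucket 
(and hence no key) is split across tasks.

{code:java}
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch only (not existing Flink code): merge consecutive
 * Map-side data buckets into ranges of roughly the ideal size; each range
 * is read by exactly one Reduce task, so all records of a key stay together.
 */
public final class BucketRangeAssignerSketch {

    private BucketRangeAssignerSketch() {}

    /** Half-open range [startBucket, endBucket) of Map buckets for one Reduce task. */
    public static final class BucketRange {
        public final int startBucket;
        public final int endBucket;

        BucketRange(int startBucket, int endBucket) {
            this.startBucket = startBucket;
            this.endBucket = endBucket;
        }
    }

    /**
     * @param bucketSizes       bytes in each bucket produced by the Map stage
     * @param idealBytesPerTask ideal data size handled by one Reduce task
     */
    public static List<BucketRange> assign(long[] bucketSizes, long idealBytesPerTask) {
        List<BucketRange> ranges = new ArrayList<>();
        int rangeStart = 0;
        long accumulated = 0L;
        for (int i = 0; i < bucketSizes.length; i++) {
            accumulated += bucketSizes[i];
            // Close the current range once it holds the ideal amount of data,
            // unless it is the last bucket (the final range is closed below).
            if (accumulated >= idealBytesPerTask && i + 1 < bucketSizes.length) {
                ranges.add(new BucketRange(rangeStart, i + 1));
                rangeStart = i + 1;
                accumulated = 0L;
            }
        }
        if (rangeStart < bucketSizes.length) {
            ranges.add(new BucketRange(rangeStart, bucketSizes.length));
        }
        return ranges;
    }
}
{code}

Splitting is the dual direction mentioned above: the Map stage produces more, 
finer-grained buckets than the default parallelism so that data can later be 
spread over more Reduce tasks; because buckets remain key-disjoint, the 
same-key-single-task condition still holds.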

Design doc: https://docs.google.com/document/d/1ZxnoJ3SOxUk1PL2xC1t-kepq28Pg20IL6eVKUUOWSKY/edit?usp=sharing

  was:
In Flink, the parallelism of a job is a pre-specified parameter. It is usually 
an empirical value and thus may not be optimal for either performance or 
resource usage, depending on the amount of data processed in each task.

Furthermore, a fixed parallelism cannot adapt to the varying data sizes that 
are common in production clusters, where configurations are rarely changed.

We propose to determine the job parallelism adaptively, based on the actual 
total input data size and an ideal data size to be processed by each task. The 
ideal size is pre-specified according to the properties of the operator, such 
as its preparation overhead relative to its data processing time.

Our basic idea of "split and merge" is to dispatch data evenly across Reducers 
by splitting and/or merging the data buckets produced by Map; the data density 
(key) skew problem is not covered here. This kind of parallelism adjustment 
raises no data correctness issues, since it does not break the condition that 
data with the same key is processed by a single task. We determine the proper 
parallelism of Reduce during scheduling, before it actually runs and after its 
input (though not necessarily its total input data) is ready; in this context, 
adaptive parallelism is a better name. We think this scheduling improvement can 
benefit both batch and streaming jobs, as long as we can obtain some clues 
about the input data.

 

detailed design doc coming soon.

 

 


> Adaptive Parallelism of Job Vertex Execution
> --------------------------------------------
>
>                 Key: FLINK-12002
>                 URL: https://issues.apache.org/jira/browse/FLINK-12002
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: ryantaocer
>            Assignee: BoWang
>            Priority: Major
>
> In Flink, the parallelism of a job is a pre-specified parameter. It is 
> usually an empirical value and thus may not be optimal for either performance 
> or resource usage, depending on the amount of data processed in each task.
> Furthermore, a fixed parallelism cannot adapt to the varying data sizes that 
> are common in production clusters, where configurations are rarely changed.
> We propose to determine the job parallelism adaptively, based on the actual 
> total input data size and an ideal data size to be processed by each task. 
> The ideal size is pre-specified according to the properties of the operator, 
> such as its preparation overhead relative to its data processing time.
> Our basic idea of "split and merge" is to dispatch data evenly across 
> Reducers by splitting and/or merging the data buckets produced by Map; the 
> data density (key) skew problem is not covered here. This kind of parallelism 
> adjustment raises no data correctness issues, since it does not break the 
> condition that data with the same key is processed by a single task. We 
> determine the proper parallelism of Reduce during scheduling, before it 
> actually runs and after its input (though not necessarily its total input 
> data) is ready; in this context, adaptive parallelism is a better name. We 
> think this scheduling improvement can benefit both batch and streaming jobs, 
> as long as we can obtain some clues about the input data.
>  Design doc: 
> https://docs.google.com/document/d/1ZxnoJ3SOxUk1PL2xC1t-kepq28Pg20IL6eVKUUOWSKY/edit?usp=sharing
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
