[jira] [Comment Edited] (FLINK-10815) Rethink the rescale operation, can we do it async

Shimin Yang (JIRA) Wed, 07 Nov 2018 08:20:58 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678456#comment-16678456
 ]


Shimin Yang edited comment on FLINK-10815 at 11/7/18 4:19 PM:
--------------------------------------------------------------

[~till.rohrmann], I would like to hear your opinion before diving into more 
detail. Since I am not very sure whether this is a nice feature and is it 
worthy to work on this. Currently, it's more like a proposal.


was (Author: dangdangdang):
[~till.rohrmann], I would like to hear your opinion before diving into more 
detail. Since I am not very sure whether this is a nice feature and is it 
worthy to work on this.

> Rethink the rescale operation, can we do it async
> -------------------------------------------------
>
>                 Key: FLINK-10815
>                 URL: https://issues.apache.org/jira/browse/FLINK-10815
>             Project: Flink
>          Issue Type: Improvement
>          Components: ResourceManager, Scheduler
>            Reporter: Shimin Yang
>            Assignee: Shimin Yang
>            Priority: Major
>
> Currently, the rescale operation is to stop the whole job and restart it with 
> different parrellism. But the rescale operation cost a lot and took lots of 
> time to recover if the state size is quite big. 
> And a long-time rescale might cause other problems like latency increase and 
> back pressure. For some circumstances like a streaming computing cloud 
> service, users may be very sensitive to latency and resource usage. So it 
> would be better to make the rescale a cheaper operation.
> I wonder if we could make it an async operation just like checkpoint. But how 
> to deal with the keyed state would be a pain in the ass. Currently I just 
> want to make some assumption to make things simpler. The asnyc rescale 
> operation can only double the parrellism or make it half.
> In the scale up circumstance, we can copy the state to the newly created 
> worker and change the partitioner of the upstream. The best timing might be 
> get notified of checkpoint completed. But we still need to change the 
> partitioner of upstream. So the upstream should buffer the result or block 
> the computation util the state copy finished. Then make the partitioner to 
> send differnt elements with the same key to the same downstream operator.
> In the scale down circumstance, we can merge the keyed state of two operators 
> and also change the partitioner of upstream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (FLINK-10815) Rethink the rescale operation, can we do it async

Reply via email to