[ https://issues.apache.org/jira/browse/FLINK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678456#comment-16678456 ]
Shimin Yang edited comment on FLINK-10815 at 11/7/18 4:19 PM: -------------------------------------------------------------- [~till.rohrmann], I would like to hear your opinion before diving into more detail. Since I am not very sure whether this is a nice feature and is it worthy to work on this. Currently, it's more like a proposal. was (Author: dangdangdang): [~till.rohrmann], I would like to hear your opinion before diving into more detail. Since I am not very sure whether this is a nice feature and is it worthy to work on this. > Rethink the rescale operation, can we do it async > ------------------------------------------------- > > Key: FLINK-10815 > URL: https://issues.apache.org/jira/browse/FLINK-10815 > Project: Flink > Issue Type: Improvement > Components: ResourceManager, Scheduler > Reporter: Shimin Yang > Assignee: Shimin Yang > Priority: Major > > Currently, the rescale operation is to stop the whole job and restart it with > different parrellism. But the rescale operation cost a lot and took lots of > time to recover if the state size is quite big. > And a long-time rescale might cause other problems like latency increase and > back pressure. For some circumstances like a streaming computing cloud > service, users may be very sensitive to latency and resource usage. So it > would be better to make the rescale a cheaper operation. > I wonder if we could make it an async operation just like checkpoint. But how > to deal with the keyed state would be a pain in the ass. Currently I just > want to make some assumption to make things simpler. The asnyc rescale > operation can only double the parrellism or make it half. > In the scale up circumstance, we can copy the state to the newly created > worker and change the partitioner of the upstream. The best timing might be > get notified of checkpoint completed. But we still need to change the > partitioner of upstream. So the upstream should buffer the result or block > the computation util the state copy finished. Then make the partitioner to > send differnt elements with the same key to the same downstream operator. > In the scale down circumstance, we can merge the keyed state of two operators > and also change the partitioner of upstream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)