[jira] [Commented] (FLINK-10815) Rethink the rescale operation, can we do it async
[ https://issues.apache.org/jira/browse/FLINK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679185#comment-16679185 ] Shimin Yang commented on FLINK-10815: - Thanks [~dawidwys]. I will close this issue. > Rethink the rescale operation, can we do it async > - > > Key: FLINK-10815 > URL: https://issues.apache.org/jira/browse/FLINK-10815 > Project: Flink > Issue Type: Improvement > Components: ResourceManager, Scheduler >Reporter: Shimin Yang >Assignee: Shimin Yang >Priority: Major > > Currently, the rescale operation is to stop the whole job and restart it with > different parrellism. But the rescale operation cost a lot and took lots of > time to recover if the state size is quite big. > And a long-time rescale might cause other problems like latency increase and > back pressure. For some circumstances like a streaming computing cloud > service, users may be very sensitive to latency and resource usage. So it > would be better to make the rescale a cheaper operation. > I wonder if we could make it an async operation just like checkpoint. But how > to deal with the keyed state would be a pain in the ass. Currently I just > want to make some assumption to make things simpler. The asnyc rescale > operation can only double the parrellism or make it half. > In the scale up circumstance, we can copy the state to the newly created > worker and change the partitioner of the upstream. The best timing might be > get notified of checkpoint completed. But we still need to change the > partitioner of upstream. So the upstream should buffer the result or block > the computation util the state copy finished. Then make the partitioner to > send differnt elements with the same key to the same downstream operator. > In the scale down circumstance, we can merge the keyed state of two operators > and also change the partitioner of upstream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10815) Rethink the rescale operation, can we do it async
[ https://issues.apache.org/jira/browse/FLINK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678468#comment-16678468 ] Dawid Wysakowicz commented on FLINK-10815: -- I think the correct way to proceed with such proposal would be to post it on dev mailing list with [DISCUSS] tag. > Rethink the rescale operation, can we do it async > - > > Key: FLINK-10815 > URL: https://issues.apache.org/jira/browse/FLINK-10815 > Project: Flink > Issue Type: Improvement > Components: ResourceManager, Scheduler >Reporter: Shimin Yang >Assignee: Shimin Yang >Priority: Major > > Currently, the rescale operation is to stop the whole job and restart it with > different parrellism. But the rescale operation cost a lot and took lots of > time to recover if the state size is quite big. > And a long-time rescale might cause other problems like latency increase and > back pressure. For some circumstances like a streaming computing cloud > service, users may be very sensitive to latency and resource usage. So it > would be better to make the rescale a cheaper operation. > I wonder if we could make it an async operation just like checkpoint. But how > to deal with the keyed state would be a pain in the ass. Currently I just > want to make some assumption to make things simpler. The asnyc rescale > operation can only double the parrellism or make it half. > In the scale up circumstance, we can copy the state to the newly created > worker and change the partitioner of the upstream. The best timing might be > get notified of checkpoint completed. But we still need to change the > partitioner of upstream. So the upstream should buffer the result or block > the computation util the state copy finished. Then make the partitioner to > send differnt elements with the same key to the same downstream operator. > In the scale down circumstance, we can merge the keyed state of two operators > and also change the partitioner of upstream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10815) Rethink the rescale operation, can we do it async
[ https://issues.apache.org/jira/browse/FLINK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678456#comment-16678456 ] Shimin Yang commented on FLINK-10815: - [~till.rohrmann], I would like to hear your opinion before diving into more detail. Since I am not very sure whether this is a nice feature and is it worthy to work on this. > Rethink the rescale operation, can we do it async > - > > Key: FLINK-10815 > URL: https://issues.apache.org/jira/browse/FLINK-10815 > Project: Flink > Issue Type: Improvement > Components: ResourceManager, Scheduler >Reporter: Shimin Yang >Assignee: Shimin Yang >Priority: Major > > Currently, the rescale operation is to stop the whole job and restart it with > different parrellism. But the rescale operation cost a lot and took lots of > time to recover if the state size is quite big. > And a long-time rescale might cause other problems like latency increase and > back pressure. For some circumstances like a streaming computing cloud > service, users may be very sensitive to latency and resource usage. So it > would be better to make the rescale a cheaper operation. > I wonder if we could make it an async operation just like checkpoint. But how > to deal with the keyed state would be a pain in the ass. Currently I just > want to make some assumption to make things simpler. The asnyc rescale > operation can only double the parrellism or make it half. > In the scale up circumstance, we can copy the state to the newly created > worker and change the partitioner of the upstream. The best timing might be > get notified of checkpoint completed. But we still need to change the > partitioner of upstream. So the upstream should buffer the result or block > the computation util the state copy finished. Then make the partitioner to > send differnt elements with the same key to the same downstream operator. > In the scale down circumstance, we can merge the keyed state of two operators > and also change the partitioner of upstream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)