[jira] [Commented] (FLINK-10815) Rethink the rescale operation, can we do it async

2018-11-07 Thread Shimin Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16679185#comment-16679185
 ] 

Shimin Yang commented on FLINK-10815:
-

Thanks [~dawidwys]. I will close this issue.

> Rethink the rescale operation, can we do it async
> -
>
> Key: FLINK-10815
> URL: https://issues.apache.org/jira/browse/FLINK-10815
> Project: Flink
>  Issue Type: Improvement
>  Components: ResourceManager, Scheduler
>Reporter: Shimin Yang
>Assignee: Shimin Yang
>Priority: Major
>
> Currently, the rescale operation is to stop the whole job and restart it with 
> different parrellism. But the rescale operation cost a lot and took lots of 
> time to recover if the state size is quite big. 
> And a long-time rescale might cause other problems like latency increase and 
> back pressure. For some circumstances like a streaming computing cloud 
> service, users may be very sensitive to latency and resource usage. So it 
> would be better to make the rescale a cheaper operation.
> I wonder if we could make it an async operation just like checkpoint. But how 
> to deal with the keyed state would be a pain in the ass. Currently I just 
> want to make some assumption to make things simpler. The asnyc rescale 
> operation can only double the parrellism or make it half.
> In the scale up circumstance, we can copy the state to the newly created 
> worker and change the partitioner of the upstream. The best timing might be 
> get notified of checkpoint completed. But we still need to change the 
> partitioner of upstream. So the upstream should buffer the result or block 
> the computation util the state copy finished. Then make the partitioner to 
> send differnt elements with the same key to the same downstream operator.
> In the scale down circumstance, we can merge the keyed state of two operators 
> and also change the partitioner of upstream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10815) Rethink the rescale operation, can we do it async

2018-11-07 Thread Dawid Wysakowicz (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678468#comment-16678468
 ] 

Dawid Wysakowicz commented on FLINK-10815:
--

I think the correct way to proceed with such proposal would be to post it on 
dev mailing list with [DISCUSS] tag.

> Rethink the rescale operation, can we do it async
> -
>
> Key: FLINK-10815
> URL: https://issues.apache.org/jira/browse/FLINK-10815
> Project: Flink
>  Issue Type: Improvement
>  Components: ResourceManager, Scheduler
>Reporter: Shimin Yang
>Assignee: Shimin Yang
>Priority: Major
>
> Currently, the rescale operation is to stop the whole job and restart it with 
> different parrellism. But the rescale operation cost a lot and took lots of 
> time to recover if the state size is quite big. 
> And a long-time rescale might cause other problems like latency increase and 
> back pressure. For some circumstances like a streaming computing cloud 
> service, users may be very sensitive to latency and resource usage. So it 
> would be better to make the rescale a cheaper operation.
> I wonder if we could make it an async operation just like checkpoint. But how 
> to deal with the keyed state would be a pain in the ass. Currently I just 
> want to make some assumption to make things simpler. The asnyc rescale 
> operation can only double the parrellism or make it half.
> In the scale up circumstance, we can copy the state to the newly created 
> worker and change the partitioner of the upstream. The best timing might be 
> get notified of checkpoint completed. But we still need to change the 
> partitioner of upstream. So the upstream should buffer the result or block 
> the computation util the state copy finished. Then make the partitioner to 
> send differnt elements with the same key to the same downstream operator.
> In the scale down circumstance, we can merge the keyed state of two operators 
> and also change the partitioner of upstream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10815) Rethink the rescale operation, can we do it async

2018-11-07 Thread Shimin Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678456#comment-16678456
 ] 

Shimin Yang commented on FLINK-10815:
-

[~till.rohrmann], I would like to hear your opinion before diving into more 
detail. Since I am not very sure whether this is a nice feature and is it 
worthy to work on this.

> Rethink the rescale operation, can we do it async
> -
>
> Key: FLINK-10815
> URL: https://issues.apache.org/jira/browse/FLINK-10815
> Project: Flink
>  Issue Type: Improvement
>  Components: ResourceManager, Scheduler
>Reporter: Shimin Yang
>Assignee: Shimin Yang
>Priority: Major
>
> Currently, the rescale operation is to stop the whole job and restart it with 
> different parrellism. But the rescale operation cost a lot and took lots of 
> time to recover if the state size is quite big. 
> And a long-time rescale might cause other problems like latency increase and 
> back pressure. For some circumstances like a streaming computing cloud 
> service, users may be very sensitive to latency and resource usage. So it 
> would be better to make the rescale a cheaper operation.
> I wonder if we could make it an async operation just like checkpoint. But how 
> to deal with the keyed state would be a pain in the ass. Currently I just 
> want to make some assumption to make things simpler. The asnyc rescale 
> operation can only double the parrellism or make it half.
> In the scale up circumstance, we can copy the state to the newly created 
> worker and change the partitioner of the upstream. The best timing might be 
> get notified of checkpoint completed. But we still need to change the 
> partitioner of upstream. So the upstream should buffer the result or block 
> the computation util the state copy finished. Then make the partitioner to 
> send differnt elements with the same key to the same downstream operator.
> In the scale down circumstance, we can merge the keyed state of two operators 
> and also change the partitioner of upstream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)