[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart

2019-12-08 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991173#comment-16991173
 ] 

Zhu Zhu commented on FLINK-13056:
-

postpone this improvement to 1.11.

> Optimize region failover performance on calculating vertices to restart
> ---
>
> Key: FLINK-13056
> URL: https://issues.apache.org/jira/browse/FLINK-13056
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zhu Zhu
>Assignee: Zhu Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently some region boundary structures are calculated each time of a 
> region failover. This calculation can be heavy as its complexity goes up with 
> execution edge count.
> We tested it in a sample case with 8000 vertices and 16,000,000 edges. It 
> takes ~2.0s to calculate vertices to restart.
> (more details in 
> [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)]
> That's why we'd propose to cache the region boundary structures to improve 
> the region failover performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart

2019-09-17 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931427#comment-16931427
 ] 

Till Rohrmann commented on FLINK-13056:
---

This sounds very promising [~zhuzh]. Let's try to get it in for Flink 1.10. The 
great thing is that it does not affect any other existing Flink components and 
is self-contained. This should make it easier to merge.

> Optimize region failover performance on calculating vertices to restart
> ---
>
> Key: FLINK-13056
> URL: https://issues.apache.org/jira/browse/FLINK-13056
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Zhu Zhu
>Assignee: Zhu Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently some region boundary structures are calculated each time of a 
> region failover. This calculation can be heavy as its complexity goes up with 
> execution edge count.
> We tested it in a sample case with 8000 vertices and 16,000,000 edges. It 
> takes ~2.0s to calculate vertices to restart.
> (more details in 
> [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)]
> That's why we'd propose to cache the region boundary structures to improve 
> the region failover performance.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart

2019-09-16 Thread Zhu Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930446#comment-16930446
 ] 

Zhu Zhu commented on FLINK-13056:
-

Hi [~till.rohrmann], a FastRestartPipelinedRegionStrategy is added. 
I compared the performance for the 2 most common scenarios as well. The result 
is updated in [this 
doc|https://docs.google.com/document/d/1-QLxe4FXqXBuxlYsNmNU-R21euoTkzk1JAS6Lvrd-F4/edit#].
 

I think it is very beneficial for streaming jobs which has only pipelined 
edges. The cost to cache boundary is 0 and the failover processing time reduces 
significantly to almost 0 ms. The only cost is a bit more time to build the 
regions.
For jobs with blocking edges, the failover is still much faster. 

I also applied a local patch to make the new strategy work with the adapted 
region failover strategy (see test case 4). Using current strategy the failover 
decision time cost is ~1800ms for a 2000 parallelism job with a simple topology 
like {A--(pipelined)-->B}. Using the new strategy the failover decision time 
cost is ~0s.

So I think making the faster version as default for streaming jobs should be a 
better choice.


> Optimize region failover performance on calculating vertices to restart
> ---
>
> Key: FLINK-13056
> URL: https://issues.apache.org/jira/browse/FLINK-13056
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Zhu Zhu
>Assignee: Zhu Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently some region boundary structures are calculated each time of a 
> region failover. This calculation can be heavy as its complexity goes up with 
> execution edge count.
> We tested it in a sample case with 8000 vertices and 16,000,000 edges. It 
> takes ~2.0s to calculate vertices to restart.
> (more details in 
> [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)]
> That's why we'd propose to cache the region boundary structures to improve 
> the region failover performance.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart

2019-08-16 Thread Zhu Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909096#comment-16909096
 ] 

Zhu Zhu commented on FLINK-13056:
-

The diff can be found at 
[https://github.com/zhuzhurk/flink/commit/4f7da57b218e9ccd86f468f9ece62ee1e378ceda].

Need to mention that this diff is based on the initial version of 
flip1.RestartPipelinedRegionStrategy. So it cannot be applied to latest 
flip1.RestartPipelinedRegionStrategy directly, as the region building was 
refactored out from it later(for partition releasing).

The perf test case(RegionFailoverPerfTest#complexPerfTest) used can be found in 
the same branch.

> Optimize region failover performance on calculating vertices to restart
> ---
>
> Key: FLINK-13056
> URL: https://issues.apache.org/jira/browse/FLINK-13056
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Zhu Zhu
>Assignee: Zhu Zhu
>Priority: Major
>
> Currently some region boundary structures are calculated each time of a 
> region failover. This calculation can be heavy as its complexity goes up with 
> execution edge count.
> We tested it in a sample case with 8000 vertices and 16,000,000 edges. It 
> takes ~2.0s to calculate vertices to restart.
> (more details in 
> [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)]
> That's why we'd propose to cache the region boundary structures to improve 
> the region failover performance.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart

2019-08-16 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909021#comment-16909021
 ] 

Till Rohrmann commented on FLINK-13056:
---

How exactly would it change the failover region computation? Maybe you could 
share some code examples. Maybe we could make it pluggable so that the user can 
choose which version she prefers.

> Optimize region failover performance on calculating vertices to restart
> ---
>
> Key: FLINK-13056
> URL: https://issues.apache.org/jira/browse/FLINK-13056
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Zhu Zhu
>Assignee: Zhu Zhu
>Priority: Major
>
> Currently some region boundary structures are calculated each time of a 
> region failover. This calculation can be heavy as its complexity goes up with 
> execution edge count.
> We tested it in a sample case with 8000 vertices and 16,000,000 edges. It 
> takes ~2.0s to calculate vertices to restart.
> (more details in 
> [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)]
> That's why we'd propose to cache the region boundary structures to improve 
> the region failover performance.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart

2019-08-16 Thread Zhu Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909001#comment-16909001
 ] 

Zhu Zhu commented on FLINK-13056:
-

[~till.rohrmann] we tried to optimize it, at the cost of more data cached and 
slowing down the region building time.

Taking the sample case with 8000 vertices and 16,000,000 edges as an example.

The failover time reduced from 1961ms to 110ms.

The region building time increases from 523ms to 5681ms as a side effect.

[https://docs.google.com/document/d/1-QLxe4FXqXBuxlYsNmNU-R21euoTkzk1JAS6Lvrd-F4/edit]

> Optimize region failover performance on calculating vertices to restart
> ---
>
> Key: FLINK-13056
> URL: https://issues.apache.org/jira/browse/FLINK-13056
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Zhu Zhu
>Assignee: Zhu Zhu
>Priority: Major
>
> Currently some region boundary structures are calculated each time of a 
> region failover. This calculation can be heavy as its complexity goes up with 
> execution edge count.
> We tested it in a sample case with 8000 vertices and 16,000,000 edges. It 
> takes ~2.0s to calculate vertices to restart.
> (more details in 
> [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)]
> That's why we'd propose to cache the region boundary structures to improve 
> the region failover performance.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart

2019-08-16 Thread Chesnay Schepler (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908955#comment-16908955
 ] 

Chesnay Schepler commented on FLINK-13056:
--

I'm removing this as a subtask of FLINK-4256 so we can mark that one as 
finished (since it's weird if a FLIP is marked as release but the JIRA still in 
progress).

> Optimize region failover performance on calculating vertices to restart
> ---
>
> Key: FLINK-13056
> URL: https://issues.apache.org/jira/browse/FLINK-13056
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Zhu Zhu
>Assignee: Zhu Zhu
>Priority: Major
>
> Currently some region boundary structures are calculated each time of a 
> region failover. This calculation can be heavy as its complexity goes up with 
> execution edge count.
> We tested it in a sample case with 8000 vertices and 16,000,000 edges. It 
> takes ~2.0s to calculate vertices to restart.
> (more details in 
> [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)]
> That's why we'd propose to cache the region boundary structures to improve 
> the region failover performance.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart

2019-08-16 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908946#comment-16908946
 ] 

Till Rohrmann commented on FLINK-13056:
---

How exactly do you want to cache the region boundary structures to improve the 
performance [~zhuzh]?

> Optimize region failover performance on calculating vertices to restart
> ---
>
> Key: FLINK-13056
> URL: https://issues.apache.org/jira/browse/FLINK-13056
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Zhu Zhu
>Assignee: Zhu Zhu
>Priority: Major
>
> Currently some region boundary structures are calculated each time of a 
> region failover. This calculation can be heavy as its complexity goes up with 
> execution edge count.
> We tested it in a sample case with 8000 vertices and 16,000,000 edges. It 
> takes ~2.0s to calculate vertices to restart.
> (more details in 
> [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)]
> That's why we'd propose to cache the region boundary structures to improve 
> the region failover performance.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)