[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart
[ https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991173#comment-16991173 ] Zhu Zhu commented on FLINK-13056: - postpone this improvement to 1.11. > Optimize region failover performance on calculating vertices to restart > --- > > Key: FLINK-13056 > URL: https://issues.apache.org/jira/browse/FLINK-13056 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0, 1.10.0 >Reporter: Zhu Zhu >Assignee: Zhu Zhu >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Currently some region boundary structures are calculated each time of a > region failover. This calculation can be heavy as its complexity goes up with > execution edge count. > We tested it in a sample case with 8000 vertices and 16,000,000 edges. It > takes ~2.0s to calculate vertices to restart. > (more details in > [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)] > That's why we'd propose to cache the region boundary structures to improve > the region failover performance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart
[ https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931427#comment-16931427 ] Till Rohrmann commented on FLINK-13056: --- This sounds very promising [~zhuzh]. Let's try to get it in for Flink 1.10. The great thing is that it does not affect any other existing Flink components and is self-contained. This should make it easier to merge. > Optimize region failover performance on calculating vertices to restart > --- > > Key: FLINK-13056 > URL: https://issues.apache.org/jira/browse/FLINK-13056 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Zhu Zhu >Assignee: Zhu Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently some region boundary structures are calculated each time of a > region failover. This calculation can be heavy as its complexity goes up with > execution edge count. > We tested it in a sample case with 8000 vertices and 16,000,000 edges. It > takes ~2.0s to calculate vertices to restart. > (more details in > [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)] > That's why we'd propose to cache the region boundary structures to improve > the region failover performance. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart
[ https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930446#comment-16930446 ] Zhu Zhu commented on FLINK-13056: - Hi [~till.rohrmann], a FastRestartPipelinedRegionStrategy is added. I compared the performance for the 2 most common scenarios as well. The result is updated in [this doc|https://docs.google.com/document/d/1-QLxe4FXqXBuxlYsNmNU-R21euoTkzk1JAS6Lvrd-F4/edit#]. I think it is very beneficial for streaming jobs which has only pipelined edges. The cost to cache boundary is 0 and the failover processing time reduces significantly to almost 0 ms. The only cost is a bit more time to build the regions. For jobs with blocking edges, the failover is still much faster. I also applied a local patch to make the new strategy work with the adapted region failover strategy (see test case 4). Using current strategy the failover decision time cost is ~1800ms for a 2000 parallelism job with a simple topology like {A--(pipelined)-->B}. Using the new strategy the failover decision time cost is ~0s. So I think making the faster version as default for streaming jobs should be a better choice. > Optimize region failover performance on calculating vertices to restart > --- > > Key: FLINK-13056 > URL: https://issues.apache.org/jira/browse/FLINK-13056 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Zhu Zhu >Assignee: Zhu Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently some region boundary structures are calculated each time of a > region failover. This calculation can be heavy as its complexity goes up with > execution edge count. > We tested it in a sample case with 8000 vertices and 16,000,000 edges. It > takes ~2.0s to calculate vertices to restart. > (more details in > [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)] > That's why we'd propose to cache the region boundary structures to improve > the region failover performance. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart
[ https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909096#comment-16909096 ] Zhu Zhu commented on FLINK-13056: - The diff can be found at [https://github.com/zhuzhurk/flink/commit/4f7da57b218e9ccd86f468f9ece62ee1e378ceda]. Need to mention that this diff is based on the initial version of flip1.RestartPipelinedRegionStrategy. So it cannot be applied to latest flip1.RestartPipelinedRegionStrategy directly, as the region building was refactored out from it later(for partition releasing). The perf test case(RegionFailoverPerfTest#complexPerfTest) used can be found in the same branch. > Optimize region failover performance on calculating vertices to restart > --- > > Key: FLINK-13056 > URL: https://issues.apache.org/jira/browse/FLINK-13056 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Zhu Zhu >Assignee: Zhu Zhu >Priority: Major > > Currently some region boundary structures are calculated each time of a > region failover. This calculation can be heavy as its complexity goes up with > execution edge count. > We tested it in a sample case with 8000 vertices and 16,000,000 edges. It > takes ~2.0s to calculate vertices to restart. > (more details in > [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)] > That's why we'd propose to cache the region boundary structures to improve > the region failover performance. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart
[ https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909021#comment-16909021 ] Till Rohrmann commented on FLINK-13056: --- How exactly would it change the failover region computation? Maybe you could share some code examples. Maybe we could make it pluggable so that the user can choose which version she prefers. > Optimize region failover performance on calculating vertices to restart > --- > > Key: FLINK-13056 > URL: https://issues.apache.org/jira/browse/FLINK-13056 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Zhu Zhu >Assignee: Zhu Zhu >Priority: Major > > Currently some region boundary structures are calculated each time of a > region failover. This calculation can be heavy as its complexity goes up with > execution edge count. > We tested it in a sample case with 8000 vertices and 16,000,000 edges. It > takes ~2.0s to calculate vertices to restart. > (more details in > [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)] > That's why we'd propose to cache the region boundary structures to improve > the region failover performance. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart
[ https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909001#comment-16909001 ] Zhu Zhu commented on FLINK-13056: - [~till.rohrmann] we tried to optimize it, at the cost of more data cached and slowing down the region building time. Taking the sample case with 8000 vertices and 16,000,000 edges as an example. The failover time reduced from 1961ms to 110ms. The region building time increases from 523ms to 5681ms as a side effect. [https://docs.google.com/document/d/1-QLxe4FXqXBuxlYsNmNU-R21euoTkzk1JAS6Lvrd-F4/edit] > Optimize region failover performance on calculating vertices to restart > --- > > Key: FLINK-13056 > URL: https://issues.apache.org/jira/browse/FLINK-13056 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Zhu Zhu >Assignee: Zhu Zhu >Priority: Major > > Currently some region boundary structures are calculated each time of a > region failover. This calculation can be heavy as its complexity goes up with > execution edge count. > We tested it in a sample case with 8000 vertices and 16,000,000 edges. It > takes ~2.0s to calculate vertices to restart. > (more details in > [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)] > That's why we'd propose to cache the region boundary structures to improve > the region failover performance. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart
[ https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908955#comment-16908955 ] Chesnay Schepler commented on FLINK-13056: -- I'm removing this as a subtask of FLINK-4256 so we can mark that one as finished (since it's weird if a FLIP is marked as release but the JIRA still in progress). > Optimize region failover performance on calculating vertices to restart > --- > > Key: FLINK-13056 > URL: https://issues.apache.org/jira/browse/FLINK-13056 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Zhu Zhu >Assignee: Zhu Zhu >Priority: Major > > Currently some region boundary structures are calculated each time of a > region failover. This calculation can be heavy as its complexity goes up with > execution edge count. > We tested it in a sample case with 8000 vertices and 16,000,000 edges. It > takes ~2.0s to calculate vertices to restart. > (more details in > [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)] > That's why we'd propose to cache the region boundary structures to improve > the region failover performance. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (FLINK-13056) Optimize region failover performance on calculating vertices to restart
[ https://issues.apache.org/jira/browse/FLINK-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908946#comment-16908946 ] Till Rohrmann commented on FLINK-13056: --- How exactly do you want to cache the region boundary structures to improve the performance [~zhuzh]? > Optimize region failover performance on calculating vertices to restart > --- > > Key: FLINK-13056 > URL: https://issues.apache.org/jira/browse/FLINK-13056 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Affects Versions: 1.9.0 >Reporter: Zhu Zhu >Assignee: Zhu Zhu >Priority: Major > > Currently some region boundary structures are calculated each time of a > region failover. This calculation can be heavy as its complexity goes up with > execution edge count. > We tested it in a sample case with 8000 vertices and 16,000,000 edges. It > takes ~2.0s to calculate vertices to restart. > (more details in > [https://docs.google.com/document/d/197Ou-01h2obvxq8viKqg4FnOnsykOEKxk3r5WrVBPuA/edit?usp=sharing)] > That's why we'd propose to cache the region boundary structures to improve > the region failover performance. -- This message was sent by Atlassian JIRA (v7.6.14#76016)