[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392337#comment-17392337 ]

Timo Walther edited comment on FLINK-23593 at 8/3/21, 2:35 PM:
---

I performed a couple of benchmarks locally for the previously mentioned flags. I don't think that FLINK-23372 caused a major regression. However, it seems we definitely added some regression recently that slows down this benchmark:

{code}
bb175622e3 (1.13 cut)

Benchmark                                                                   Mode  Cnt     Score    Error   Units
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingBlocking          thrpt   30  1753.489 ± 15.902  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingForwardPipelined  thrpt   30  1782.957 ± 21.945  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingPipelined         thrpt   30  1870.771 ± 50.255  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingBlocking            thrpt   30  1836.818 ± 17.767  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingForwardPipelined    thrpt   30  1809.482 ± 26.410  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingPipelined           thrpt   30  1929.729 ± 21.632  ops/ms

d8b1a6fd36 (FLINK-23593)

Benchmark                                                                   Mode  Cnt     Score    Error   Units
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingBlocking          thrpt   30  1887.372 ± 27.990  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingForwardPipelined  thrpt   30  1875.029 ± 20.378  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingPipelined         thrpt   30  1985.825 ± 25.675  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingBlocking            thrpt   30  1834.068 ± 48.316  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingForwardPipelined    thrpt   30  1833.997 ± 30.467  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingPipelined           thrpt   30  2015.552 ± 27.705  ops/ms

6aa0a8a0dd (master)

Benchmark                                                                   Mode  Cnt     Score    Error   Units
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingBlocking          thrpt   30  1642.628 ± 21.183  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingForwardPipelined  thrpt   30  1672.128 ± 15.114  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingPipelined         thrpt   30  1761.725 ± 18.225  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingBlocking            thrpt   30  1681.684 ± 17.065  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingForwardPipelined    thrpt   30  1689.087 ± 18.509  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingPipelined           thrpt   30  1731.022 ± 32.813  ops/ms
{code}

Branch: https://github.com/twalthr/flink-benchmarks/tree/FLINK-23593

was (Author: twalthr):

I performed a couple of benchmarks locally for the previously mentioned flags. I don't think that FLINK-23372 caused a major regression. However, it seems we definitely added some regression recently that slows down this benchmark:

{code}
bb175622e3 (1.13 cut)

Benchmark                                                                   Mode  Cnt     Score    Error   Units
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingBlocking          thrpt   30  1753.489 ± 15.902  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingForwardPipelined  thrpt   30  1782.957 ± 21.945  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingPipelined         thrpt   30  1870.771 ± 50.255  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingBlocking            thrpt   30  1836.818 ± 17.767  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingForwardPipelined    thrpt   30  1809.482 ± 26.410  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingPipelined           thrpt   30  1929.729 ± 21.632  ops/ms

d8b1a6fd36 (FLINK-23593)

Benchmark                                                                   Mode  Cnt     Score    Error   Units
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingBlocking          thrpt   30  1887.372 ± 27.990  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingForwardPipelined  thrpt   30  1875.029 ± 20.378  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingPipelined         thrpt   30  1985.825 ± 25.675  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingBlocking            thrpt   30  1834.068 ± 48.316  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingForwardPipelined    thrpt   30  1833.997 ± 30.467  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingPipelined           thrpt   30  2015.552 ± 27.705  o
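To put the three snapshots side by side, the relative change per benchmark can be computed directly from the scores in the tables above. The following is only a minimal sketch: the scores are copied from the comment, while the names `BENCHMARKS`, `SCORES`, `percent_change` and `compare` are made up for this illustration and are not part of flink-benchmarks.

```python
# Scores in ops/ms, copied from the JMH tables in the comment above.
# All identifiers in this sketch are hypothetical helper names.
BENCHMARKS = [
    "sortedTwoInputNoSlotSharingBlocking",
    "sortedTwoInputNoSlotSharingForwardPipelined",
    "sortedTwoInputNoSlotSharingPipelined",
    "sortedTwoInputSlotSharingBlocking",
    "sortedTwoInputSlotSharingForwardPipelined",
    "sortedTwoInputSlotSharingPipelined",
]

SCORES = {
    "bb175622e3 (1.13 cut)": [1753.489, 1782.957, 1870.771, 1836.818, 1809.482, 1929.729],
    "d8b1a6fd36 (FLINK-23593)": [1887.372, 1875.029, 1985.825, 1834.068, 1833.997, 2015.552],
    "6aa0a8a0dd (master)": [1642.628, 1672.128, 1761.725, 1681.684, 1689.087, 1731.022],
}


def percent_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


def compare(baseline_label, candidate_label):
    """Map each benchmark name to its percent change between two snapshots."""
    base = SCORES[baseline_label]
    cand = SCORES[candidate_label]
    return {
        name: percent_change(b, c)
        for name, b, c in zip(BENCHMARKS, base, cand)
    }


master_vs_113 = compare("bb175622e3 (1.13 cut)", "6aa0a8a0dd (master)")
for name, change in master_vs_113.items():
    print(f"{name:45s} {change:+6.2f} %")
```

Under these numbers every variant is slower on master than at the 1.13 cut, which is consistent with the statement that some regression was added recently; the sketch says nothing about which commit introduced it.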
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392423#comment-17392423 ]

Piotr Nowojski edited comment on FLINK-23593 at 8/3/21, 4:58 PM:
-

Because of https://issues.apache.org/jira/browse/FLINK-23392, https://issues.apache.org/jira/browse/FLINK-23560, you cannot compare the results from July 15th to the current results. Also, because of various breaking changes like https://issues.apache.org/jira/browse/FLINK-23464, you cannot use the benchmarking code from current `flink-benchmarks` master to run old `flink` code. You have to use both Flink and flink-benchmarks code from the time of the regression.

I was able to quite easily reproduce the regression of FLINK-23392 using flink-benchmarks commit: d816a18

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/345/ (last good, flink commit: 4a78097d038)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1996.460479,28.904057,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2337.385239,43.234577,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1946.457665,28.919437,"ops/ms"
{noformat}

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/347/ (first bad, flink commit: d8b1a6fd368)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1837.391829,23.495855,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2370.271382,37.804557,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1788.425393,22.619503,"ops/ms"
{noformat}

was (Author: pnowojski):

Because of https://issues.apache.org/jira/browse/FLINK-23392, https://issues.apache.org/jira/browse/FLINK-23560, you cannot compare the results from July 15th to the current results. Also, because of various breaking changes like https://issues.apache.org/jira/browse/FLINK-23464, you cannot use the benchmarking code from current `flink-benchmarks` master to run old `flink` code. You have to use both Flink and flink-benchmarks code from the time of the regression.

I was able to quite easily reproduce the regression of FLINK-23392 using flink-benchmarks commit: d816a18

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/345/ (last good, flink commit: 4a78097d038)
http://codespeed.dak8s.net:8080/job/flink-benchmark-request/347/ (first bad, flink commit: d8b1a6fd368)

> Performance regression on 15.07.2021
>
>
> Key: FLINK-23593
> URL: https://issues.apache.org/jira/browse/FLINK-23593
> Project: Flink
> Issue Type: Bug
> Components: API / DataStream, Benchmarks
> Affects Versions: 1.14.0
> Reporter: Piotr Nowojski
> Assignee: Timo Walther
> Priority: Blocker
> Fix For: 1.14.0
>
>
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput&env=2
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput&env=2
> {noformat}
> pnowojski@piotr-mbp: [~/flink - ((no branch, bisect started on pr/16589))] $ git ls f4afbf3e7de..eb8100f7afe
> eb8100f7afe [4 weeks ago] (pn/bad, bad, refs/bisect/bad) [FLINK-22017][coordination] Allow BLOCKING result partition to be individually consumable [Thesharing]
> d2005268b1e [4 weeks ago] (HEAD, pn/bisect-4, bisect-4) [FLINK-22017][coordination] Get the ConsumedPartitionGroup that IntermediateResultPartition and DefaultResultPartition belong to [Thesharing]
> d8b1a6fd368 [3 weeks ago] [FLINK-23372][streaming-java] Disable AllVerticesInSameSlotSharingGroupByDefault in batch mode [Timo Walther]
> 4a78097d038 [3 weeks ago] (pn/bisect-3, bisect-3, refs/bisect/good-4a78097d0385749daceafd8326930c8cc5f26f1a) [FLINK-21928][clients][runtime] Introduce static method constructors of DuplicateJobSubmissionException for better readability. [David Moravek]
> 172b9e32215 [3 weeks ago] [FLINK-21928][clients] JobManager failover should succeed, when trying to resubmit already terminated job in application mode. [David Moravek]
> f483008db86 [3 weeks ago] [FLINK-21928][core] Introduce org.apache.flink.util.concurrent.FutureUtils#handleException method, that allows future to recover from the specied exception. [David Moravek]
> d7ac08c2ac0 [3 weeks ago] (pn/bisect-2, bisect-2, refs/bisect/good-d7ac08c2ac06b9ff31707f3b8f43c07817814d4f) [FLINK-22843][docs-zh] Document and code are inconsistent [ZhiJie Yang]
> 16c3ea427df [3 weeks ago] [hotfix] Split the final checkpoint related test
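The "last good" versus "first bad" comparison in the comment above can be checked mechanically from the two JMH CSV blocks. The sketch below is only an illustration: the CSV text is copied from the comment, the helper name `load_scores` is invented here, and treating a change as real when it exceeds the sum of both reported 99.9% error margins is a rough rule of thumb assumed for this example, not something the ticket prescribes.

```python
import csv
import io

# CSV blocks copied from the two codespeed jobs quoted in the comment above.
LAST_GOOD_CSV = '''
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1996.460479,28.904057,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2337.385239,43.234577,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1946.457665,28.919437,"ops/ms"
'''

FIRST_BAD_CSV = '''
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1837.391829,23.495855,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2370.271382,37.804557,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1788.425393,22.619503,"ops/ms"
'''


def load_scores(csv_text):
    """Map short benchmark name -> (score, error) from JMH CSV output."""
    reader = csv.DictReader(io.StringIO(csv_text.strip()))
    return {
        row["Benchmark"].rsplit(".", 1)[-1]: (
            float(row["Score"]),
            float(row["Score Error (99.9%)"]),
        )
        for row in reader
    }


good = load_scores(LAST_GOOD_CSV)
bad = load_scores(FIRST_BAD_CSV)

regressed = []
for name, (good_score, good_err) in good.items():
    bad_score, bad_err = bad[name]
    change_pct = (bad_score - good_score) / good_score * 100.0
    # Assumed heuristic: the drop must exceed both error margins combined.
    outside_error = abs(bad_score - good_score) > good_err + bad_err
    if outside_error and bad_score < good_score:
        regressed.append(name)
    print(f"{name:18s} {change_pct:+6.2f} %  outside error: {outside_error}")
```

With the quoted numbers, `sortedMultiInput` and `sortedTwoInput` drop by roughly 8 % each, well outside the error margins, while `sortedOneInput` moves by less than the combined error.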
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392423#comment-17392423 ]

Piotr Nowojski edited comment on FLINK-23593 at 8/3/21, 4:58 PM:
-

Because of https://issues.apache.org/jira/browse/FLINK-23392, https://issues.apache.org/jira/browse/FLINK-23560, you cannot compare the results from July 15th to the current results. Also, because of various breaking changes like https://issues.apache.org/jira/browse/FLINK-23464, you cannot use the benchmarking code from current `flink-benchmarks` master to run old `flink` code. You have to use both Flink and flink-benchmarks code from the time of the regression.

I was able to quite easily reproduce the regression of FLINK-23392 using flink-benchmarks commit: d816a18

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/345/ (last good, flink commit: 4a78097d038)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1996.460479,28.904057,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2337.385239,43.234577,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1946.457665,28.919437,"ops/ms"
{noformat}

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/347/ (first bad, flink commit: d8b1a6fd368)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1837.391829,23.495855,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2370.271382,37.804557,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1788.425393,22.619503,"ops/ms"
{noformat}
(those numbers perfectly align with the performance regression visible in the webUI on 15.07)

was (Author: pnowojski):

Because of https://issues.apache.org/jira/browse/FLINK-23392, https://issues.apache.org/jira/browse/FLINK-23560, you cannot compare the results from July 15th to the current results. Also, because of various breaking changes like https://issues.apache.org/jira/browse/FLINK-23464, you cannot use the benchmarking code from current `flink-benchmarks` master to run old `flink` code. You have to use both Flink and flink-benchmarks code from the time of the regression.

I was able to quite easily reproduce the regression of FLINK-23392 using flink-benchmarks commit: d816a18

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/345/ (last good, flink commit: 4a78097d038)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1996.460479,28.904057,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2337.385239,43.234577,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1946.457665,28.919437,"ops/ms"
{noformat}

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/347/ (first bad, flink commit: d8b1a6fd368)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1837.391829,23.495855,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2370.271382,37.804557,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1788.425393,22.619503,"ops/ms"
{noformat}

> Performance regression on 15.07.2021
>
>
> Key: FLINK-23593
> URL: https://issues.apache.org/jira/browse/FLINK-23593
> Project: Flink
> Issue Type: Bug
> Components: API / DataStream, Benchmarks
> Affects Versions: 1.14.0
> Reporter: Piotr Nowojski
> Assignee: Timo Walther
> Priority: Blocker
> Fix For: 1.14.0
>
>
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput&env=2
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput&env=2
> {noformat}
> pnowojski@piotr-mbp: [~/flink - ((no branch, bisect started on pr/16589))] $ git ls f4afbf3e7de..eb8100f7afe
> eb8100f7afe [4 weeks ago] (pn/bad, bad, refs/bisect/bad) [FLINK-22017][coordination] Allow BLOCKING result partition to be individually consumable [Thesharing]
> d2005268b1e [4 weeks ago] (HEAD, pn/bisect-4, bisect-4) [FLINK-22017][coordination] Get the ConsumedPartitionGroup that IntermediateResultPartition and DefaultResultPartition belong to [Thesharing]
> d8b1a6fd368 [3 weeks ago] [FLINK-23372][streami
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392423#comment-17392423 ]

Piotr Nowojski edited comment on FLINK-23593 at 8/3/21, 4:59 PM:
-

Because of FLINK-23392, FLINK-23560, you cannot compare the results from July 15th to the current results. Also, because of various breaking changes like FLINK-23464, you cannot use the benchmarking code from current `flink-benchmarks` master to run old `flink` code. You have to use both Flink and flink-benchmarks code from the time of the regression.

I was able to quite easily reproduce the regression of FLINK-23392 using flink-benchmarks commit: d816a18

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/345/ (last good, flink commit: 4a78097d038)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1996.460479,28.904057,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2337.385239,43.234577,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1946.457665,28.919437,"ops/ms"
{noformat}

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/347/ (first bad, flink commit: d8b1a6fd368)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1837.391829,23.495855,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2370.271382,37.804557,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1788.425393,22.619503,"ops/ms"
{noformat}
(those numbers perfectly align with the performance regression visible in the webUI on 15.07)

was (Author: pnowojski):

Because of https://issues.apache.org/jira/browse/FLINK-23392, https://issues.apache.org/jira/browse/FLINK-23560, you cannot compare the results from July 15th to the current results. Also, because of various breaking changes like https://issues.apache.org/jira/browse/FLINK-23464, you cannot use the benchmarking code from current `flink-benchmarks` master to run old `flink` code. You have to use both Flink and flink-benchmarks code from the time of the regression.

I was able to quite easily reproduce the regression of FLINK-23392 using flink-benchmarks commit: d816a18

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/345/ (last good, flink commit: 4a78097d038)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1996.460479,28.904057,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2337.385239,43.234577,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1946.457665,28.919437,"ops/ms"
{noformat}

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/347/ (first bad, flink commit: d8b1a6fd368)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1837.391829,23.495855,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2370.271382,37.804557,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1788.425393,22.619503,"ops/ms"
{noformat}
(those numbers perfectly align with the performance regression visible in the webUI on 15.07)

> Performance regression on 15.07.2021
>
>
> Key: FLINK-23593
> URL: https://issues.apache.org/jira/browse/FLINK-23593
> Project: Flink
> Issue Type: Bug
> Components: API / DataStream, Benchmarks
> Affects Versions: 1.14.0
> Reporter: Piotr Nowojski
> Assignee: Timo Walther
> Priority: Blocker
> Fix For: 1.14.0
>
>
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput&env=2
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput&env=2
> {noformat}
> pnowojski@piotr-mbp: [~/flink - ((no branch, bisect started on pr/16589))] $ git ls f4afbf3e7de..eb8100f7afe
> eb8100f7afe [4 weeks ago] (pn/bad, bad, refs/bisect/bad) [FLINK-22017][coordination] Allow BLOCKING result partition to be individually consumable [Thesharing]
> d2005268b1e [4 weeks ago] (HEAD, pn/bisect-4, bisect-4) [FLINK-22017][coordination] Get the ConsumedPartitionGroup that IntermediateResultPartition and DefaultResultPartition belong to [Thesharing]
> d8b1a6fd368 [3 weeks ago] [FLINK-23372][streaming-java] Disable Al
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392423#comment-17392423 ]

Piotr Nowojski edited comment on FLINK-23593 at 8/3/21, 4:59 PM:
-

Because of FLINK-23392, FLINK-23560, you cannot compare the results from July 15th to the current results. Also, because of various breaking changes like FLINK-23464, you cannot use the benchmarking code from current `flink-benchmarks` master to run old `flink` code. You have to use both Flink and flink-benchmarks code from the time of the regression.

I was able to quite easily reproduce the regression from this ticket using flink-benchmarks commit: d816a18

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/345/ (last good, flink commit: 4a78097d038)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1996.460479,28.904057,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2337.385239,43.234577,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1946.457665,28.919437,"ops/ms"
{noformat}

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/347/ (first bad, flink commit: d8b1a6fd368)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1837.391829,23.495855,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2370.271382,37.804557,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1788.425393,22.619503,"ops/ms"
{noformat}
(those numbers perfectly align with the performance regression visible in the webUI on 15.07)

was (Author: pnowojski):

Because of FLINK-23392, FLINK-23560, you cannot compare the results from July 15th to the current results. Also, because of various breaking changes like FLINK-23464, you cannot use the benchmarking code from current `flink-benchmarks` master to run old `flink` code. You have to use both Flink and flink-benchmarks code from the time of the regression.

I was able to quite easily reproduce the regression of FLINK-23392 using flink-benchmarks commit: d816a18

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/345/ (last good, flink commit: 4a78097d038)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1996.460479,28.904057,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2337.385239,43.234577,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1946.457665,28.919437,"ops/ms"
{noformat}

http://codespeed.dak8s.net:8080/job/flink-benchmark-request/347/ (first bad, flink commit: d8b1a6fd368)
{noformat}
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedMultiInput","thrpt",1,30,1837.391829,23.495855,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedOneInput","thrpt",1,30,2370.271382,37.804557,"ops/ms"
"org.apache.flink.benchmark.SortingBoundedInputBenchmarks.sortedTwoInput","thrpt",1,30,1788.425393,22.619503,"ops/ms"
{noformat}
(those numbers perfectly align with the performance regression visible in the webUI on 15.07)

> Performance regression on 15.07.2021
>
>
> Key: FLINK-23593
> URL: https://issues.apache.org/jira/browse/FLINK-23593
> Project: Flink
> Issue Type: Bug
> Components: API / DataStream, Benchmarks
> Affects Versions: 1.14.0
> Reporter: Piotr Nowojski
> Assignee: Timo Walther
> Priority: Blocker
> Fix For: 1.14.0
>
>
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput&env=2
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput&env=2
> {noformat}
> pnowojski@piotr-mbp: [~/flink - ((no branch, bisect started on pr/16589))] $ git ls f4afbf3e7de..eb8100f7afe
> eb8100f7afe [4 weeks ago] (pn/bad, bad, refs/bisect/bad) [FLINK-22017][coordination] Allow BLOCKING result partition to be individually consumable [Thesharing]
> d2005268b1e [4 weeks ago] (HEAD, pn/bisect-4, bisect-4) [FLINK-22017][coordination] Get the ConsumedPartitionGroup that IntermediateResultPartition and DefaultResultPartition belong to [Thesharing]
> d8b1a6fd368 [3 weeks ago] [FLINK-23372][streaming-java] Disable AllVerticesInSameSlotSharingGroupByDefault in batch mode [Timo Walther]
> 4a78097d038 [3 weeks ago] (pn/bisect-3, bi
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393287#comment-17393287 ]

Stephan Ewen edited comment on FLINK-23593 at 8/4/21, 4:00 PM:
---

Here is a summary from discussing this offline with [~twalthr].

**Meaningful Change**

The general change of behavior is meaningful. Not having tasks share their slots during batch execution means we don't fragment the memory budget as much between different tasks that most likely don't run concurrently anyway. It should give more reliable performance at scale and more predictable behavior by default.

**Regression acceptable**

We are altering behavior here that has a performance impact, so some amount of change in the benchmarks is expected. In particular, slot sharing is beneficial for small scale:
* small data means one slot's memory is enough to accommodate all tasks
* fewer slots allocated means a bit less overhead during slot allocation, less bookkeeping.

Not sharing slots is beneficial for larger scale:
* more memory per operator
* often means fewer concurrent tasks, so more network buffers per task

**Trying to explain the Regression**

The executed data flow is pretty much the same in all cases. The tasks and the network stack (local channels, batch shuffles) don't actually care whether they are in one slot or another.

My working assumption is that the difference is caused by a few factors in the startup overhead. More slots are required to be allocated, more TM / JM coordination at startup.

Another option could be that if the keyed operator (with the sorting) gets its own dedicated slot (when not slot sharing), it gets more memory. The sorter reserves its full share of memory from the MemoryManager, which in turn allocates it at startup (and initializes it to zero). While more memory is generally good, it also has a slightly longer initialization phase. [~zhuzh] could that be an explanation?

I think Timo's benchmarks are quite good, comparing slot-sharing vs. not-slot-sharing within the same code snapshot, also relative to the different batch shuffle settings. That's really what we want to understand here. The difference between slot sharing and not sharing, depending on the shuffle mode, is pretty small here.

{code}
Benchmark                                                            Mode  Cnt     Score    Error   Units
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingBlocking   thrpt   30  1642.628 ± 21.183  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingBlocking     thrpt   30  1681.684 ± 17.065  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputNoSlotSharingPipelined  thrpt   30  1761.725 ± 18.225  ops/ms
SortingBoundedInputBenchmarks.sortedTwoInputSlotSharingPipelined    thrpt   30  1731.022 ± 32.813  ops/ms
{code}

_(Note, I removed the cases with "ForwardPipelined" because it is the same as "Blocking" in that benchmark. There are no forward exchanges, the sink is chained, the sources connect via keyBy())_

It is curious, though, that for pipelined execution, the variant without sharing slots is actually a bit faster.

was (Author: stephanewen):

Here is a summary from discussing this offline with [~twalthr].

**Meaningful Change**

The general change of behavior is meaningful. Not having tasks share their slots during batch execution means we don't fragment the memory budget as much between different tasks that most likely don't run concurrently anyway. It should give more reliable performance at scale and more predictable behavior by default.

**Regression acceptable**

We are altering behavior here that has a performance impact, so some amount of change in the benchmarks is expected. In particular, slot sharing is beneficial for small scale:
* small data means one slot's memory is enough to accommodate all tasks
* fewer slots allocated means a bit less overhead during slot allocation, less bookkeeping.

Not sharing slots is beneficial for larger scale:
* more memory per operator
* often means fewer concurrent tasks, so more network buffers per task

**Trying to explain the Regression**

The executed data flow is pretty much the same in all cases. The tasks and the network stack (local channels, batch shuffles) don't actually care whether they are in one slot or another.

My working assumption is that the difference is caused by a few factors in the startup overhead. More slots are required to be allocated, more TM / JM coordination at startup.

Another option could be that if the keyed operator (with the sorting) gets its own dedicated slot (when not slot sharing), it gets more memory. The sorter reserves its full share of memory from the MemoryManager, which in turn allocates it at startup (and initializes it to zero). While more memory is generally good, it also has a slightly longer i
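One way to read "pretty small" in the four-row table of the comment above is to check whether the reported error intervals of the sharing and non-sharing variants overlap. The sketch below does only that. The scores and errors are copied from the comment, while `RESULTS`, `interval` and `intervals_overlap` are names invented for this example, and interval overlap is a coarse heuristic assumed here, not a proper significance test.

```python
# (score, error) in ops/ms, copied from the table in the comment above.
# The dictionary layout and function names are hypothetical.
RESULTS = {
    "Blocking": {
        "no_sharing": (1642.628, 21.183),
        "sharing": (1681.684, 17.065),
    },
    "Pipelined": {
        "no_sharing": (1761.725, 18.225),
        "sharing": (1731.022, 32.813),
    },
}


def interval(score, error):
    """Return the (low, high) bounds reported as score ± error."""
    return (score - error, score + error)


def intervals_overlap(a, b):
    """True if the error intervals of two (score, error) pairs overlap."""
    a_lo, a_hi = interval(*a)
    b_lo, b_hi = interval(*b)
    return a_lo <= b_hi and b_lo <= a_hi


overlap = {
    mode: intervals_overlap(variants["no_sharing"], variants["sharing"])
    for mode, variants in RESULTS.items()
}

for mode, overlapping in overlap.items():
    print(f"{mode:10s} error intervals overlap: {overlapping}")
```

Under this heuristic the pipelined intervals overlap, so the slightly higher score without slot sharing may well be noise, while the blocking intervals narrowly miss each other in favor of slot sharing.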
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393950#comment-17393950 ]

Zhu Zhu edited comment on FLINK-23593 at 8/5/21, 12:19 PM:
---

>> Could the larger difference between local benchmark vs. cloud be that the
>> cloud is running with regular HDDs and we always spill to disk because
>> SORT_SPILLING_THRESHOLD is set to 0?

Maybe yes, because the record processing time can be shorter on SSD, so the increased initialization time (described in *Trying to explain the Regression*) will be more obvious.

Another similar suspicion is that the flink-benchmark [patch|https://github.com/twalthr/flink-benchmarks/commit/dfe3cad86030b551daaa7c4a5951a6e4c06fc061] increased `RECORDS_PER_INVOCATION` from 1_500_000 to 3_000_000. This increased the processing time and may make the regression on initialization time less obvious.

was (Author: zhuzh):

>> Could the larger difference between local benchmark vs. cloud be that the
>> cloud is running with regular HDDs and we always spill to disk because
>> SORT_SPILLING_THRESHOLD is set to 0?

Maybe yes, because the record processing time can be shorter on SSD, so the increased initialization time (described in *Trying to explain the Regression*) will be more obvious.

Another similar suspicion is that the flink-benchmark [patch|https://github.com/twalthr/flink-benchmarks/commit/dfe3cad86030b551daaa7c4a5951a6e4c06fc061] increased `RECORDS_PER_INVOCATION` from 1_500_000 to 3_000_000. This increased processing time may make the regression on initialization time less obvious.

> Performance regression on 15.07.2021
>
>
> Key: FLINK-23593
> URL: https://issues.apache.org/jira/browse/FLINK-23593
> Project: Flink
> Issue Type: Bug
> Components: API / DataStream, Benchmarks
> Affects Versions: 1.14.0
> Reporter: Piotr Nowojski
> Assignee: Timo Walther
> Priority: Blocker
> Fix For: 1.14.0
>
>
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput&env=2
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput&env=2
> {noformat}
> pnowojski@piotr-mbp: [~/flink - ((no branch, bisect started on pr/16589))] $ git ls f4afbf3e7de..eb8100f7afe
> eb8100f7afe [4 weeks ago] (pn/bad, bad, refs/bisect/bad) [FLINK-22017][coordination] Allow BLOCKING result partition to be individually consumable [Thesharing]
> d2005268b1e [4 weeks ago] (HEAD, pn/bisect-4, bisect-4) [FLINK-22017][coordination] Get the ConsumedPartitionGroup that IntermediateResultPartition and DefaultResultPartition belong to [Thesharing]
> d8b1a6fd368 [3 weeks ago] [FLINK-23372][streaming-java] Disable AllVerticesInSameSlotSharingGroupByDefault in batch mode [Timo Walther]
> 4a78097d038 [3 weeks ago] (pn/bisect-3, bisect-3, refs/bisect/good-4a78097d0385749daceafd8326930c8cc5f26f1a) [FLINK-21928][clients][runtime] Introduce static method constructors of DuplicateJobSubmissionException for better readability. [David Moravek]
> 172b9e32215 [3 weeks ago] [FLINK-21928][clients] JobManager failover should succeed, when trying to resubmit already terminated job in application mode. [David Moravek]
> f483008db86 [3 weeks ago] [FLINK-21928][core] Introduce org.apache.flink.util.concurrent.FutureUtils#handleException method, that allows future to recover from the specied exception. [David Moravek]
> d7ac08c2ac0 [3 weeks ago] (pn/bisect-2, bisect-2, refs/bisect/good-d7ac08c2ac06b9ff31707f3b8f43c07817814d4f) [FLINK-22843][docs-zh] Document and code are inconsistent [ZhiJie Yang]
> 16c3ea427df [3 weeks ago] [hotfix] Split the final checkpoint related tests to a separate test class. [Yun Gao]
> 31b3d37a22c [7 weeks ago] [FLINK-21089][runtime] Skip the execution of new sources if finished on restore [Yun Gao]
> 20fe062e1b5 [3 weeks ago] [FLINK-21089][runtime] Skip execution for the legacy source task if finished on restore [Yun Gao]
> 874c627114b [3 weeks ago] [FLINK-21089][runtime] Skip the lifecycle method of operators if finished on restore [Yun Gao]
> ceaf24b1d88 [3 weeks ago] (pn/bisect-1, bisect-1, refs/bisect/good-ceaf24b1d881c2345a43f305d40435519a09cec9) [hotfix] Fix isClosed() for operator wrapper and proxy operator close to the operator chain [Yun Gao]
> 41ea591a6db [3 weeks ago] [FLINK-22627][runtime] Remove unused slot request protocol [Yangze Guo]
> 489346b60f8 [3 months ago] [FLINK-22627][runtime] Remove PendingSlotRequest [Yangze Guo]
> 8ffb4d2af36 [3 months ago] [FLINK-22627][runtime] Remove TaskManagerSlot [Yangze Guo]
> 72073741588 [3 months ago] [FLINK-22627][runtime] Remove SlotManagerImpl and its related tests [Yangze Guo]
> bdb3b7541b3 [3 months ago] [hotfix][yarn] Remove unused internal options in YarnConfigOptionsI
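The suspicion in the comment above, that doubling `RECORDS_PER_INVOCATION` makes a fixed initialization cost less visible, can be illustrated with a toy amortization model. Everything numeric below except the two record counts is a hypothetical assumption chosen for illustration (a processing rate of 2000 records/ms and 50 ms of extra startup time), and the model itself, throughput = records / (startup + records / rate), is a simplification that is not taken from the benchmark code.

```python
def throughput(records, rate_per_ms, startup_ms):
    """Records per ms when a fixed startup cost precedes steady processing."""
    return records / (startup_ms + records / rate_per_ms)


def relative_drop_pct(records, rate_per_ms, extra_startup_ms):
    """Percent throughput lost when extra_startup_ms is added to the run."""
    before = throughput(records, rate_per_ms, 0.0)
    after = throughput(records, rate_per_ms, extra_startup_ms)
    return (before - after) / before * 100.0


# Hypothetical values, NOT measured: steady-state rate and added startup time.
RATE_PER_MS = 2000.0
EXTRA_STARTUP_MS = 50.0

# The two record counts are the ones named in the comment above.
drop_small = relative_drop_pct(1_500_000, RATE_PER_MS, EXTRA_STARTUP_MS)
drop_large = relative_drop_pct(3_000_000, RATE_PER_MS, EXTRA_STARTUP_MS)

print(f"1_500_000 records: {drop_small:.2f} % slower")
print(f"3_000_000 records: {drop_large:.2f} % slower")
```

In this model the same extra startup time costs about half as much relative throughput once the record count doubles, which matches the direction of the argument; the real magnitudes depend on the actual startup and processing times, which the ticket does not state.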
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395983#comment-17395983 ]

Zhu Zhu edited comment on FLINK-23593 at 8/9/21, 11:20 AM:
---

I tried the benchmarks locally before and after applying FLINK-23372 and did not see an obvious regression.

I also ran the benchmarks on commit f4afbf3e7de19ebcc5cb9324a22ba99fcd354dce (last good on the [codespeed|http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput&env=2#/?exe=1,3,5&ben=sortedTwoInput&env=2&revs=200&equid=off&quarts=on&extr=on] curve) and on eb8100f7afe1cd2b6fceb55b174de097db752fc7 (first bad on the curve), but could not reproduce the regression there either.

Maybe it's due to HDD, but I have no idea yet.
> Performance regression on 15.07.2021
>
> Key: FLINK-23593
> URL: https://issues.apache.org/jira/browse/FLINK-23593
> Project: Flink
> Issue Type: Bug
> Components: API / DataStream, Benchmarks
> Affects Versions: 1.14.0
> Reporter: Piotr Nowojski
> Assignee: Timo Walther
> Priority: Blocker
> Fix For: 1.14.0
>
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput&env=2
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput&env=2
> {noformat}
> pnowojski@piotr-mbp: [~/flink - ((no branch, bisect started on pr/16589))] $ git ls f4afbf3e7de..eb8100f7afe
> eb8100f7afe [4 weeks ago] (pn/bad, bad, refs/bisect/bad) [FLINK-22017][coordination] Allow BLOCKING result partition to be individually consumable [Thesharing]
> d2005268b1e [4 weeks ago] (HEAD, pn/bisect-4, bisect-4) [FLINK-22017][coordination] Get the ConsumedPartitionGroup that IntermediateResultPartition and DefaultResultPartition belong to [Thesharing]
> d8b1a6fd368 [3 weeks ago] [FLINK-23372][streaming-java] Disable AllVerticesInSameSlotSharingGroupByDefault in batch mode [Timo Walther]
> 4a78097d038 [3 weeks ago] (pn/bisect-3, bisect-3, refs/bisect/good-4a78097d0385749daceafd8326930c8cc5f26f1a) [FLINK-21928][clients][runtime] Introduce static method constructors of DuplicateJobSubmissionException for better readability. [David Moravek]
> 172b9e32215 [3 weeks ago] [FLINK-21928][clients] JobManager failover should succeed, when trying to resubmit already terminated job in application mode. [David Moravek]
> f483008db86 [3 weeks ago] [FLINK-21928][core] Introduce org.apache.flink.util.concurrent.FutureUtils#handleException method, that allows future to recover from the specied exception. [David Moravek]
> d7ac08c2ac0 [3 weeks ago] (pn/bisect-2, bisect-2, refs/bisect/good-d7ac08c2ac06b9ff31707f3b8f43c07817814d4f) [FLINK-22843][docs-zh] Document and code are inconsistent [ZhiJie Yang]
> 16c3ea427df [3 weeks ago] [hotfix] Split the final checkpoint related tests to a separate test class. [Yun Gao]
> 31b3d37a22c [7 weeks ago] [FLINK-21089][runtime] Skip the execution of new sources if finished on restore [Yun Gao]
> 20fe062e1b5 [3 weeks ago] [FLINK-21089][runtime] Skip execution for the legacy source task if finished on restore [Yun Gao]
> 874c627114b [3 weeks ago] [FLINK-21089][runtime] Skip the lifecycle method of operators if finished on restore [Yun Gao]
> ceaf24b1d88 [3 weeks ago] (pn/bisect-1, bisect-1, refs/bisect/good-ceaf24b1d881c2345a43f305d40435519a09cec9) [hotfix] Fix isClosed() for operator wrapper and proxy operator close to the operator chain [Yun Gao]
> 41ea591a6db [3 weeks ago] [FLINK-22627][runtime] Remove unused slot request protocol [Yangze Guo]
> 489346b60f8 [3 months ago] [FLINK-22627][runtime] Remove PendingSlotRequest [Yangze Guo]
> 8ffb4d2af36 [3 months ago] [FLINK-22627][runtime] Remove TaskManagerSlot [Yangze Guo]
> 72073741588 [3 months ago] [FLINK-22627][runtime] Remove SlotManagerImpl and its related tests [Yangze Guo]
> bdb3b7541b3 [3 months ago] [hotfix][yarn] Remove unused internal options in YarnConfigOptionsInternal [Yangze Guo]
> a6a9b192eac [3 weeks ago] [FLINK-23201][streaming] Reset alignment only for the currently processed checkpoint [Anton Kalashnikov]
> b35701a35c7 [3 weeks ago] [FLINK-23201][streaming] Calculate checkpoint alignment time only for last started checkpoint [Anton Kalashnikov]
> 3abec22c536 [3 weeks ago] [FLINK-231
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393950#comment-17393950 ]

Zhu Zhu edited comment on FLINK-23593 at 8/9/21, 11:22 AM:
---

>> Could the larger difference between local benchmark vs. cloud be that the cloud is running with regular HDDs and we always spill to disk because SORT_SPILLING_THRESHOLD is set to 0?

-Maybe yes. Because the record processing time can be shorter on SSD and the increased initialization time (described in *Trying to explain the Regression*) will be more obvious.-
Ignore the struck-through text above because it is wrong.

Another similar suspicion is that the flink-benchmark [patch|https://github.com/twalthr/flink-benchmarks/commit/dfe3cad86030b551daaa7c4a5951a6e4c06fc061] increased `RECORDS_PER_INVOCATION` from 1_500_000 to 3_000_000. This increased the processing time per invocation and may make the regression in initialization time less obvious.

was (Author: zhuzh):
>> Could the larger difference between local benchmark vs. cloud be that the cloud is running with regular HDDs and we always spill to disk because SORT_SPILLING_THRESHOLD is set to 0?

Maybe yes. Because the record processing time can be shorter on SSD and the increased initialization time (described in *Trying to explain the Regression*) will be more obvious.

Another similar suspicion is that the flink-benchmark [patch|https://github.com/twalthr/flink-benchmarks/commit/dfe3cad86030b551daaa7c4a5951a6e4c06fc061] increased `RECORDS_PER_INVOCATION` from 1_500_000 to 3_000_000. This increased processing time and may make the regression on initialization time less obvious.
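The suspicion about `RECORDS_PER_INVOCATION` can be illustrated with a small back-of-the-envelope model. This is only a sketch: it assumes that one invocation costs a fixed initialization time plus a constant time per record, and the concrete numbers (20 ms and 70 ms of initialization, 0.0005 ms per record) are hypothetical values chosen for illustration, not measurements from the benchmark. The class and method names are made up for this example.

```java
/** Toy model: how a fixed initialization cost is amortized over more records. */
final class AmortizationSketch {

    /** Throughput in records per millisecond for a single invocation. */
    static double throughput(long records, double initMs, double perRecordMs) {
        return records / (initMs + records * perRecordMs);
    }

    /** Relative throughput loss in percent when initialization grows from initBeforeMs to initAfterMs. */
    static double lossPercent(long records, double initBeforeMs, double initAfterMs, double perRecordMs) {
        double before = throughput(records, initBeforeMs, perRecordMs);
        double after = throughput(records, initAfterMs, perRecordMs);
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Hypothetical costs; only the two record counts come from the comment above.
        double perRecordMs = 0.0005;
        double initBeforeMs = 20.0;
        double initAfterMs = 70.0;
        System.out.println(lossPercent(1_500_000L, initBeforeMs, initAfterMs, perRecordMs));
        System.out.println(lossPercent(3_000_000L, initBeforeMs, initAfterMs, perRecordMs));
    }
}
```

Under these assumed costs, the same increase in initialization time produces a visibly smaller relative loss with 3_000_000 records than with 1_500_000, which is the effect the comment describes.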
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395983#comment-17395983 ]

Zhu Zhu edited comment on FLINK-23593 at 8/9/21, 11:24 AM:
---

I tried the benchmarks locally before and after applying FLINK-23372 and did not see an obvious regression.

I also ran the benchmarks on commit f4afbf3e7de19ebcc5cb9324a22ba99fcd354dce (last good on the [codespeed|http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput&env=2#/?exe=1,3,5&ben=sortedTwoInput&env=2&revs=200&equid=off&quarts=on&extr=on] curve) and on eb8100f7afe1cd2b6fceb55b174de097db752fc7 (first bad on the curve), but could not reproduce the regression there either.

Maybe it is related to environment differences (e.g. HDD), but I have no idea yet.
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396636#comment-17396636 ]

Zhu Zhu edited comment on FLINK-23593 at 8/10/21, 12:08 PM:

I think I found the cause of the regression.

*Cause*

The regression happens because FLINK-23372 disables slot sharing of batch job tasks, and a default MiniCluster provides just 1 task manager with 1 slot. This means that the two source tasks of {{sortedTwoInput}} were able to run simultaneously before FLINK-23372 and had to run sequentially after FLINK-23372 was merged. This increased the total execution time and resulted in the regression.

Later, on 07-20, an [improvement|https://github.com/apache/flink-benchmarks/commit/70d9b7b4927fc38ecf0950e55a47325b71e2dd63] was made to flink-benchmarks that changed the MiniCluster to be pre-launched with 1 task manager with 4 slots. This enabled the two source tasks of {{sortedTwoInput}} to run simultaneously again, and the regression was gone. That is also why we cannot reproduce the regression by reverting FLINK-23372 on the latest master.

This also explains:
- why the regression only happened to {{sortedTwoInput}} and {{sortedMultiInput}} and not to {{sortedOneInput}}
- why the performance increased on 07-20, again only for {{sortedTwoInput}} and {{sortedMultiInput}}

*Conclusion*

It is expected that more slots may be needed for a batch job to run tasks simultaneously. However, this does not mean that more resources are needed, because in theory each slot can be smaller now that it is no longer shared. Therefore, this regression is expected and acceptable.

*Attachment*

||Benchmark||Score||Link||
|07-15 sortedTwoInput before FLINK-23372|1904.626380|[#418|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/418/]|
|07-15 sortedTwoInput after FLINK-23372|1782.644331|[#419|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/419/]|
|07-20 sortedTwoInput before FLINK-23372|1964.448112|[#420|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/420/]|
|07-20 sortedTwoInput after FLINK-23372|1944.880662|[#421|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/421/]|
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396636#comment-17396636 ]

Zhu Zhu edited comment on FLINK-23593 at 8/10/21, 12:09 PM:

I think I found the cause of the regression.

*Cause*

The regression happens because FLINK-23372 disables slot sharing of batch job tasks, and a default MiniCluster provides just 1 task manager with 1 slot. This means that the two source tasks of {{sortedTwoInput}} were able to run simultaneously before FLINK-23372 and had to run sequentially after FLINK-23372 was merged. This increased the total execution time and resulted in the regression.

Later, on 07-20, an [improvement|https://github.com/apache/flink-benchmarks/commit/70d9b7b4927fc38ecf0950e55a47325b71e2dd63] was made to flink-benchmarks that changed the MiniCluster to be pre-launched with 1 task manager with 4 slots. This enabled the two source tasks of {{sortedTwoInput}} to run simultaneously again, and the regression was gone. That is also why we cannot reproduce the regression by reverting FLINK-23372 on the latest master.

This also explains:
- why the regression only happened to {{sortedTwoInput}} and {{sortedMultiInput}} and not to {{sortedOneInput}}
- why the performance increased on 07-20, again only for {{sortedTwoInput}} and {{sortedMultiInput}}

*Conclusion*

It is expected that more slots may be needed for a batch job to run tasks simultaneously. However, this does not mean that more resources are needed, because in theory each slot can be smaller now that it is no longer shared. Therefore, this regression is expected and acceptable.

*Attachment*

||Benchmark||Score||Link||
|07-15 sortedTwoInput before FLINK-23372|1904.626380|[#418|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/418/]|
|07-15 sortedTwoInput after FLINK-23372|1782.644331|[#419|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/419/]|
|07-20 sortedTwoInput before FLINK-23372|1964.448112|[#420|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/420/]|
|07-20 sortedTwoInput after FLINK-23372|1944.880662|[#421|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/421/]|
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396636#comment-17396636 ]

Zhu Zhu edited comment on FLINK-23593 at 8/11/21, 2:51 AM:
---

I think I found the cause of the major regression.

*Cause*

The major regression happens because FLINK-23372 disables slot sharing of batch job tasks, and a default MiniCluster provides just 1 task manager with 1 slot. This means that the two source tasks of {{sortedTwoInput}} were able to run simultaneously before FLINK-23372 and had to run sequentially after FLINK-23372 was merged. This increased the total execution time and resulted in the regression.

Later, on 07-20, an [improvement|https://github.com/apache/flink-benchmarks/commit/70d9b7b4927fc38ecf0950e55a47325b71e2dd63] was made to flink-benchmarks that changed the MiniCluster to be pre-launched with 1 task manager with 4 slots. This enabled the two source tasks of {{sortedTwoInput}} to run simultaneously again, and the major regression was gone. That is also why we cannot reproduce the obvious regression by reverting FLINK-23372 on the latest master.

This also explains:
- why the obvious regression only happened to {{sortedTwoInput}} and {{sortedMultiInput}} and not to {{sortedOneInput}}
- why the performance increased on 07-20, again only for {{sortedTwoInput}} and {{sortedMultiInput}}

*Conclusion*

It is expected that more slots may be needed for a batch job to run tasks simultaneously. However, this does not mean that more resources are needed, because in theory each slot can be smaller now that it is no longer shared. Therefore, this major regression is expected and acceptable.

Note that there still seems to be a minor regression (~1%) after FLINK-23372. The cause may be the increased overhead of slot allocation or memory initialization, as Stephan [commented above|https://issues.apache.org/jira/browse/FLINK-23593?focusedCommentId=17393287&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17393287]. In my opinion, it is also acceptable.

*Attachment*

||Benchmark||Score||Link||
|07-15 sortedTwoInput before FLINK-23372|1904.626380|[#418|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/418/]|
|07-15 sortedTwoInput after FLINK-23372|1782.644331|[#419|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/419/]|
|07-20 sortedTwoInput before FLINK-23372|1964.448112|[#420|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/420/]|
|07-20 sortedTwoInput after FLINK-23372|1944.880662|[#421|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/421/]|
|latest sortedTwoInput on latest master|1926.685377|[#413|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/413/]|
|latest sortedTwoInput reverting FLINK-23372 on latest master|1938.716479|[#414|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/414/]|
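To put the attachment scores into proportion, the relative drops can be computed directly. The sketch below assumes the scores are throughput values where higher is better, as in the JMH output quoted earlier in this thread; the class and method names are made up for this example. Each score is a single benchmark request, and the JMH runs quoted earlier report errors of roughly ±16 to ±50 ops/ms, so differences around 1% should be read with care.

```java
import java.util.Locale;

/** Relative throughput change between two benchmark scores (higher score = better). */
final class RegressionMath {

    /** Percentage by which {@code after} is lower than {@code before}. */
    static double dropPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Scores copied from the attachment table (requests #418 to #421, #414, #413).
        String[] labels = {"07-15", "07-20", "latest master"};
        double[][] pairs = {
            {1904.626380, 1782.644331}, // 07-15: before vs. after FLINK-23372
            {1964.448112, 1944.880662}, // 07-20: before vs. after FLINK-23372
            {1938.716479, 1926.685377}, // latest master: FLINK-23372 reverted vs. not reverted
        };
        for (int i = 0; i < pairs.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%s: %.2f%% lower",
                    labels[i], dropPercent(pairs[i][0], pairs[i][1])));
        }
    }
}
```

With these inputs the 07-15 pair differs by a bit more than 6%, while the 07-20 and latest-master pairs differ by about 1% or less, which matches the split into a major and a minor regression made in the comment.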
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396636#comment-17396636 ]

Zhu Zhu edited comment on FLINK-23593 at 8/11/21, 2:53 AM:
---

I think I found the cause of the regression.

*Cause*

The regression happens because FLINK-23372 disables slot sharing of batch job tasks, and a default MiniCluster provides just 1 task manager with 1 slot. This means that the two source tasks of {{sortedTwoInput}} were able to run simultaneously before FLINK-23372 and had to run sequentially after FLINK-23372 was merged. This increased the total execution time and resulted in the regression.

Later, on 07-20, an [improvement|https://github.com/apache/flink-benchmarks/commit/70d9b7b4927fc38ecf0950e55a47325b71e2dd63] was made to flink-benchmarks that changed the MiniCluster to be pre-launched with 1 task manager with 4 slots. This enabled the two source tasks of {{sortedTwoInput}} to run simultaneously again, and the regression was gone. That is also why we cannot reproduce the regression by reverting FLINK-23372 on the latest master.

This also explains:
- why the regression only happened to {{sortedTwoInput}} and {{sortedMultiInput}} and not to {{sortedOneInput}}
- why the performance increased on 07-20, again only for {{sortedTwoInput}} and {{sortedMultiInput}}

*Conclusion*

It is expected that more slots may be needed for a batch job to run tasks simultaneously. However, this does not mean that more resources are needed, because in theory each slot can be smaller now that it is no longer shared. Therefore, this regression is expected and acceptable.

*Attachment*

||Benchmark||Score||Link||
|07-15 sortedTwoInput before FLINK-23372|1904.626380|[#418|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/418/]|
|07-15 sortedTwoInput after FLINK-23372|1782.644331|[#419|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/419/]|
|07-20 sortedTwoInput before FLINK-23372|1964.448112|[#420|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/420/]|
|07-20 sortedTwoInput after FLINK-23372|1944.880662|[#421|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/421/]|
|latest sortedTwoInput on latest master|1926.685377|[#413|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/413/]|
|latest sortedTwoInput reverting FLINK-23372 on latest master|1938.716479|[#414|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/414/]|
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396636#comment-17396636 ]

Zhu Zhu edited comment on FLINK-23593 at 8/11/21, 3:01 AM:
---

I think I found the cause of the major regression.

*Cause*

The major regression happens because FLINK-23372 disables slot sharing of batch job tasks, and a default MiniCluster provides just 1 task manager with 1 slot. This means that the two source tasks of {{sortedTwoInput}} were able to run simultaneously before FLINK-23372 and had to run sequentially after FLINK-23372 was merged. This increased the total execution time and resulted in the major regression.

Later, on 07-20, an [improvement|https://github.com/apache/flink-benchmarks/commit/70d9b7b4927fc38ecf0950e55a47325b71e2dd63] was made to flink-benchmarks that changed the MiniCluster to be pre-launched with 1 task manager with 4 slots. This enabled the two source tasks of {{sortedTwoInput}} to run simultaneously again, and the major regression was gone. That is why we cannot reproduce the obvious regression by reverting FLINK-23372 on latest master.

This also explains:
- why the obvious regression only happened to {{sortedTwoInput}} and {{sortedMultiInput}} and not to {{sortedOneInput}}.
- why the performance increased on 07-20, and why that also only happened to {{sortedTwoInput}} and {{sortedMultiInput}}.

*Conclusion*

It is expected that more slots may be needed for a batch job to run tasks simultaneously. However, this does not mean more resources are needed, because theoretically each slot can be smaller since it is no longer shared. Therefore, this regression is expected and acceptable.

Note that there still seems to be a minor regression (~1%) after applying FLINK-23372. A possible reason is explained above in Stephan's [comment|https://issues.apache.org/jira/browse/FLINK-23593?focusedCommentId=17393287&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17393287]. It's also acceptable in my opinion.

*Attachment*

||Benchmark||Score||Link||
|07-15 sortedTwoInput sharing (right before FLINK-23372)|1904.626380|[#418|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/418/]|
|07-15 sortedTwoInput non-sharing (right after FLINK-23372)|1782.644331|[#419|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/419/]|
|07-20 sortedTwoInput sharing (right before FLINK-23372)|1964.448112|[#420|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/420/]|
|07-20 sortedTwoInput non-sharing (right after FLINK-23372)|1944.880662|[#421|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/421/]|
|latest sortedTwoInput sharing (reverting FLINK-23372 on latest master)|1938.716479|[#414|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/414/]|
|latest sortedTwoInput sharing on latest master|1926.685377|[#413|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/413/]|

was (Author: zhuzh):

I think I found the cause of the regression.

*Cause*

The regression happens because FLINK-23372 disables slot sharing of batch job tasks, and a default MiniCluster provides just 1 task manager with 1 slot. This means that the two source tasks of {{sortedTwoInput}} were able to run simultaneously before FLINK-23372 and had to run sequentially after FLINK-23372 was merged. This increased the total execution time and resulted in the regression.

Later, on 07-20, an [improvement|https://github.com/apache/flink-benchmarks/commit/70d9b7b4927fc38ecf0950e55a47325b71e2dd63] was made to flink-benchmarks that changed the MiniCluster to be pre-launched with 1 task manager with 4 slots. This enabled the two source tasks of {{sortedTwoInput}} to run simultaneously again, and the regression was gone. That is why we cannot reproduce the regression by reverting FLINK-23372 on latest master.

This also explains:
- why the regression only happened to {{sortedTwoInput}} and {{sortedMultiInput}} and not to {{sortedOneInput}}.
- why the performance increased on 07-20, and why that also only happened to {{sortedTwoInput}} and {{sortedMultiInput}}.

*Conclusion*

It is expected that more slots may be needed for a batch job to run tasks simultaneously. However, this does not mean more resources are needed, because theoretically each slot can be smaller since it is no longer shared. Therefore, this regression is expected and acceptable.

*Attachment*

||Benchmark||Score||Link||
|07-15 sortedTwoInput before FLINK-23372|1904.626380|[#418|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/418/]|
|07-15 sortedTwoInput after FLINK-23372|1782.644331|[#419|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/419/]|
|07-20 sortedTwoInput before FLINK-23372|1964.448112|[#420|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/420/]|
|07-20 sortedTwoInput after FLINK-23372|194
[jira] [Comment Edited] (FLINK-23593) Performance regression on 15.07.2021
[ https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396636#comment-17396636 ]

Zhu Zhu edited comment on FLINK-23593 at 8/11/21, 3:01 AM:
---

I think I found the cause of the major regression.

*Cause*

The major regression happens because FLINK-23372 disables slot sharing of batch job tasks, and a default MiniCluster provides just 1 task manager with 1 slot. This means that the two source tasks of {{sortedTwoInput}} were able to run simultaneously before FLINK-23372 and had to run sequentially after FLINK-23372 was merged. This increased the total execution time and resulted in the major regression.

Later, on 07-20, an [improvement|https://github.com/apache/flink-benchmarks/commit/70d9b7b4927fc38ecf0950e55a47325b71e2dd63] was made to flink-benchmarks that changed the MiniCluster to be pre-launched with 1 task manager with 4 slots. This enabled the two source tasks of {{sortedTwoInput}} to run simultaneously again, and the major regression was gone. That is why we cannot reproduce the obvious regression by reverting FLINK-23372 on latest master.

This also explains:
- why the obvious regression only happened to {{sortedTwoInput}} and {{sortedMultiInput}} and not to {{sortedOneInput}}.
- why the performance increased on 07-20, and why that also only happened to {{sortedTwoInput}} and {{sortedMultiInput}}.

*Conclusion*

It is expected that more slots may be needed for a batch job to run tasks simultaneously. However, this does not mean more resources are needed, because theoretically each slot can be smaller since it is no longer shared. Therefore, this regression is expected and acceptable.

Note that there still seems to be a minor regression (~1%) after applying FLINK-23372. A possible reason is explained above in Stephan's [comment|https://issues.apache.org/jira/browse/FLINK-23593?focusedCommentId=17393287&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17393287]. It's also acceptable in my opinion.

*Attachment*

||Benchmark||Score||Link||
|07-15 sortedTwoInput sharing (right before FLINK-23372)|1904.626380|[#418|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/418/]|
|07-15 sortedTwoInput non-sharing (right after FLINK-23372)|1782.644331|[#419|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/419/]|
|07-20 sortedTwoInput sharing (right before FLINK-23372)|1964.448112|[#420|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/420/]|
|07-20 sortedTwoInput non-sharing (right after FLINK-23372)|1944.880662|[#421|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/421/]|
|latest sortedTwoInput sharing (reverting FLINK-23372 on latest master)|1938.716479|[#414|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/414/]|
|latest sortedTwoInput non-sharing on latest master|1926.685377|[#413|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/413/]|

was (Author: zhuzh):

I think I found the cause of the major regression.

*Cause*

The major regression happens because FLINK-23372 disables slot sharing of batch job tasks, and a default MiniCluster provides just 1 task manager with 1 slot. This means that the two source tasks of {{sortedTwoInput}} were able to run simultaneously before FLINK-23372 and had to run sequentially after FLINK-23372 was merged. This increased the total execution time and resulted in the major regression.

Later, on 07-20, an [improvement|https://github.com/apache/flink-benchmarks/commit/70d9b7b4927fc38ecf0950e55a47325b71e2dd63] was made to flink-benchmarks that changed the MiniCluster to be pre-launched with 1 task manager with 4 slots. This enabled the two source tasks of {{sortedTwoInput}} to run simultaneously again, and the major regression was gone. That is why we cannot reproduce the obvious regression by reverting FLINK-23372 on latest master.

This also explains:
- why the obvious regression only happened to {{sortedTwoInput}} and {{sortedMultiInput}} and not to {{sortedOneInput}}.
- why the performance increased on 07-20, and why that also only happened to {{sortedTwoInput}} and {{sortedMultiInput}}.

*Conclusion*

It is expected that more slots may be needed for a batch job to run tasks simultaneously. However, this does not mean more resources are needed, because theoretically each slot can be smaller since it is no longer shared. Therefore, this regression is expected and acceptable.

Note that there still seems to be a minor regression (~1%) after applying FLINK-23372. A possible reason is explained above in Stephan's [comment|https://issues.apache.org/jira/browse/FLINK-23593?focusedCommentId=17393287&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17393287]. It's also acceptable in my opinion.

*Attachment*

||Benchmark||Score||Link||
|07-15 sortedTwoInput sharin