[ 
https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396636#comment-17396636
 ] 

Zhu Zhu edited comment on FLINK-23593 at 8/11/21, 3:01 AM:
-----------------------------------------------------------

I think I have found the cause of the major regression.

*Cause*
The major regression happens because FLINK-23372 disables slot sharing for batch 
job tasks, while a default MiniCluster provides only 1 task manager with 1 slot.
This means that the two source tasks of {{sortedTwoInput}} could run 
simultaneously before FLINK-23372 but had to run sequentially after it was 
merged. The increased total execution time shows up as the major regression.
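To illustrate the scheduling effect (a minimal sketch, not Flink code; the function name and durations are hypothetical): with 1 slot and slot sharing disabled, two source tasks run back-to-back, so wall-clock time is roughly the sum of their runtimes; with 2 or more slots it is roughly the maximum.

```python
import heapq

def wall_clock_time(task_durations, num_slots):
    """Greedy schedule: assign each task to the earliest-free slot,
    return the time at which the last task finishes."""
    slots = [0.0] * num_slots  # finish time per slot
    heapq.heapify(slots)
    for d in sorted(task_durations, reverse=True):
        start = heapq.heappop(slots)   # earliest-free slot
        heapq.heappush(slots, start + d)
    return max(slots)

two_sources = [1.0, 1.0]  # two equally long source tasks (hypothetical)
print(wall_clock_time(two_sources, num_slots=1))  # sequential: 2.0
print(wall_clock_time(two_sources, num_slots=2))  # simultaneous: 1.0
```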

Later, on 07-20, an 
[improvement|https://github.com/apache/flink-benchmarks/commit/70d9b7b4927fc38ecf0950e55a47325b71e2dd63]
 to flink-benchmarks changed the MiniCluster to be pre-launched with 1 task 
manager with 4 slots. This allowed the two source tasks of {{sortedTwoInput}} 
to run simultaneously again, and the major regression disappeared. That is why 
we cannot reproduce the obvious regression by reverting FLINK-23372 on the 
latest master.

This also explains:
- why the obvious regression only happened to {{sortedTwoInput}} and 
{{sortedMultiInput}}, and not to {{sortedOneInput}};
- why the performance increase on 07-20 also only happened to 
{{sortedTwoInput}} and {{sortedMultiInput}}.

*Conclusion*
It is expected that a batch job may need more slots to run its tasks 
simultaneously. However, this does not mean more resources are needed, because 
in theory each slot can be smaller once it is no longer shared. Therefore, this 
regression is expected and acceptable.

Note that there still seems to be a minor regression (~1%) after applying 
FLINK-23372. The possible reason is explained above in Stephan's 
[comment|https://issues.apache.org/jira/browse/FLINK-23593?focusedCommentId=17393287&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17393287].
 It is also acceptable in my opinion.

*Attachment*
||Benchmark||Score||Link||
|07-15 sortedTwoInput sharing (right before FLINK-23372)|1904.626380|[#418|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/418/]|
|07-15 sortedTwoInput non-sharing (right after FLINK-23372)|1782.644331|[#419|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/419/]|
|07-20 sortedTwoInput sharing (right before FLINK-23372)|1964.448112|[#420|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/420/]|
|07-20 sortedTwoInput non-sharing (right after FLINK-23372)|1944.880662|[#421|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/421/]|
|latest sortedTwoInput sharing (reverting FLINK-23372 on latest master)|1938.716479|[#414|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/414/]|
|latest sortedTwoInput non-sharing on latest master|1926.685377|[#413|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/413/]|
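For reference, the relative slowdowns implied by the table (a quick computation from the scores above, where a lower score means slower; not part of the original report):

```python
# Benchmark scores copied from the table above.
scores = {
    "07-15 sharing": 1904.626380,
    "07-15 non-sharing": 1782.644331,
    "07-20 sharing": 1964.448112,
    "07-20 non-sharing": 1944.880662,
    "latest sharing": 1938.716479,
    "latest non-sharing": 1926.685377,
}

def regression(sharing, non_sharing):
    """Percentage drop of the non-sharing score relative to the sharing score."""
    return (sharing - non_sharing) / sharing * 100

print(f"{regression(scores['07-15 sharing'], scores['07-15 non-sharing']):.1f}%")   # 6.4%
print(f"{regression(scores['07-20 sharing'], scores['07-20 non-sharing']):.1f}%")   # 1.0%
print(f"{regression(scores['latest sharing'], scores['latest non-sharing']):.1f}%") # 0.6%
```

So the ~6.4% major regression on 07-15 shrank to roughly 1% once the MiniCluster was pre-launched with 4 slots, matching the ~1% minor regression discussed above.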





> Performance regression on 15.07.2021
> ------------------------------------
>
>                 Key: FLINK-23593
>                 URL: https://issues.apache.org/jira/browse/FLINK-23593
>             Project: Flink
>          Issue Type: Bug
>          Components: API / DataStream, Benchmarks
>    Affects Versions: 1.14.0
>            Reporter: Piotr Nowojski
>            Assignee: Timo Walther
>            Priority: Blocker
>             Fix For: 1.14.0
>
>
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput&env=2
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput&env=2
> {noformat}
> pnowojski@piotr-mbp: [~/flink -  ((no branch, bisect started on pr/16589))] $ 
> git ls f4afbf3e7de..eb8100f7afe
> eb8100f7afe [4 weeks ago] (pn/bad, bad, refs/bisect/bad) 
> [FLINK-22017][coordination] Allow BLOCKING result partition to be 
> individually consumable [Thesharing]
> d2005268b1e [4 weeks ago] (HEAD, pn/bisect-4, bisect-4) 
> [FLINK-22017][coordination] Get the ConsumedPartitionGroup that 
> IntermediateResultPartition and DefaultResultPartition belong to [Thesharing]
> d8b1a6fd368 [3 weeks ago] [FLINK-23372][streaming-java] Disable 
> AllVerticesInSameSlotSharingGroupByDefault in batch mode [Timo Walther]
> 4a78097d038 [3 weeks ago] (pn/bisect-3, bisect-3, 
> refs/bisect/good-4a78097d0385749daceafd8326930c8cc5f26f1a) 
> [FLINK-21928][clients][runtime] Introduce static method constructors of 
> DuplicateJobSubmissionException for better readability. [David Moravek]
> 172b9e32215 [3 weeks ago] [FLINK-21928][clients] JobManager failover should 
> succeed, when trying to resubmit already terminated job in application mode. 
> [David Moravek]
> f483008db86 [3 weeks ago] [FLINK-21928][core] Introduce 
> org.apache.flink.util.concurrent.FutureUtils#handleException method, that 
> allows future to recover from the specied exception. [David Moravek]
> d7ac08c2ac0 [3 weeks ago] (pn/bisect-2, bisect-2, 
> refs/bisect/good-d7ac08c2ac06b9ff31707f3b8f43c07817814d4f) 
> [FLINK-22843][docs-zh] Document and code are inconsistent [ZhiJie Yang]
> 16c3ea427df [3 weeks ago] [hotfix] Split the final checkpoint related tests 
> to a separate test class. [Yun Gao]
> 31b3d37a22c [7 weeks ago] [FLINK-21089][runtime] Skip the execution of new 
> sources if finished on restore [Yun Gao]
> 20fe062e1b5 [3 weeks ago] [FLINK-21089][runtime] Skip execution for the 
> legacy source task if finished on restore [Yun Gao]
> 874c627114b [3 weeks ago] [FLINK-21089][runtime] Skip the lifecycle method of 
> operators if finished on restore [Yun Gao]
> ceaf24b1d88 [3 weeks ago] (pn/bisect-1, bisect-1, 
> refs/bisect/good-ceaf24b1d881c2345a43f305d40435519a09cec9) [hotfix] Fix 
> isClosed() for operator wrapper and proxy operator close to the operator 
> chain [Yun Gao]
> 41ea591a6db [3 weeks ago] [FLINK-22627][runtime] Remove unused slot request 
> protocol [Yangze Guo]
> 489346b60f8 [3 months ago] [FLINK-22627][runtime] Remove PendingSlotRequest 
> [Yangze Guo]
> 8ffb4d2af36 [3 months ago] [FLINK-22627][runtime] Remove TaskManagerSlot 
> [Yangze Guo]
> 72073741588 [3 months ago] [FLINK-22627][runtime] Remove SlotManagerImpl and 
> its related tests [Yangze Guo]
> bdb3b7541b3 [3 months ago] [hotfix][yarn] Remove unused internal options in 
> YarnConfigOptionsInternal [Yangze Guo]
> a6a9b192eac [3 weeks ago] [FLINK-23201][streaming] Reset alignment only for 
> the currently processed checkpoint [Anton Kalashnikov]
> b35701a35c7 [3 weeks ago] [FLINK-23201][streaming] Calculate checkpoint 
> alignment time only for last started checkpoint [Anton Kalashnikov]
> 3abec22c536 [3 weeks ago] [FLINK-23107][table-runtime] Separate 
> implementation of deduplicate rank from other rank functions [Shuo Cheng]
> 1a195f5cc59 [3 weeks ago] [FLINK-16093][docs-zh] Translate "System Functions" 
> page of "Functions" into Chinese (#16348) [ZhiJie Yang]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
