[ 
https://issues.apache.org/jira/browse/FLINK-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396636#comment-17396636
 ] 

Zhu Zhu edited comment on FLINK-23593 at 8/10/21, 12:09 PM:
------------------------------------------------------------

I think I find the cause of the regression.

*Cause*
The regression happens because FLINK-23372 disables slot sharing of batch job 
tasks. And a default MiniCluster would just provide 1 task manager with 1 slot.
This means that the two source tasks of {{sortedTwoInput}} were able to run 
simultaneously before FLINK-23372 and had to run sequentially after FLINK-23372 
was merged. This increased the total execution time and resulted in the 
regression.

Later on 07-20, an 
[improvement|https://github.com/apache/flink-benchmarks/commit/70d9b7b4927fc38ecf0950e55a47325b71e2dd63]
 was made on flink-benchmarks and changed the MiniCluster to be pre-launched 
with 1 task manager with 4 slots. This enabled the two source tasks of 
{{sortedTwoInput}} to run simultaneously again. And the regression was gone. 
And that's why we cannot reproduce the regression by reverting FLINK-23372 on 
latest master.

This also explains that 
- why the regression only happened to {{sortedTwoInput}} and 
{{sortedMultiInput}} and not to {{sortedOneInput}}. 
- why the performance increased on 07-20 and it also only happened to 
{{sortedTwoInput}} and {{sortedMultiInput}}

*Conclusion*
It is expected that more slots may be needed for a batch job to run tasks 
simultaneously. However, this does not mean more resources are needed because 
theoretically each slot can be smaller because it is no longer shared. 
Therefore, this regression is expected and acceptable.

*Attachment*
||Benchmark||Score||Link||
|07-15 sortedTwoInput before 
FLINK-23372|1904.626380|[#418|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/418/]|
|07-15 sortedTwoInput after 
FLINK-23372|1782.644331|[#419|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/419/]|
|07-20 sortedTwoInput before 
FLINK-23372|1964.448112|[#420|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/420/]|
|07-20 sortedTwoInput after 
FLINK-23372|1944.880662|[#421|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/421/]|




was (Author: zhuzh):
I think I find the cause of the regression.

*Cause*
The regression happens because FLINK-23372 disables slot sharing of batch job 
tasks. And a default MiniCluster would just provide 1 task manager with 1 slot.
This means that the two source tasks of {{sortedTwoInput}} were able to run 
simultaneously before FLINK-23372 and had to run sequentially after FLINK-23372 
was merged. The increased the total execution time and resulted in the 
regression.

Later on 07-20, an 
[improvement|https://github.com/apache/flink-benchmarks/commit/70d9b7b4927fc38ecf0950e55a47325b71e2dd63]
 was made on flink-benchmarks and changed the MiniCluster to be pre-launched 
with 1 task manager with 4 slots. This enabled the two source tasks of 
{{sortedTwoInput}} to run simultaneously again. And the regression was gone. 
And that's why we cannot reproduce the regression by reverting FLINK-23372 on 
latest master.

This also explains that 
- why the regression only happened to {{sortedTwoInput}} and 
{{sortedMultiInput}} and not to {{sortedOneInput}}. 
- why the performance increased on 07-20 and it also only happened to 
{{sortedTwoInput}} and {{sortedMultiInput}}

*Conclusion*
It is expected that more slots may be needed for a batch job to run tasks 
simultaneously. However, this does not mean more resources are needed because 
theoretically each slot can be smaller because it is no longer shared. 
Therefore, this regression is expected and acceptable.

*Attachment*
||Benchmark||Score||Link||
|07-15 sortedTwoInput before 
FLINK-23372|1904.626380|[#418|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/418/]|
|07-15 sortedTwoInput after 
FLINK-23372|1782.644331|[#419|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/419/]|
|07-20 sortedTwoInput before 
FLINK-23372|1964.448112|[#420|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/420/]|
|07-20 sortedTwoInput after 
FLINK-23372|1944.880662|[#421|http://codespeed.dak8s.net:8080/job/flink-benchmark-request/421/]|



> Performance regression on 15.07.2021
> ------------------------------------
>
>                 Key: FLINK-23593
>                 URL: https://issues.apache.org/jira/browse/FLINK-23593
>             Project: Flink
>          Issue Type: Bug
>          Components: API / DataStream, Benchmarks
>    Affects Versions: 1.14.0
>            Reporter: Piotr Nowojski
>            Assignee: Timo Walther
>            Priority: Blocker
>             Fix For: 1.14.0
>
>
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedMultiInput&env=2
> http://codespeed.dak8s.net:8000/timeline/?ben=sortedTwoInput&env=2
> {noformat}
> pnowojski@piotr-mbp: [~/flink -  ((no branch, bisect started on pr/16589))] $ 
> git ls f4afbf3e7de..eb8100f7afe
> eb8100f7afe [4 weeks ago] (pn/bad, bad, refs/bisect/bad) 
> [FLINK-22017][coordination] Allow BLOCKING result partition to be 
> individually consumable [Thesharing]
> d2005268b1e [4 weeks ago] (HEAD, pn/bisect-4, bisect-4) 
> [FLINK-22017][coordination] Get the ConsumedPartitionGroup that 
> IntermediateResultPartition and DefaultResultPartition belong to [Thesharing]
> d8b1a6fd368 [3 weeks ago] [FLINK-23372][streaming-java] Disable 
> AllVerticesInSameSlotSharingGroupByDefault in batch mode [Timo Walther]
> 4a78097d038 [3 weeks ago] (pn/bisect-3, bisect-3, 
> refs/bisect/good-4a78097d0385749daceafd8326930c8cc5f26f1a) 
> [FLINK-21928][clients][runtime] Introduce static method constructors of 
> DuplicateJobSubmissionException for better readability. [David Moravek]
> 172b9e32215 [3 weeks ago] [FLINK-21928][clients] JobManager failover should 
> succeed, when trying to resubmit already terminated job in application mode. 
> [David Moravek]
> f483008db86 [3 weeks ago] [FLINK-21928][core] Introduce 
> org.apache.flink.util.concurrent.FutureUtils#handleException method, that 
> allows future to recover from the specied exception. [David Moravek]
> d7ac08c2ac0 [3 weeks ago] (pn/bisect-2, bisect-2, 
> refs/bisect/good-d7ac08c2ac06b9ff31707f3b8f43c07817814d4f) 
> [FLINK-22843][docs-zh] Document and code are inconsistent [ZhiJie Yang]
> 16c3ea427df [3 weeks ago] [hotfix] Split the final checkpoint related tests 
> to a separate test class. [Yun Gao]
> 31b3d37a22c [7 weeks ago] [FLINK-21089][runtime] Skip the execution of new 
> sources if finished on restore [Yun Gao]
> 20fe062e1b5 [3 weeks ago] [FLINK-21089][runtime] Skip execution for the 
> legacy source task if finished on restore [Yun Gao]
> 874c627114b [3 weeks ago] [FLINK-21089][runtime] Skip the lifecycle method of 
> operators if finished on restore [Yun Gao]
> ceaf24b1d88 [3 weeks ago] (pn/bisect-1, bisect-1, 
> refs/bisect/good-ceaf24b1d881c2345a43f305d40435519a09cec9) [hotfix] Fix 
> isClosed() for operator wrapper and proxy operator close to the operator 
> chain [Yun Gao]
> 41ea591a6db [3 weeks ago] [FLINK-22627][runtime] Remove unused slot request 
> protocol [Yangze Guo]
> 489346b60f8 [3 months ago] [FLINK-22627][runtime] Remove PendingSlotRequest 
> [Yangze Guo]
> 8ffb4d2af36 [3 months ago] [FLINK-22627][runtime] Remove TaskManagerSlot 
> [Yangze Guo]
> 72073741588 [3 months ago] [FLINK-22627][runtime] Remove SlotManagerImpl and 
> its related tests [Yangze Guo]
> bdb3b7541b3 [3 months ago] [hotfix][yarn] Remove unused internal options in 
> YarnConfigOptionsInternal [Yangze Guo]
> a6a9b192eac [3 weeks ago] [FLINK-23201][streaming] Reset alignment only for 
> the currently processed checkpoint [Anton Kalashnikov]
> b35701a35c7 [3 weeks ago] [FLINK-23201][streaming] Calculate checkpoint 
> alignment time only for last started checkpoint [Anton Kalashnikov]
> 3abec22c536 [3 weeks ago] [FLINK-23107][table-runtime] Separate 
> implementation of deduplicate rank from other rank functions [Shuo Cheng]
> 1a195f5cc59 [3 weeks ago] [FLINK-16093][docs-zh] Translate "System Functions" 
> page of "Functions" into Chinese (#16348) [ZhiJie Yang]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to