[ https://issues.apache.org/jira/browse/FLINK-30623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17682121#comment-17682121 ]
Piotr Nowojski edited comment on FLINK-30623 at 1/30/23 2:38 PM:
-----------------------------------------------------------------

Thanks for the analysis [~fanrui]!
{quote}
It means 5 subtasks will share the same Unaligned checkpoint file. It will reduce the number of small files, but the UC time will become larger.
{quote}
What do you think is the actual reason behind the regression? Is it that we now have to enqueue writes from a couple of subtasks one after another, so that, for example, with 2 subtasks the second has to wait until the first completes its writes?

And what do you think is the impact of this setting in production setups?

If this is an issue related to the number of shared IO threads, it might be that we only increased the checkpoint time by a small constant ({{0.3ms / checkpoint}}, based on the numbers that [~fanrui] has quoted) that simply remains constant in more realistic setups. Increasing the checkpoint time by 0.3ms when checkpoints take a couple of seconds doesn't matter.
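To put the last point in numbers, here is a minimal sketch of how small a fixed 0.3ms overhead becomes relative to the total checkpoint time. The 0.3ms figure comes from the comment above; the checkpoint durations are hypothetical values chosen only for illustration, and the sketch assumes the overhead really does stay constant, which is the open question here and not an established result.

```python
def relative_overhead_percent(overhead_ms: float, checkpoint_ms: float) -> float:
    """Return the fixed overhead as a percentage of the total checkpoint time."""
    if checkpoint_ms <= 0:
        raise ValueError("checkpoint_ms must be positive")
    return 100.0 * overhead_ms / checkpoint_ms


# Constant overhead per checkpoint, taken from the comment above.
FIXED_OVERHEAD_MS = 0.3

# Hypothetical checkpoint durations (assumptions, not measurements):
# a micro-benchmark-sized checkpoint, a mid-sized one, and one taking
# a couple of seconds.
HYPOTHETICAL_CHECKPOINT_MS = (1.0, 100.0, 2000.0)

if __name__ == "__main__":
    for checkpoint_ms in HYPOTHETICAL_CHECKPOINT_MS:
        share = relative_overhead_percent(FIXED_OVERHEAD_MS, checkpoint_ms)
        print(f"checkpoint of {checkpoint_ms:.0f} ms: overhead is {share:.3f}% of the total")
```

If the assumption holds, the same 0.3ms that dominates a very short benchmark checkpoint would be a negligible share of a checkpoint that takes seconds.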
> Performance regression in checkpointSingleInput.UNALIGNED on 04.01.2023
> -----------------------------------------------------------------------
>
>                 Key: FLINK-30623
>                 URL: https://issues.apache.org/jira/browse/FLINK-30623
>             Project: Flink
>          Issue Type: Bug
>          Components: Benchmarks, Runtime / Checkpointing
>            Reporter: Martijn Visser
>            Assignee: Rui Fan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.17.0
>
>
> Performance regression
> checkpointSingleInput.UNALIGNED median=338.1445195 recent_median=67.6453005
> checkpointSingleInput.UNALIGNED_1 median=213.230041 recent_median=39.830277
> deployAllTasks.STREAMING median=168.533106 recent_median=159.8534395
> stateBackends.MEMORY median=3229.0248875 recent_median=2985.782919
> tupleKeyBy median=4155.684199 recent_median=3987.5812305
> http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=checkpointSingleInput.UNALIGNED&extr=on&quarts=on&equid=off&env=2&revs=200
> http://codespeed.dak8s.net:8000/timeline/#/?exe=1&ben=checkpointSingleInput.UNALIGNED_1&extr=on&quarts=on&equid=off&env=2&revs=200
> http://codespeed.dak8s.net:8000/timeline/#/?exe=8&ben=deployAllTasks.STREAMING&extr=on&quarts=on&equid=off&env=2&revs=200
> http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=stateBackends.MEMORY&extr=on&quarts=on&equid=off&env=2&revs=200
> http://codespeed.dak8s.net:8000/timeline/#/?exe=6&ben=tupleKeyBy&extr=on&quarts=on&equid=off&env=2&revs=200

--
This message was sent by Atlassian Jira
(v8.20.10#820010)