[jira] [Updated] (FLINK-9352) In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
[ https://issues.apache.org/jira/browse/FLINK-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated FLINK-9352:
--
    Labels: pull-request-available  (was: )

> In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
> ---
>
>                 Key: FLINK-9352
>                 URL: https://issues.apache.org/jira/browse/FLINK-9352
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.5.0, 1.4.2, 1.6.0
>            Reporter: vinoyang
>            Assignee: vinoyang
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, the periodic checkpoint coordinator's startCheckpointScheduler uses
> *baseInterval* as the initialDelay parameter. *baseInterval* is also the
> checkpoint interval.
> In standalone checkpoint mode, many jobs are configured with the same checkpoint
> interval. When all jobs are recovered (after a cluster restart or a JobManager
> leadership switch), their checkpoint periods tend to align: every job's
> CheckpointCoordinator starts and triggers at approximately the same point in time.
> In our production scenario this caused high IO cost within the same time period.
> I suggest exposing the initial delay parameter of scheduleAtFixedRate as an API
> config, so that users can scatter checkpoints in this scenario.
>
> cc [~StephanEwen] [~Zentol]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
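The proposal in the description above can be illustrated with a small, self-contained Java sketch. It is not Flink's CheckpointCoordinator code: the class ScatteredCheckpointScheduler and the helpers randomInitialDelay and start are hypothetical names invented for this illustration, and choosing the delay at random is only one assumed way of scattering (the issue itself asks for a user-configurable value). The only real API used is java.util.concurrent's scheduleAtFixedRate, whose second argument is the initial delay the issue refers to.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of scattering checkpoint start times; not Flink code. */
public class ScatteredCheckpointScheduler {

    /** Picks an initial delay uniformly from [0, baseIntervalMillis] (assumed policy). */
    static long randomInitialDelay(long baseIntervalMillis) {
        if (baseIntervalMillis <= 0 || baseIntervalMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("interval out of range: " + baseIntervalMillis);
        }
        // The bound of nextLong is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextLong(baseIntervalMillis + 1);
    }

    /** Schedules the periodic trigger with an initial delay that is independent of the period. */
    static ScheduledFuture<?> start(ScheduledExecutorService timer, Runnable trigger,
                                    long initialDelayMillis, long baseIntervalMillis) {
        return timer.scheduleAtFixedRate(
                trigger, initialDelayMillis, baseIntervalMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        long baseIntervalMillis = 200L; // shortened so the demo finishes quickly
        ScheduledExecutorService timer = Executors.newScheduledThreadPool(1);
        try {
            // Three "jobs" recovered at the same moment, all with the same interval.
            for (int job = 1; job <= 3; job++) {
                final int jobId = job;
                long initialDelay = randomInitialDelay(baseIntervalMillis);
                System.out.println("job " + jobId + " first checkpoint after " + initialDelay + " ms");
                start(timer,
                        () -> System.out.println("job " + jobId + " triggers checkpoint"),
                        initialDelay, baseIntervalMillis);
            }
            Thread.sleep(3 * baseIntervalMillis);
        } finally {
            timer.shutdownNow();
        }
    }
}
```

With the behaviour the issue describes, every job would pass baseIntervalMillis as both the initial delay and the period, so jobs recovered together keep firing together. Decoupling the two arguments lets each job start its cycle at a different offset while the period stays unchanged.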
[jira] [Updated] (FLINK-9352) In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
[ https://issues.apache.org/jira/browse/FLINK-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-9352:
-
    Affects Version/s: 1.6.0
                       1.5.0
                       1.4.2

> In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
> ---
>
>                 Key: FLINK-9352
>                 URL: https://issues.apache.org/jira/browse/FLINK-9352
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.5.0, 1.4.2, 1.6.0
>            Reporter: vinoyang
>            Assignee: vinoyang
>            Priority: Major
[jira] [Updated] (FLINK-9352) In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
[ https://issues.apache.org/jira/browse/FLINK-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-9352:
-
    Issue Type: Improvement  (was: Bug)

> In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
> ---
>
>                 Key: FLINK-9352
>                 URL: https://issues.apache.org/jira/browse/FLINK-9352
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.4.2
>            Reporter: vinoyang
>            Assignee: vinoyang
>            Priority: Critical
[jira] [Updated] (FLINK-9352) In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
[ https://issues.apache.org/jira/browse/FLINK-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann updated FLINK-9352:
-
    Priority: Major  (was: Critical)

> In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
> ---
>
>                 Key: FLINK-9352
>                 URL: https://issues.apache.org/jira/browse/FLINK-9352
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.5.0, 1.4.2, 1.6.0
>            Reporter: vinoyang
>            Assignee: vinoyang
>            Priority: Major