[ https://issues.apache.org/jira/browse/FLINK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martijn Visser updated FLINK-21513: ----------------------------------- Fix Version/s: (was: 1.16.0) > Rethink up-/down-/restartingTime metrics > ---------------------------------------- > > Key: FLINK-21513 > URL: https://issues.apache.org/jira/browse/FLINK-21513 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination, Runtime / Metrics > Reporter: Chesnay Schepler > Priority: Minor > Labels: auto-deprioritized-major, reactive > > While thinking about FLINK-21510 I stumbled upon some issues in the the > semantics of these metrics, both from a user perspective and from our own, > and I think we need to clarify some things. > h4. upTime > This metric describes the time since the job transitioned RUNNING state. > It is meant as a measure for how stably a deployment is. > In the default scheduler this transitions happens before we do any actual > scheduling work, and as a result this also includes the time it takes for the > JM to request slots and deploy tasks. In practive this means we start the > timer once the job has been submitted and the JobMaster/Scheduler/EG have > been initialized. > For the adaptive scheduler this now puts us a bit into an odd situation > because it first acquires slots before actually transitioning the EG into a > RUNNING state, so as is we'd end up measuring 2 slightly different things. > The question now is whether this is a problem. > While we could certainly stick with the definition of "time since EG switched > to RUNNING", it raises the question what the semantics of this metric are > should a scheduler use a different data-structure than the EG. > In other words, what I'm looking for is a definition that is independent from > existing data-structures; a crude example could be "The time since the job is > in a state where the deployment of a task is possible.". > An alternative for the adaptive scheduler would be to measure the time since > we transitioned to WaitingForResources, with which we would also include the > slot acquisition, but it would be inconsistent with the logs and UI (because > they only display an INITIALIZING job). > h4. restartingTime > This metric describes the time since the job transitioned into a RESTARTING > state. > It is meant as a measure for how long the recovery in case of a job failure > takes. > In the default scheduler this in practice is the time between a failure > arriving at the JM and the cancellation of tasks being completed / restart > backoff (whichever is higher). > This is consistent with the semantics of the upTime metric, because upTime > also includes the time required for acquiring slots and deploying tasks. > For the adaptive scheduler we can follow similar semantics, by measuring the > time we spend in the {{Restarting}} state. > However, if we stick to the definition of upTime as time spent in RUNNING, > then we will end up with a gap for the time spent in WaitingForResources. > h4. downTime > This metric describes the time between the job transitioning from FAILING to > RUNNING. > It is meant as a measure for how long the recovery in case of a job failure > takes. > You may be wondering what the difference between {{downTime}} and > {{restartingTime}} is meant to be. Unfortunately I do not have the answer to > that. > Presumably, at the time they were added, they were covering different parts > of the recovery process, but since we never documented these steps explicitly > the exact semantics are no longer clear and there are no specs that a > scheduler can follow. > Furthermore, this metric is currently broken because a FAILING job _never_ > transitions into RUNNING anymore. > The default scheduler transitions from RUNNING -> RESTARTING -> RUNNING, > whereas the adaptive scheduler cancels the job and creates a new EG. > As it is we could probably merge downTime and restartingTime because they > seem to cover the exact same thing. -- This message was sent by Atlassian Jira (v8.20.10#820010)