Mark Cho created FLINK-10907:
--------------------------------

Summary: Job recovery on the same JobManager causes JobManager metrics to report stale values
Key: FLINK-10907
URL: https://issues.apache.org/jira/browse/FLINK-10907
Project: Flink
Issue Type: Bug
Components: Core, Metrics
Affects Versions: 1.4.2
Environment: Verified the bug and the fix running on Flink 1.4. Based on the JobManagerMetricGroup.java code in master, this issue should still occur on Flink versions after 1.4.
Reporter: Mark Cho

* JobManager loses and regains leadership if it loses connection and reconnects to ZooKeeper.
* When it regains the leadership, it tries to recover the job graph.
* During the recovery, it will try to reuse the existing {{JobManagerMetricGroup}} to register new counters and gauges under the same metric name, which causes the new counters and gauges to be registered incorrectly.
* The old counters and gauges will continue to report the stale values and the new counters and gauges will not report the latest metric.

Relevant lines from logs
{code:java}
com.---.JobManager - Submitting recovered job e9e49fd9b8c61cf54b435f39aa49923f.
com.---.JobManager - Submitting job e9e49fd9b8c61cf54b435f39aa49923f (flink-job) (Recovery).
com.---.JobManager - Running initialization on master for job flink-job (e9e49fd9b8c61cf54b435f39aa49923f).
com.---.JobManager - Successfully ran initialization on master in 0 ms.
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'totalNumberOfCheckpoints'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'numberOfInProgressCheckpoints'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'numberOfCompletedCheckpoints'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'numberOfFailedCheckpoints'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointRestoreTimestamp'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointSize'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointDuration'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointAlignmentBuffered'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'lastCheckpointExternalPath'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'restartingTime'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'downtime'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'uptime'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'fullRestarts'. Metric will not be reported.[]
org.apache.flink.metrics.MetricGroup - Name collision: Group already contains a Metric with the name 'task_failures'. Metric will not be reported.[]
{code}
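
The failure mode described in the bullet points can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: {{SimpleMetricGroup}}, {{JobStats}}, and {{reportAfterRecovery}} are hypothetical names invented for illustration and are not Flink's {{JobManagerMetricGroup}} or its checkpoint statistics classes. The only behavior borrowed from the report is that a group keeps the first metric registered under a name and ignores later registrations with the same name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Toy model only: SimpleMetricGroup and JobStats are hypothetical stand-ins,
// not Flink's JobManagerMetricGroup or its checkpoint statistics classes.
class StaleMetricDemo {

    static class SimpleMetricGroup {
        private final Map<String, LongSupplier> gauges = new HashMap<>();

        // Keeps the first gauge registered under a name and rejects later ones.
        boolean gauge(String name, LongSupplier gauge) {
            if (gauges.containsKey(name)) {
                System.out.println("Name collision: '" + name
                        + "' already registered; new gauge ignored.");
                return false;
            }
            gauges.put(name, gauge);
            return true;
        }

        // Drops every registered gauge so that names can be reused.
        void clear() {
            gauges.clear();
        }

        long report(String name) {
            return gauges.get(name).getAsLong();
        }
    }

    // Per-execution statistics object that backs the gauge.
    static class JobStats {
        long totalNumberOfCheckpoints;
    }

    // Simulates a job running, then being recovered on the same group.
    static long reportAfterRecovery(boolean clearGroupFirst) {
        SimpleMetricGroup group = new SimpleMetricGroup();

        // First execution: gauge reads from the first stats object.
        JobStats before = new JobStats();
        group.gauge("totalNumberOfCheckpoints", () -> before.totalNumberOfCheckpoints);
        before.totalNumberOfCheckpoints = 5;

        if (clearGroupFirst) {
            group.clear();
        }

        // Recovery: a new stats object is created and registered under the same name.
        JobStats after = new JobStats();
        group.gauge("totalNumberOfCheckpoints", () -> after.totalNumberOfCheckpoints);
        after.totalNumberOfCheckpoints = 2;

        // What a reporter polling the group would see.
        return group.report("totalNumberOfCheckpoints");
    }

    public static void main(String[] args) {
        System.out.println("reused group:  " + reportAfterRecovery(false)); // 5 (stale)
        System.out.println("cleared group: " + reportAfterRecovery(true));  // 2
    }
}
```

In this toy model, reusing the group leaves the gauge bound to the pre-recovery statistics object, so the reported value stays at 5 even though the recovered job has counted 2. Clearing the group before re-registering avoids the collision here, but that is only one conceivable approach; the issue does not describe the fix that was verified.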