Frederic Hemery created FLINK-24756:
---------------------------------------
Summary: Flink metric identifiers contain group variables.
Key: FLINK-24756
URL: https://issues.apache.org/jira/browse/FLINK-24756
Project: Flink
Issue Type: Improvement
Components: Runtime / Metrics
Reporter: Frederic Hemery

Metric identifiers are built by concatenating the metric identifier of the closest {{ComponentMetricGroup}} (which is configurable) with the whole hierarchy of groups that have been added.

In a monitoring system like Datadog, this poses a challenge because it is tricky to aggregate across metric identifiers. Datadog instead relies on a single metric identifier with different tags to distinguish between time series.

Using the Flink Datadog integration, we get:

||Metric Name||Tags||
|flink.operator.KafkaSourceReader.topic.resources.partition.0.committedOffset|[topic:resources,partition:0]|
|flink.operator.KafkaSourceReader.topic.resources.partition.1.committedOffset|[topic:resources,partition:1]|
|flink.operator.KafkaSourceReader.topic.resources.partition.2.committedOffset|[topic:resources,partition:2]|
|flink.operator.KafkaSourceReader.topic.resources.partition.3.committedOffset|[topic:resources,partition:3]|
|flink.operator.KafkaSourceReader.topic.resources.partition.4.committedOffset|[topic:resources,partition:4]|
|flink.operator.KafkaSourceReader.topic.resources.partition.5.committedOffset|[topic:resources,partition:5]|
|flink.operator.KafkaSourceReader.topic.resources.partition.6.committedOffset|[topic:resources,partition:6]|
|...|...|

Instead, the native way to represent these metrics in Datadog would be:

||Metric Name||Tags||
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:0]|
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:1]|
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:2]|
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:3]|
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:4]|
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:5]|
|flink.operator.KafkaSourceReader.committedOffset|[topic:resources,partition:6]|
|...|...|

For the same reason, the recommended way to configure the scopes for the {{ComponentMetricGroup}} in [Datadog Docs|https://docs.datadoghq.com/integrations/flink/#metric-collection] is to remove all the scopes from the templates. The metric identifier is built from the scopes, and the tags are built from the variables.

The issue seems to come from groups being part of both the scopes and the user variables. We can override this behavior by creating a custom metric group for user-reported metrics, but it cannot be overridden for metrics reported by Flink itself (in particular [native RocksDB|https://github.com/apache/flink/blob/664fdaeaccf910c587f3478dd80bb327b441e85a/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBNativeMetricMonitor.java#L78-L80] metrics and [Kafka|https://github.com/apache/flink/blob/99c2a415e9eeefafacf70762b6f54070f7911ceb/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/internals/AbstractFetcher.java#L501-L506] metrics).

I couldn't think of a simple, clean, and backward-compatible way to achieve such a change, though, so I'm looking for feedback.
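
To make the difference between the two tables concrete, here is a minimal sketch in plain Python. It is not Flink code: the function names and the way groups are modeled (a plain string for an ordinary group, a `(key, value)` tuple for a group that also defines a user variable) are hypothetical, and the `flink.operator` scope is assumed from the metric names shown above. It only illustrates the naming behavior described in this issue.

```python
from typing import List, Sequence, Tuple, Union

# Hypothetical model: a plain group is a name; a key/value group is a
# (key, value) pair that also contributes a user variable (tag).
Group = Union[str, Tuple[str, str]]


def current_identifier(scope: Sequence[str], groups: Sequence[Group], name: str) -> str:
    """Behavior described in the issue: every group, including the key and
    value of key/value groups, becomes part of the metric identifier."""
    parts: List[str] = list(scope)
    for group in groups:
        if isinstance(group, tuple):
            parts.extend(group)  # both key and value leak into the name
        else:
            parts.append(group)
    parts.append(name)
    return ".".join(parts)


def tag_friendly_identifier(scope: Sequence[str], groups: Sequence[Group], name: str) -> str:
    """Desired behavior: key/value groups are left out of the identifier
    because they are already reported as tags."""
    plain = [group for group in groups if isinstance(group, str)]
    return ".".join([*scope, *plain, name])


def tags(groups: Sequence[Group]) -> List[str]:
    """Tags derived from the key/value groups only."""
    return [f"{g[0]}:{g[1]}" for g in groups if isinstance(g, tuple)]


scope = ["flink", "operator"]  # assumed scope, taken from the tables above
groups: List[Group] = ["KafkaSourceReader", ("topic", "resources"), ("partition", "0")]

print(current_identifier(scope, groups, "committedOffset"))
print(tag_friendly_identifier(scope, groups, "committedOffset"))
print(tags(groups))
```

In this model the tags are identical in both cases; only the identifier changes, which is what would let a system like Datadog aggregate across partitions under a single metric name.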