[ https://issues.apache.org/jira/browse/FLINK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636850#comment-16636850 ]
Chesnay Schepler commented on FLINK-10484:
------------------------------------------

In FLINK-10243 we introduced a switch to reduce the amount of data for the latency source, see https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#metrics-latency-granularity. This can be used to drastically reduce the number of latency metrics. We could look into back-porting this.

The "cardinality explosion" is caused by introducing proper support for custom tags, which we used here for consistency, as it was always a bit odd that you only had a tag for the receiving operator ID, but not for the source.

The issue of effectively uncontrollable tags (since they're unaffected by scope formats) was raised before, like in FLINK-7935, but I haven't found time to address it as it requires a more thorough rework of the internals. All the tag-based scope goodies were pretty much tacked on after the fact, and now things are scattered all over the place :(

> New latency tracking metrics format causes metrics cardinality explosion
> ------------------------------------------------------------------------
>
>                 Key: FLINK-10484
>                 URL: https://issues.apache.org/jira/browse/FLINK-10484
>             Project: Flink
>          Issue Type: Bug
>          Components: Metrics
>    Affects Versions: 1.6.0, 1.6.1, 1.5.4
>            Reporter: Jamie Grier
>            Assignee: Jamie Grier
>            Priority: Critical
>
> The new metrics format for latency tracking causes a huge metrics cardinality
> explosion due to the format and the fact that there is a metric reported for
> every combination of source subtask index and operator subtask index.
> Yikes!
> This format is actually responsible for basically taking down our metrics
> system due to DDOSing our metrics servers (at Lyft).
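
To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch in Python. It is not Flink code: the function name `latency_series` and the job dimensions are hypothetical, and the counting model is a simplifying assumption (one latency histogram per receiving operator subtask, multiplied by however many source dimensions the chosen granularity distinguishes). The granularity names `single`, `operator` and `subtask` are taken to be the values documented for `metrics.latency.granularity`. Reporters usually expand each histogram into several series (count, min, max, percentiles), which this sketch ignores.

```python
def latency_series(num_sources, source_parallelism,
                   num_operators, operator_parallelism,
                   granularity="operator"):
    """Estimate the number of latency histograms a job reports.

    Simplified model (an assumption, not Flink's actual implementation):
    every receiving operator subtask reports one histogram per distinct
    "source identity" that the chosen granularity distinguishes.
    """
    per_receiving_subtask = {
        # no differentiation between sources or source subtasks
        "single": 1,
        # one histogram per source operator
        "operator": num_sources,
        # one histogram per (source operator, source subtask index) pair
        "subtask": num_sources * source_parallelism,
    }[granularity]
    return num_operators * operator_parallelism * per_receiving_subtask


if __name__ == "__main__":
    # Hypothetical job: 2 sources and 10 operators, all at parallelism 100.
    for g in ("single", "operator", "subtask"):
        print(g, latency_series(2, 100, 10, 100, g))
```

Under these assumptions the `subtask` case grows with the product of source and operator parallelism, which matches the "every combination of source subtask index and operator subtask index" behaviour described in the issue, while the coarser settings drop one or both factors.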