[ https://issues.apache.org/jira/browse/FLINK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637332#comment-16637332 ]
Jamie Grier commented on FLINK-10484:
-------------------------------------

[~Zentol] Great. I didn't see that this had already been addressed in 1.7. What do you think about the difficulty of backporting to 1.5 and 1.6?

Currently, it's a pretty big problem for people trying to run Flink at any reasonable scale – and since latency tracking is on by default, basically everything breaks as soon as you upgrade a job from 1.4 to 1.5. Also, latency tracking has to be disabled from application code rather than in the flink-conf.yaml file, so it's very hard for infra teams supporting Flink to enforce.

It's also not just a problem for the Flink JM – in our case we actually caused a company-wide observability incident just because of the sheer volume of metrics being thrown at our metrics servers.

> New latency tracking metrics format causes metrics cardinality explosion
> ------------------------------------------------------------------------
>
>                 Key: FLINK-10484
>                 URL: https://issues.apache.org/jira/browse/FLINK-10484
>             Project: Flink
>          Issue Type: Bug
>          Components: Metrics
>    Affects Versions: 1.6.0, 1.6.1, 1.5.4
>            Reporter: Jamie Grier
>            Assignee: Jamie Grier
>            Priority: Critical
>
> The new metrics format for latency tracking causes a huge metrics cardinality
> explosion due to the format and the fact that there is a metric reported for
> every combination of source subtask index and operator subtask index.
> Yikes!
> This format is actually responsible for basically taking down our metrics
> system due to DDOSing our metrics servers (at Lyft).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
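
To make the scale of the problem concrete, here is a rough sketch of the arithmetic behind "a metric for every combination of source subtask index and operator subtask index". The function name, the job shape (2 sources, 10 operators, parallelism 100) and the `stats_per_histogram` multiplier are illustrative assumptions, not figures from the issue or from Flink itself. The model simply multiplies source subtasks by operator subtasks.

```python
def latency_metric_series(num_sources, source_parallelism,
                          num_operators, operator_parallelism,
                          stats_per_histogram=1):
    """Estimate how many latency time series one job could emit.

    Assumption: one latency metric exists per (source subtask,
    operator subtask) pair, as the issue description states.
    stats_per_histogram is a hypothetical multiplier for reporters
    that expand each metric into several series (min, max, p99, ...).
    """
    source_subtasks = num_sources * source_parallelism
    operator_subtasks = num_operators * operator_parallelism
    return source_subtasks * operator_subtasks * stats_per_histogram


# Hypothetical job: 2 sources and 10 operators, all at parallelism 100.
pairs = latency_metric_series(2, 100, 10, 100)
print(pairs)  # 200000

# If the reporter emitted 8 series per metric (an assumed figure):
print(latency_metric_series(2, 100, 10, 100, stats_per_histogram=8))  # 1600000
```

Because the count grows with the product of the two parallelism values, doubling the parallelism of every vertex roughly quadruples the number of series under this model.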