[jira] [Commented] (FLINK-10484) New latency tracking metrics format causes metrics cardinality explosion

Jamie Grier (JIRA) Wed, 03 Oct 2018 11:19:21 -0700


    [ https://issues.apache.org/jira/browse/FLINK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637332#comment-16637332 ]


Jamie Grier commented on FLINK-10484:
-------------------------------------

[~Zentol] Great.  I didn't see that this had already been addressed in 1.7.  
What do you think about the difficulty of backporting to 1.5 and 1.6?

Currently, it's a serious problem for anyone trying to run Flink at any 
reasonable scale. Since latency tracking is on by default, basically 
everything breaks as soon as you upgrade a job from 1.4 to 1.5.  Also, latency 
tracking has to be disabled from application code rather than in the 
flink-conf.yaml file, so it's very hard for infra teams supporting Flink 
to enforce.

It's also not just a problem for the Flink JM: in our case we actually caused 
a company-wide observability incident just because of the sheer volume of 
metrics being thrown at our metrics servers.
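
To put rough numbers on that volume, here is a back-of-the-envelope sketch in 
Python. It assumes, as the issue description says, one latency metric per 
(source subtask, operator subtask) pair. The function name, the example job 
shape, and the values_per_metric multiplier (how many values a reporter emits 
per latency metric) are hypothetical and only for illustration.

```python
def latency_series_count(source_subtasks, operator_subtasks, values_per_metric=1):
    """Estimate how many latency series a single job reports.

    Assumption: one latency metric exists for every combination of
    source subtask index and operator subtask index.
    values_per_metric is a hypothetical multiplier for reporters that
    emit several values (e.g. percentiles) per metric.
    """
    if source_subtasks < 0 or operator_subtasks < 0 or values_per_metric < 0:
        raise ValueError("counts must be non-negative")
    return source_subtasks * operator_subtasks * values_per_metric


# Hypothetical job: 2 sources and 10 downstream operators, all at parallelism 100.
parallelism = 100
source_subtasks = 2 * parallelism       # 200 source subtasks
operator_subtasks = 10 * parallelism    # 1000 operator subtasks

series = latency_series_count(source_subtasks, operator_subtasks)
print(series)  # 200 * 1000 = 200000 series for one job
```

The count grows with the product of the two subtask counts, so doubling the 
parallelism of such a job roughly quadruples the number of series.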

> New latency tracking metrics format causes metrics cardinality explosion
> ------------------------------------------------------------------------
>
>                 Key: FLINK-10484
>                 URL: https://issues.apache.org/jira/browse/FLINK-10484
>             Project: Flink
>          Issue Type: Bug
>          Components: Metrics
>    Affects Versions: 1.6.0, 1.6.1, 1.5.4
>            Reporter: Jamie Grier
>            Assignee: Jamie Grier
>            Priority: Critical
>
> The new metrics format for latency tracking causes a huge metrics cardinality 
> explosion due to the format and the fact that there is a metric reported for 
> every combination of source subtask index and operator subtask index.  
> Yikes!
> This format is actually responsible for basically taking down our metrics 
> system due to DDOSing our metrics servers (at Lyft).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
