[jira] [Commented] (FLINK-10484) New latency tracking metrics format causes metrics cardinality explosion

Chesnay Schepler (JIRA) Wed, 03 Oct 2018 04:54:31 -0700


    [ https://issues.apache.org/jira/browse/FLINK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636850#comment-16636850 ]


Chesnay Schepler commented on FLINK-10484:
------------------------------------------

In FLINK-10243 we introduced a switch (metrics.latency.granularity) that 
controls how finely latency is tracked per source, see 
https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#metrics-latency-granularity.
This can be used to drastically reduce the number of latency metrics. We could 
look into back-porting this.
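
As a rough illustration of why this switch matters, here is a back-of-the-envelope 
sketch of how the number of latency histograms scales under each granularity. It 
assumes the semantics described in the linked docs ("single" does not distinguish 
sources or subtasks, "operator" distinguishes sources only, "subtask" distinguishes 
sources and their subtasks). The function and parameter names are made up for this 
example and are not Flink APIs, and real series counts also depend on how a reporter 
expands each histogram.

```python
def latency_histogram_count(granularity, num_sources, source_parallelism,
                            operator_subtasks):
    """Estimate the number of latency histograms in a job (hypothetical helper).

    granularity        -- "single", "operator" or "subtask"
    num_sources        -- number of source operators in the job
    source_parallelism -- subtasks per source (assumed equal for all sources)
    operator_subtasks  -- total number of receiving operator subtasks
    """
    if granularity == "single":
        # One histogram per receiving subtask, all sources merged.
        per_receiver = 1
    elif granularity == "operator":
        # One histogram per source operator.
        per_receiver = num_sources
    elif granularity == "subtask":
        # One histogram per source subtask: the sources x subtasks combination.
        per_receiver = num_sources * source_parallelism
    else:
        raise ValueError("unknown granularity: %r" % granularity)
    return operator_subtasks * per_receiver


# Example: 2 sources at parallelism 100, 10 operators at parallelism 100.
for mode in ("single", "operator", "subtask"):
    print(mode, latency_histogram_count(mode, 2, 100, 10 * 100))
```

With these made-up numbers the estimate goes from 1000 histograms ("single") to 
2000 ("operator") to 200000 ("subtask").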

The "cardinality explosion" is caused by introducing proper support for custom 
tags, which we used here for consistency, since it was always a bit odd that 
you only had a tag for the receiving operator ID but not for the source.
The issue of effectively uncontrollable tags (since they're unaffected by scope 
formats) was raised before, like in FLINK-7935, but I haven't found time to 
address it as it requires a more thorough rework of the internals. All the 
tag-based scope goodies were pretty much tacked on after the fact, and now 
things are scattered all over the place :(

> New latency tracking metrics format causes metrics cardinality explosion
> ------------------------------------------------------------------------
>
>                 Key: FLINK-10484
>                 URL: https://issues.apache.org/jira/browse/FLINK-10484
>             Project: Flink
>          Issue Type: Bug
>          Components: Metrics
>    Affects Versions: 1.6.0, 1.6.1, 1.5.4
>            Reporter: Jamie Grier
>            Assignee: Jamie Grier
>            Priority: Critical
>
> The new metrics format for latency tracking causes a huge metrics cardinality 
> explosion due to the format and the fact that there is a metric reported for 
> every combination of source subtask index and operator subtask index.  
> Yikes!
> This format is actually responsible for basically taking down our metrics 
> system by DDoSing our metrics servers (at Lyft).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
