[jira] [Updated] (HADOOP-17893) Improve PrometheusSink for Namenode and ResourceManager Metrics

ASF GitHub Bot (Jira) Sun, 12 Sep 2021 23:41:11 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-17893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated HADOOP-17893:
------------------------------------
    Labels: pull-request-available  (was: )

> Improve PrometheusSink for Namenode and ResourceManager Metrics
> ---------------------------------------------------------------
>
>                 Key: HADOOP-17893
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17893
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: metrics
>    Affects Versions: 3.4.0
>            Reporter: Max  Xie
>            Assignee: Max  Xie
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: HADOOP-17893.01.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> HADOOP-16398 added exporter for hadoop metrics to prometheus. But some of 
> metrics can't be exported  validly. For example like these metrics, 
> 1.  queue metrics for ResourceManager
> {code:java}
> queue_metrics_max_capacity{queue="root.queue1",context="yarn",hostname="rm_host1"}
>  1
> // queue2's metric can't be exported 
> queue_metrics_max_capacity{queue="root.queue2",context="yarn",hostname="rm_host1"}
>  2
> {code}
> It always exported  only one queue's metric because 
> PrometheusMetricsSink$metricLines only cache one metric  if theses metrics 
> have the same name no matter these metrics has different metric tags.
>  
> 2. rpc metrics for Namenode
> Namenode may have rpc metrics with multi port like service-rpc. But because  
> the same reason  as  Issue 1, it wiil lost some rpc metrics if we use 
> PrometheusSink.
> {code:java}
> rpc_rpc_queue_time300s90th_percentile_latency{port="9000",servername="ClientNamenodeProtocol",context="rpc",hostname="nnhost"}
>  0
> // rpc port=9005 metric can't be exported 
> rpc_rpc_queue_time300s90th_percentile_latency{port="9005",servername="ClientNamenodeProtocol",context="rpc",hostname="nnhost"}
>  0
> {code}
> 3. TopMetrics for Namenode
> org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics is a special 
> metric. And I think It is essentially a Summary metric type. TopMetrics 
> record name will according to different user and op ,  which means that these 
> metric will always exist in PrometheusMetricsSink$metricLines and it may 
> cause the risk of its memory leak. We e need to treat it special. 
> {code:java}
> // invaild topmetric export
> # TYPE 
> nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count
>  counter
> nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count{context="dfs",hostname="nn_host",op="safemode_get",user="hadoop/client...@test.com"}
>  10
> // it should be 
> # TYPE nn_top_user_op_counts_window_ms_1500000_count counter
> nn_top_user_op_counts_window_ms_1500000_count{context="dfs",hostname="nn_host",op="safemode_get",user="hadoop/client...@test.com"}
>  10{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

[jira] [Updated] (HADOOP-17893) Improve PrometheusSink for Namenode and ResourceManager Metrics

Reply via email to