[ https://issues.apache.org/jira/browse/HADOOP-17893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HADOOP-17893:
------------------------------------
    Labels: pull-request-available  (was: )

> Improve PrometheusSink for Namenode and ResourceManager Metrics
> ---------------------------------------------------------------
>
>                 Key: HADOOP-17893
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17893
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: metrics
>    Affects Versions: 3.4.0
>            Reporter: Max Xie
>            Assignee: Max Xie
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: HADOOP-17893.01.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> HADOOP-16398 added a Prometheus exporter for Hadoop metrics, but some
> metrics are not exported correctly. For example:
>
> 1. Queue metrics for ResourceManager
> {code:java}
> queue_metrics_max_capacity{queue="root.queue1",context="yarn",hostname="rm_host1"} 1
> // queue2's metric can't be exported
> queue_metrics_max_capacity{queue="root.queue2",context="yarn",hostname="rm_host1"} 2
> {code}
> Only one queue's metric is ever exported, because
> PrometheusMetricsSink$metricLines caches a single entry per metric name,
> even when metrics with the same name carry different tags.
>
> 2. RPC metrics for Namenode
> The Namenode may expose RPC metrics on multiple ports, such as the
> service-rpc port. For the same reason as in issue 1, some of these RPC
> metrics are lost when PrometheusSink is used.
> {code:java}
> rpc_rpc_queue_time300s90th_percentile_latency{port="9000",servername="ClientNamenodeProtocol",context="rpc",hostname="nnhost"} 0
> // rpc port=9005 metric can't be exported
> rpc_rpc_queue_time300s90th_percentile_latency{port="9005",servername="ClientNamenodeProtocol",context="rpc",hostname="nnhost"} 0
> {code}
>
> 3. TopMetrics for Namenode
> org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics is a special
> metric; I think it is essentially a Summary metric type. TopMetrics record
> names vary with the user and op, so these metrics are never removed from
> PrometheusMetricsSink$metricLines, which risks a memory leak. We need to
> handle TopMetrics specially.
> {code:java}
> // invalid topmetric export
> # TYPE nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count counter
> nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count{context="dfs",hostname="nn_host",op="safemode_get",user="hadoop/client...@test.com"} 10
> // it should be
> # TYPE nn_top_user_op_counts_window_ms_1500000_count counter
> nn_top_user_op_counts_window_ms_1500000_count{context="dfs",hostname="nn_host",op="safemode_get",user="hadoop/client...@test.com"} 10
> {code}
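
The three cases above share one root cause: the cache is keyed by metric name only. As a rough illustration (not the contents of HADOOP-17893.01.patch), the sketch below keys each sample by metric name and then by its sorted tag set, and folds per-user/per-op TopMetrics names into a single name. The class `TaggedMetricCache`, its methods, and the regular expression are hypothetical names invented for this example; label-value escaping and eviction of stale entries are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a sample cache keyed by metric name AND tag set.
class TaggedMetricCache {
    // Assumed shape of a TopMetrics name, based on the example in the issue:
    // nn_top_user_op_counts_window_ms_<window>_op_<op>_user_<user>_count
    private static final Pattern TOP_METRIC_NAME = Pattern.compile(
        "^(nn_top_user_op_counts_window_ms_\\d+)_op_.+_count$");

    // metric name -> (rendered label set -> latest exposition line)
    private final Map<String, Map<String, String>> samples =
        new ConcurrentHashMap<>();

    // Collapse per-user/per-op TopMetrics names to one name; the user and op
    // are expected to be carried as tags instead. Other names pass through.
    static String normalizeName(String name) {
        Matcher m = TOP_METRIC_NAME.matcher(name);
        return m.matches() ? m.group(1) + "_count" : name;
    }

    // Store the latest value for this (name, tags) pair.
    void put(String name, Map<String, String> tags, long value) {
        StringBuilder labels = new StringBuilder();
        // Sort tags so the same tag set always yields the same key.
        for (Map.Entry<String, String> e : new TreeMap<>(tags).entrySet()) {
            if (labels.length() > 0) {
                labels.append(',');
            }
            // Note: label values are not escaped in this sketch.
            labels.append(e.getKey()).append("=\"")
                  .append(e.getValue()).append('"');
        }
        String metric = normalizeName(name);
        String line = metric + "{" + labels + "} " + value;
        samples.computeIfAbsent(metric, k -> new ConcurrentHashMap<>())
               .put(labels.toString(), line);
    }

    // All cached exposition lines for one metric name (order unspecified).
    List<String> lines(String name) {
        Map<String, String> byTags = samples.get(name);
        if (byTags == null) {
            return new ArrayList<String>();
        }
        return new ArrayList<String>(byTags.values());
    }
}
```

With this layout, two `put` calls for `queue_metrics_max_capacity` with `queue="root.queue1"` and `queue="root.queue2"` yield two cached lines instead of one, and a repeated `put` for the same tag set overwrites only its own entry. Entries for users or ops that stop reporting would still accumulate, so a real sink would also need some expiry or per-scrape reset.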