neils-dev opened a new pull request, #3781:
URL: https://github.com/apache/ozone/pull/3781
## What changes were proposed in this pull request?
This change exposes metrics for nodes entering the decommissioning and
maintenance workflow on the JMX and prom endpoints. The metrics report the
number of datanodes in the workflow, the container replication state of
tracked nodes, and the number of pipelines on tracked nodes that are waiting
to close. With these metrics, exposed from the `NodeDecommissionManager`
through the `DataAdminMonitorImpl`, the progress of the decommission and
maintenance workflow can be monitored.
The progress of datanodes going through the workflow is monitored through
aggregated counts of the tracked nodes, their pipelines waiting to close, and
their containers in each of the sufficiently replicated, under-replicated and
unhealthy states. The metrics collected are as discussed in the associated
Jira comments.
**As exposed to prom endpoint:**
_aggregated total number of datanodes in workflow:_
`node_decommission_metrics_total_tracked_decommissioning_maintenance_nodes`
_Of tracked datanodes in workflow, the container replication state; total
number of containers in each of sufficiently replicated, under-replicated and
unhealthy state_
```
node_decommission_metrics_total_tracked_containers_sufficiently_replicated
node_decommission_metrics_total_tracked_containers_under_replicated
node_decommission_metrics_total_tracked_containers_unhealthy
```
_Of tracked datanodes in workflow, the aggregated number of pipelines
waiting to close_
`node_decommission_metrics_total_tracked_pipelines_waiting_to_close`
_And, the number of datanodes in the workflow that are taken out and
recommissioned._
`node_decommission_metrics_total_tracked_recommission_nodes`
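As a rough illustration of consuming the metrics listed above, the sketch below filters the `node_decommission_metrics_` series out of a scraped text payload. The sample payload, its `hostname` label, and all values are hypothetical and only stand in for real endpoint output; the parser also assumes simple `name{labels} value` lines without trailing timestamps.

```python
# Hypothetical scrape output; label names and values are made up for illustration.
SAMPLE = """\
# sample comment line, ignored by the parser
node_decommission_metrics_total_tracked_decommissioning_maintenance_nodes{hostname="scm"} 1
node_decommission_metrics_total_tracked_containers_under_replicated{hostname="scm"} 4
node_decommission_metrics_total_tracked_pipelines_waiting_to_close 2
some_other_metric 7
"""

PREFIX = "node_decommission_metrics_"


def parse_decommission_metrics(text):
    """Return {short_name: value} for every series that starts with PREFIX."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Assumes the value is the last space-separated field (no timestamp).
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop the label set, if any
        if name.startswith(PREFIX):
            result[name[len(PREFIX):]] = float(value)
    return result


if __name__ == "__main__":
    for name, value in parse_decommission_metrics(SAMPLE).items():
        print(name, value)
```

Stripping the common prefix keeps the resulting keys short; a real consumer would more likely rely on a Prometheus client library or on Prometheus itself rather than hand-parsing.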
**Similarly exposed via JMX:**
```
{
"name" :
"Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
"modelerType" : "NodeDecommissionMetrics",
"tag.Hostname" : "e68cfe1f098e",
"TotalTrackedDecommissioningMaintenanceNodes" : 0,
"TotalTrackedRecommissionNodes" : 0,
"TotalTrackedPipelinesWaitingToClose" : 0,
"TotalTrackedContainersUnderReplicated" : 0,
"TotalTrackedContainersUnhealthy" : 0,
"TotalTrackedContainersSufficientlyReplicated" : 0
}
```
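The JMX attribute names and the prom metric names above differ only in casing and prefix. The sketch below reproduces that correspondence for the names listed in this description; it is an illustration of the visible naming pattern, not a copy of how Ozone's Prometheus sink actually derives names, and `jmx_to_prom` is a hypothetical helper.

```python
import re


def camel_to_snake(name):
    """Insert '_' before every upper-case letter except the first, then lower-case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def jmx_to_prom(source, attribute):
    """Combine a JMX metrics source name and an attribute into a prom-style name."""
    return f"{camel_to_snake(source)}_{camel_to_snake(attribute)}"


if __name__ == "__main__":
    print(jmx_to_prom("NodeDecommissionMetrics",
                      "TotalTrackedContainersUnderReplicated"))
```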
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-2642
## How was this tested?
Tested with unit tests, the CI workflow, and manually on a dev docker-cluster
by putting nodes into the decommissioning workflow and monitoring the metrics
collected at the prom endpoint.
**Unit tests:**
`hadoop-hdds/server-scm$ mvn -Dtest=TestNodeDecommissionMetrics test`
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hdds.scm.node.TestNodeDecommissionMetrics
[INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.072 s - in org.apache.hadoop.hdds.scm.node.TestNodeDecommissionMetrics
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0
[INFO]
**Manual testing via dev docker-cluster:**
Modify the docker-config to set the SCM service id and its address:
`hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/ozone$`
OZONE-SITE.XML_ozone.scm.nodes.scmservice=scm
OZONE-SITE.XML_ozone.scm.address.scmservice.scm=scm
Set docker-compose for monitoring with Prometheus:
export COMPOSE_FILE=docker-compose.yaml:monitoring.yaml
`hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/ozone$ docker-compose up -d --scale datanode=3`
View metrics through the prom endpoint: http://localhost:9090
Decommission a datanode from the scm bash prompt:
`$ ozone admin datanode decommission -id=scmservice --scm=172.26.0.3:9894 3224625960ec`
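For checking the result of the manual test without the web UI, the following is a minimal sketch that queries the Prometheus HTTP API at the address used above. It assumes the Prometheus server from `monitoring.yaml` is reachable on `localhost:9090` and has scraped the SCM; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"  # address from the manual test setup


def query_url(metric, base=PROMETHEUS):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": metric})


def extract_values(response_body):
    """Pull the sample values out of an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    # Each vector sample is {"metric": {...}, "value": [timestamp, "value"]}.
    return [float(sample["value"][1]) for sample in payload["data"]["result"]]


def fetch(metric):
    """Query a running Prometheus server; requires the cluster to be up."""
    with urllib.request.urlopen(query_url(metric), timeout=5) as resp:
        return extract_values(resp.read().decode("utf-8"))


if __name__ == "__main__":
    name = "node_decommission_metrics_total_tracked_decommissioning_maintenance_nodes"
    print(name, fetch(name))
```

After the decommission command is issued, polling `fetch` for the tracked-node and pipelines-waiting-to-close metrics should show the workflow progressing, provided the scrape interval has elapsed.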