neils-dev opened a new pull request, #3781:
URL: https://github.com/apache/ozone/pull/3781

   ## What changes were proposed in this pull request?
   This PR exposes metrics for nodes entering the decommissioning and 
maintenance workflow through the JMX and prom endpoints.  The metrics report 
the number of datanodes in the workflow, the container replication state of 
the tracked nodes and the number of pipelines on those nodes that are waiting 
to close.  The metrics are published from the `NodeDecommissionManager` 
through the `DataAdminMonitorImpl`, so that the progress of the decommission 
and maintenance workflow can be monitored.
   
   The progress of datanodes going through the workflow is monitored through 
aggregated counts: the number of tracked nodes, the number of their pipelines 
waiting to close, and the number of their containers in each of the 
sufficiently replicated, under-replicated and unhealthy states.  The metrics 
collected are those discussed in the associated Jira comments.
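
The aggregation described above can be pictured with a minimal Python sketch. It is an illustration only and not the Java code in this PR: `TrackedNode` and `aggregate` are hypothetical names, and the per-node counts are made-up values.

```python
from dataclasses import dataclass


@dataclass
class TrackedNode:
    """Hypothetical per-datanode snapshot; not a class from the Ozone code base."""
    pipelines_waiting_to_close: int
    sufficiently_replicated: int
    under_replicated: int
    unhealthy: int


def aggregate(nodes):
    """Sum the per-node counts into workflow-wide totals."""
    return {
        "tracked_nodes": len(nodes),
        "pipelines_waiting_to_close": sum(n.pipelines_waiting_to_close for n in nodes),
        "containers_sufficiently_replicated": sum(n.sufficiently_replicated for n in nodes),
        "containers_under_replicated": sum(n.under_replicated for n in nodes),
        "containers_unhealthy": sum(n.unhealthy for n in nodes),
    }


# Two made-up nodes in the workflow.
nodes = [TrackedNode(2, 10, 3, 0), TrackedNode(0, 7, 1, 1)]
totals = aggregate(nodes)
```

Each total is a plain sum over the tracked nodes, so a node leaving the workflow simply drops out of the next aggregation.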
   
   **As exposed to prom endpoint:**
   
   _Aggregated total number of datanodes in the workflow:_
   `node_decommission_metrics_total_tracked_decommissioning_maintenance_nodes`
   
   _For tracked datanodes in the workflow, the container replication state: 
the total number of containers in each of the sufficiently replicated, 
under-replicated and unhealthy states:_
   ```
   node_decommission_metrics_total_tracked_containers_sufficiently_replicated
   node_decommission_metrics_total_tracked_containers_under_replicated
   node_decommission_metrics_total_tracked_containers_unhealthy
   ```
   _For tracked datanodes in the workflow, the aggregated number of pipelines 
waiting to close:_
   `node_decommission_metrics_total_tracked_pipelines_waiting_to_close`
   
   
   _And the number of datanodes that are taken out of the workflow and 
recommissioned:_
   `node_decommission_metrics_total_tracked_recommission_nodes`
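
For reference, a small Python sketch of how these metrics could be picked out of a scrape in the Prometheus text exposition format. The sample text, its `hostname` label and its values are illustrative assumptions, not captured output, and `decommission_metrics` is a hypothetical helper.

```python
import re

PREFIX = "node_decommission_metrics_"

# metric name, optional {labels}, then the sample value.
# Simplification: label values containing "}" are not handled.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def decommission_metrics(text):
    """Return {short_name: value} for the decommission metrics in a scrape."""
    found = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        match = _SAMPLE.match(line)
        if match and match.group(1).startswith(PREFIX):
            found[match.group(1)[len(PREFIX):]] = float(match.group(2))
    return found


# Illustrative scrape text; labels and values are assumptions.
SCRAPE = """\
# illustrative sample, not captured output
node_decommission_metrics_total_tracked_decommissioning_maintenance_nodes{hostname="e68cfe1f098e"} 1
node_decommission_metrics_total_tracked_containers_under_replicated{hostname="e68cfe1f098e"} 4
some_other_metric{hostname="e68cfe1f098e"} 12
"""
parsed = decommission_metrics(SCRAPE)
```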
   
   **Similarly exposed via JMX:**
   ```
    {
       "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Hostname" : "e68cfe1f098e",
       "TotalTrackedDecommissioningMaintenanceNodes" : 0,
       "TotalTrackedRecommissionNodes" : 0,
       "TotalTrackedPipelinesWaitingToClose" : 0,
       "TotalTrackedContainersUnderReplicated" : 0,
       "TotalTrackedContainersUnhealthy" : 0,
       "TotalTrackedContainersSufficientlyReplicated" : 0
     }
   ```
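
The JMX attribute names and the prom metric names listed above differ only in casing and prefix. The Python sketch below shows that relationship as it appears in this description; it is an observation about the names, not necessarily how the Ozone Prometheus sink derives them, and `camel_to_snake` and `prom_names` are hypothetical helpers.

```python
import json
import re


def camel_to_snake(name):
    """TotalTrackedNodes -> total_tracked_nodes."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def prom_names(bean):
    """Map the numeric attributes of a JMX bean to prom-style metric names."""
    prefix = camel_to_snake(bean["modelerType"])
    return {
        f"{prefix}_{camel_to_snake(key)}": value
        for key, value in bean.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


# The bean shown above, as JSON.
BEAN = json.loads("""
{
  "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
  "modelerType" : "NodeDecommissionMetrics",
  "tag.Hostname" : "e68cfe1f098e",
  "TotalTrackedDecommissioningMaintenanceNodes" : 0,
  "TotalTrackedRecommissionNodes" : 0,
  "TotalTrackedPipelinesWaitingToClose" : 0,
  "TotalTrackedContainersUnderReplicated" : 0,
  "TotalTrackedContainersUnhealthy" : 0,
  "TotalTrackedContainersSufficientlyReplicated" : 0
}
""")
names = prom_names(BEAN)
```

String attributes such as `name` and `tag.Hostname` are skipped, leaving the six counters.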
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-2642
   
   ## How was this tested?
   Unit tests, the CI workflow, and manual testing with a dev docker-cluster: 
nodes were put into the decommissioning workflow while the collected metrics 
were monitored at the prom endpoint.
   
   **Unit tests:**
   
   `hadoop-hdds/server-scm$ mvn -Dtest=TestNodeDecommissionMetrics test`
   
   [INFO] -------------------------------------------------------
   [INFO]  T E S T S
   [INFO] -------------------------------------------------------
   [INFO] Running org.apache.hadoop.hdds.scm.node.TestNodeDecommissionMetrics
   [INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.072 s - in org.apache.hadoop.hdds.scm.node.TestNodeDecommissionMetrics
   [INFO] 
   [INFO] Results:
   [INFO] 
   [INFO] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0
   [INFO] 
   
   
   **Manual testing via dev docker-cluster:**
   Modify the docker-config in 
`hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/ozone` to set the scm 
serviceid and serviceid-address:
   `OZONE-SITE.XML_ozone.scm.nodes.scmservice=scm`
   `OZONE-SITE.XML_ozone.scm.address.scmservice.scm=scm`
   
   Set docker-compose to include monitoring with Prometheus and start the 
cluster:
   `export COMPOSE_FILE=docker-compose.yaml:monitoring.yaml`
   `hadoop-ozone/dist/target/ozone-1.3.0-SNAPSHOT/compose/ozone$ docker-compose up -d --scale datanode=3`
   
   View the metrics through the prom endpoint: http://localhost:9090
   Decommission a datanode from the scm bash prompt:
   `$ ozone admin datanode decommission -id=scmservice --scm=172.26.0.3:9894 3224625960ec`
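
As an alternative to the web UI, the metrics can also be read from Prometheus' HTTP query API (`/api/v1/query`). The Python sketch below assumes Prometheus is reachable at the address used above; `query_url`, `sample_values` and `fetch` are hypothetical helpers, and the `EXAMPLE` response is hand-written for illustration, not captured output.

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # Prometheus address from the steps above
METRIC = "node_decommission_metrics_total_tracked_decommissioning_maintenance_nodes"


def query_url(metric, base=PROM_URL):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": metric})


def sample_values(body):
    """Extract the sample values from an instant-query (vector) response."""
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body}")
    # Each series carries "value": [timestamp, "<value as string>"].
    return [float(series["value"][1]) for series in body["data"]["result"]]


def fetch(metric):
    """Query a running Prometheus; needs the docker-cluster to be up."""
    with urllib.request.urlopen(query_url(metric), timeout=5) as resp:
        return sample_values(json.load(resp))


# Hand-written response in the documented shape; the value is an assumption.
EXAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"__name__": METRIC}, "value": [1665000000.0, "1"]}],
    },
}
values = sample_values(EXAMPLE)
```

With the cluster running, `fetch(METRIC)` would return one value per scraped series; polling it after the decommission command shows the tracked-node count change.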

