[ 
https://issues.apache.org/jira/browse/HDDS-15266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080771#comment-18080771
 ] 

Andrey Yarovoy commented on HDDS-15266:
---------------------------------------

Why not report unhealthy containers via metrics from the DN?

> Ability to trace container state transitions across a cluster
> -------------------------------------------------------------
>
>                 Key: HDDS-15266
>                 URL: https://issues.apache.org/jira/browse/HDDS-15266
>             Project: Apache Ozone
>          Issue Type: Wish
>            Reporter: Wei-Chiu Chuang
>            Priority: Major
>
> Request for proposals to improve cluster-wide observability of container 
> health state.
>  
> Current state:
>    * Container Audit Logs: Accurate. Container state transitions are 
> currently audited in HddsDispatcher.java on individual Data Nodes. This 
> single-datanode scope allows monitoring of state transitions on one node, 
> but container health state requires a holistic view across the cluster, 
> which makes cluster-wide health tracking difficult. For example: when does 
> a container become over-replicated or unhealthy, at what timestamp, and at 
> which corresponding Ratis transaction ID?
>  
> One possible approach is to implement a mechanism for Data Nodes to report 
> container state transition events to SCM (or Recon) via heartbeats, allowing 
> SCM to expose a holistic, cluster-wide metric for container transitions.
>  
> Open to suggestions.
>  
> References:
> [https://ozone.apache.org/docs/next/system-internals/replication/data/replication-manager#container-state-descriptions]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
