[ 
https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736547#comment-17736547
 ] 

ASF GitHub Bot commented on HDFS-17055:
---------------------------------------

xinglin commented on PR #5764:
URL: https://github.com/apache/hadoop/pull/5764#issuecomment-1604512082

   The TestObserverNode unit test failure does not seem to be related to the 
change in this PR. The error is an RPC connection error.
   
   ```
   [ERROR] testMkdirsRaceWithObserverRead(org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode)  Time elapsed: 317.089 s  <<< ERROR!
   java.net.ConnectException: Call From 038ad877ed75/172.17.0.2 to localhost:11836 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.GeneratedConstructorAccessor115.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:948)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:863)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1588)
        at org.apache.hadoop.ipc.Client.call(Client.java:1529)
        at org.apache.hadoop.ipc.Client.call(Client.java:1426)
   ```
   
   All TestObserverNode unit tests also passed when run on my laptop.
   ```
   [INFO] -------------------------------------------------------
   [INFO]  T E S T S
   [INFO] -------------------------------------------------------
   [INFO] Running org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode
   [INFO] Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 49.418 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode
   [INFO]
   [INFO] Results:
   [INFO]
   [INFO] Tests run: 20, Failures: 0, Errors: 0, Skipped: 0
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD SUCCESS
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time:  01:27 min
   [INFO] Finished at: 2023-06-23T09:18:26-07:00
   ```
   
   Triggering another build.




> Export HAState as a metric from Namenode for monitoring
> -------------------------------------------------------
>
>                 Key: HDFS-17055
>                 URL: https://issues.apache.org/jira/browse/HDFS-17055
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>    Affects Versions: 3.4.0, 3.3.9
>            Reporter: Xing Lin
>            Assignee: Xing Lin
>            Priority: Minor
>              Labels: pull-request-available
>
> We'd like to measure the uptime of Namenodes: the percentage of time during 
> which the active/standby/observer node is available (up and running). We 
> could monitor the namenode from an external service, such as ZKFC, but that 
> would require the external service itself to be available 100% of the time. 
> When this third-party external monitoring service is down, we have no 
> information on whether our Namenodes are still up.
> We propose a different approach: emit the Namenode state directly from the 
> namenode itself. Whenever a data point for this metric is missing, we 
> consider the corresponding namenode to be down/not available. In other words, 
> we assume the metric collection/monitoring infrastructure to be 100% reliable.
> One implementation detail: in Hadoop, we have the _NameNodeMetrics_ class, 
> which is currently used to emit all metrics for _NameNode.java_. However, 
> we don't think that is a good place to emit the NameNode HAState. HAState is 
> stored in NameNode.java, so we should emit it directly from NameNode.java. 
> Otherwise, we would duplicate this information in two classes and have to 
> keep them in sync. Besides, the _NameNodeMetrics_ class does not hold a 
> reference to the _NameNode_ object it belongs to: a _NameNodeMetrics_ 
> instance is created by the _static_ function _initMetrics()_ in _NameNode.java_.
> We shouldn't emit the HA state from FSNamesystem.java either, as it is 
> initialized from NameNode.java and all state transitions are implemented in 
> NameNode.java.
>  
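
A minimal sketch of the idea described above: the object that owns the HA state also exposes it as a gauge, so no second class has to mirror the state. This is an illustration only, not the code from the PR. The names `HaStateGaugeSketch`, `Node`, `HaState`, `transitionTo` and `getHaStateGauge`, as well as the integer codes, are hypothetical, and the sketch deliberately avoids the Hadoop metrics classes so it runs on its own.

```java
import java.util.concurrent.atomic.AtomicReference;

class HaStateGaugeSketch {

    /** Hypothetical HA states; the integer codes are illustrative assumptions. */
    enum HaState {
        INITIALIZING(0), ACTIVE(1), STANDBY(2), OBSERVER(3);

        final int code;

        HaState(int code) {
            this.code = code;
        }
    }

    /** Stand-in for the node that owns the HA state and performs the transitions. */
    static class Node {
        private final AtomicReference<HaState> state =
            new AtomicReference<>(HaState.INITIALIZING);

        /** The only place where the state changes. */
        void transitionTo(HaState next) {
            state.set(next);
        }

        /** Gauge value read by a metrics collector; always reflects the owner's state. */
        int getHaStateGauge() {
            return state.get().code;
        }
    }

    public static void main(String[] args) {
        Node node = new Node();
        System.out.println("gauge before transition: " + node.getHaStateGauge());
        node.transitionTo(HaState.ACTIVE);
        System.out.println("gauge after transition: " + node.getHaStateGauge());
    }
}
```

Because the gauge reads the same field that the transitions write, there is no copy to keep in sync. A collector that receives no data point at all would treat the node as down, as the description proposes.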



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

