[jira] [Commented] (HDFS-16907) BP service actor LastHeartbeat is not sufficient to track realtime connection breaks

ASF GitHub Bot (Jira) Fri, 03 Feb 2023 20:56:20 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-16907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684091#comment-17684091
 ]


ASF GitHub Bot commented on HDFS-16907:
---------------------------------------

ayushtkn commented on code in PR #5349:
URL: https://github.com/apache/hadoop/pull/5349#discussion_r1096481460


##########
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeMXBean.java:
##########
@@ -294,4 +297,107 @@ public void testDataNodeMXBeanSlowDisksEnabled() throws 
Exception {
       if (cluster != null) {cluster.shutdown();}
     }
   }
+
+  @Test
+  public void testDataNodeMXBeanLastHeartbeats() throws Exception {
+    Configuration conf = new Configuration();
+    try (MiniDFSCluster cluster = new MiniDFSCluster
+        .Builder(conf)
+        .nnTopology(MiniDFSNNTopology.simpleHATopology(2))
+        .numDataNodes(1)
+        .build()) {
+      cluster.waitActive();
+      cluster.transitionToActive(0);
+      cluster.transitionToStandby(1);
+
+      DataNode datanode = cluster.getDataNodes().get(0);
+
+      MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
+      ObjectName mxbeanName = new ObjectName(
+          "Hadoop:service=DataNode,name=DataNodeInfo");
+
+      // Verify and wait until one of the BP service actor identifies active 
namenode as active
+      // and another as standby.
+      GenericTestUtils.waitFor(() -> {
+        List<Map<String, String>> bpServiceActorInfo = 
datanode.getBPServiceActorInfoMap();
+        Map<String, String> bpServiceActorInfo1 = bpServiceActorInfo.get(0);
+        Map<String, String> bpServiceActorInfo2 = bpServiceActorInfo.get(1);
+        return (HAServiceProtocol.HAServiceState.ACTIVE.toString()
+            .equals(bpServiceActorInfo1.get("NamenodeHaState"))
+            && HAServiceProtocol.HAServiceState.STANDBY.toString()
+            .equals(bpServiceActorInfo2.get("NamenodeHaState")))
+            || (HAServiceProtocol.HAServiceState.ACTIVE.toString()
+            .equals(bpServiceActorInfo2.get("NamenodeHaState"))
+            && HAServiceProtocol.HAServiceState.STANDBY.toString()
+            .equals(bpServiceActorInfo1.get("NamenodeHaState")));
+      },
+          500,
+          8000,
+          "No namenode is reported active");

Review Comment:
   This looks like checking if we have the datanode has acknowledged the active 
namenode or not? Can we have this util as part of MiniDfsCluster?
   We have something like ``waitDatanodeFullyStarted`` there, may be a new 
method with an extra param, checkActive or something like that, which can may 
be in the Datanode check for Active NN also
   
   ```
     public boolean isDatanodeFullyStarted() {
       for (BPOfferService bp : blockPoolManager.getAllNamenodeThreads()) {
         if (!bp.isInitialized() || !bp.isAlive() || bp.getActiveNN()==null) {
           return false;
         }
       }
   ```
   
   or at worst refactor the present somewhere into a method





> BP service actor LastHeartbeat is not sufficient to track realtime connection 
> breaks
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-16907
>                 URL: https://issues.apache.org/jira/browse/HDFS-16907
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Screenshot 2023-02-03 at 6.12.24 PM.png
>
>
> Each BP service actor thread maintains _lastHeartbeatTime_ with the namenode 
> that it is connected to. However, this is updated even if the connection to 
> the namenode is broken.
> Suppose, the actor thread keeps heartbeating to namenode and suddenly the 
> socket connection is broken. When this happens, until specific time duration, 
> the actor thread consistently keeps updating _lastHeartbeatTime_ before even 
> initiating heartbeat connection with namenode. If connection cannot be 
> established even after RPC retries are exhausted, then IOException is thrown. 
> This means that heartbeat response has not been received from the namenode. 
> In the loop, the actor thread keeps trying connecting for heartbeat and the 
> last heartbeat stays close to 1/2s even though in reality there is no 
> response being received from namenode.
>  
> Sample Exception from the BP service actor thread, during which LastHeartbeat 
> stays very low:
> {code:java}
> 2023-02-03 22:34:55,725 WARN  [xyz:9000] datanode.DataNode - IOException in 
> offerService
> java.io.EOFException: End of File Exception between local host is: "dn-0"; 
> destination host is: "nn-1":9000; : java.io.EOFException; For more details 
> see:  http://wiki.apache.org/hadoop/EOFException
>     at sun.reflect.GeneratedConstructorAccessor34.newInstance(Unknown Source)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:913)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:862)
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1495)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1392)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
>     at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source)
>     at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:544)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:682)
>     at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:890)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1884)
>     at 
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1176)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1074) {code}
> Attaching screenshots of how last heartbeat value looks when the above error 
> is consistently getting logged.
>  
> Last heartbeat response time is important to initiate any auto-recovery from 
> datanode. Hence, we should introduce LastHeartbeatResponseTime that only gets 
> updated if the BP service actor thread was successfully able to retrieve 
> response from namenode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Commented] (HDFS-16907) BP service actor LastHeartbeat is not sufficient to track realtime connection breaks

Reply via email to