[ https://issues.apache.org/jira/browse/HDFS-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685487#comment-17685487 ]

Owen O'Malley commented on HDFS-16853:
--------------------------------------

The description is wrong. The SynchronousQueue has no storage and thus doesn't 
need to be cleaned up. The problem is that the other thread was stopped between 
the check at the top of sendRpcRequest and the point where it offers the 
serialized bytes to the queue.
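
To make that window concrete, here is a minimal sketch (illustrative Java, not 
the real Client.java; the flag and the ordering are simulated): a 
SynchronousQueue has no capacity, so put() only returns once a taker arrives, 
and after the consumer thread is gone the producer parks forever, exactly as in 
the jstack quoted below.

{code:java}
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class CheckThenPutRace {
  public static void main(String[] args) throws Exception {
    SynchronousQueue<byte[]> queue = new SynchronousQueue<>();
    AtomicBoolean senderRunning = new AtomicBoolean(true);

    // 1. The producer performs the state check (top of sendRpcRequest).
    boolean ok = senderRunning.get();      // true: the sender looks alive

    // 2. The sender thread stops between the check and the put.
    senderRunning.set(false);

    // 3. The producer proceeds on the stale check and parks forever,
    //    because no taker will ever arrive to complete the handoff.
    if (ok) {
      queue.put(new byte[0]);              // hangs here, as in the jstack
    }
  }
}
{code}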

Unfortunately, just making sendRpcRequest synchronized, which would fix the race 
condition, isn't an option because we can't hold the lock while we wait for our 
turn in the queue.
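
To spell out why (a hypothetical sketch, not the actual Client.java locking): 
the thread parked in put() would hold the connection's monitor, so any shutdown 
path that needs the same monitor could never run, turning the lost wakeup into 
a deadlock.

{code:java}
import java.util.concurrent.SynchronousQueue;

public class LockAcrossPut {
  private final Object connectionLock = new Object();
  private final SynchronousQueue<byte[]> queue = new SynchronousQueue<>();

  // Looks like it closes the race, but is a deadlock in disguise:
  // put() parks while the monitor is held.
  void sendRpcRequest(byte[] data) throws InterruptedException {
    synchronized (connectionLock) {
      queue.put(data);
    }
  }

  // Any shutdown/cleanup needing the same monitor now blocks forever.
  void markClosed() {
    synchronized (connectionLock) {
      // never entered while a producer is parked in put()
    }
  }
}
{code}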

The proposed fix doesn't fix the race condition because it releases the lock 
before putting the message in the queue.

Let me look at what we can do.
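
One possible direction, purely as a sketch under assumptions (the flag and 
names below are illustrative, not a committed fix): bound the wait with a timed 
offer() and re-check the sender's liveness between attempts, so a stale check 
can delay a caller but can no longer park it forever.

{code:java}
import java.io.IOException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class TimedOfferSketch {
  private final SynchronousQueue<byte[]> queue = new SynchronousQueue<>();
  private final AtomicBoolean senderRunning = new AtomicBoolean(true);

  void sendRpcRequest(byte[] data) throws IOException, InterruptedException {
    // Retry with a bounded wait; between attempts, re-check whether the
    // consumer is still alive instead of trusting the one check up front.
    while (!queue.offer(data, 100, TimeUnit.MILLISECONDS)) {
      if (!senderRunning.get()) {
        throw new IOException("Connection closed: rpc request sender stopped");
      }
    }
  }
}
{code}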

 

> The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed 
> because of HADOOP-18324
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16853
>                 URL: https://issues.apache.org/jira/browse/HDFS-16853
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.3.5
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Blocker
>              Labels: pull-request-available
>
> The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed 
> with the error message: Waiting for cluster to become active. The blocking 
> jstack is as below:
> {code:java}
> "BP-1618793397-192.168.3.4-1669198559828 heartbeating to 
> localhost/127.0.0.1:54673" #260 daemon prio=5 os_prio=31 tid=0x
> 00007fc1108fa000 nid=0x19303 waiting on condition [0x0000700017884000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000007430a9ec0> (a 
> java.util.concurrent.SynchronousQueue$TransferQueue)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(SynchronousQueue.java:762)
>         at 
> java.util.concurrent.SynchronousQueue$TransferQueue.transfer(SynchronousQueue.java:695)
>         at 
> java.util.concurrent.SynchronousQueue.put(SynchronousQueue.java:877)
>         at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1186)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1482)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1429)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>         at com.sun.proxy.$Proxy23.sendHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClient
> SideTranslatorPB.java:168)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:570)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:714)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:915)
>         at java.lang.Thread.run(Thread.java:748)  {code}
> After looking into the code, I found that this bug was introduced by 
> HADOOP-18324: the RpcRequestSender exited without cleaning up the 
> rpcRequestQueue, which left the BPServiceActor blocked while sending a request.


