microle.dong created HDFS-17092:
-----------------------------------

             Summary: Datanode Full Block Report failure can lead to missing and under-replicated blocks
                 Key: HDFS-17092
                 URL: https://issues.apache.org/jira/browse/HDFS-17092
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
            Reporter: microle.dong


When restarting the namenode, we found that some datanodes did not report all of their blocks, which can lead to missing and under-replicated blocks.
In the logs of a datanode with incomplete block reporting, I found that its first FBR attempt had failed because the RPC to the NameNode timed out:

 
{code:java}
2023-07-14 11:29:24,776 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x7b738b02996cd2,  containing 12 storage report(s), of which we sent 1. The reports had 633033 total blocks and used 1 RPC(s). This took 169 msec to generate and 97730 msecs for RPC and NN processing. Got back no commands.
2023-07-14 11:29:24,776 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.net.SocketTimeoutException: Call From x.x.x.x/x.x.x.x to x.x.x.x:9002 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:13868 remote=x.x.x.x/x.x.x.x:9002]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:863)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:822)
        at org.apache.hadoop.ipc.Client.call(Client.java:1480)
        at org.apache.hadoop.ipc.Client.call(Client.java:1413)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy14.blockReport(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:205)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:333)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:572)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:706)
        at java.lang.Thread.run(Thread.java:745){code}
The datanode's second FBR attempt then reuses the same lease id, which causes the NameNode to remove the datanode's lease (just as in HDFS-8930), so the retried FBR fails because no lease is left (see the sketch of the current flow below).
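
For context, this is roughly how offerService handles the lease today; a paraphrased sketch from my reading of BPServiceActor, not a verbatim excerpt, and names may differ slightly between branches:

{code:java}
// Paraphrased sketch of the relevant logic in BPServiceActor#offerService()
// (illustrative only; the exact code in trunk differs slightly).

// A new FBR lease is only requested through the heartbeat while the cached
// lease id is 0.
boolean requestBlockReportLease = (fullBlockReportLeaseId == 0)
    && scheduler.isBlockReportDue(scheduler.nextHeartbeatTime);
HeartbeatResponse resp = sendHeartBeat(requestBlockReportLease);
if (resp.getFullBlockReportLeaseId() != 0) {
  fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
}

if (fullBlockReportLeaseId != 0) {
  cmds = blockReport(fullBlockReportLeaseId);
  // Only reached when blockReport() returns normally. If it throws (the
  // SocketTimeoutException above), the stale lease id is kept and the next
  // loop iteration retries the FBR with the same lease, which the NameNode
  // may already have processed and removed.
  fullBlockReportLeaseId = 0;
}
{code}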

We should request a new lease and retry when a datanode FBR fails, as sketched below.
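
One possible shape of the fix, as a minimal sketch only; it assumes the BPServiceActor fields above and is not the final patch (the real change may also need to force the next report to be sent promptly rather than waiting for the regular schedule):

{code:java}
// Minimal sketch of the idea, not the final patch: never retry an FBR with a
// lease id that may already have been consumed on the NameNode side.
if (fullBlockReportLeaseId != 0) {
  try {
    cmds = blockReport(fullBlockReportLeaseId);
  } finally {
    // Clear the lease id even when the RPC fails or times out. The NameNode
    // may have processed the report and removed the lease despite the
    // client-side timeout, so the next heartbeat should request a fresh
    // lease and the FBR should be retried under that new lease instead of
    // re-sending the stale id.
    fullBlockReportLeaseId = 0;
  }
}
{code}

The key point is that a failed FBR must not be retried with the old lease id; clearing it lets the next heartbeat request a fresh lease from the NameNode's BlockReportLeaseManager.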

 I am willing to submit a PR to fix this.


