[ https://issues.apache.org/jira/browse/HDFS-17092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

microle.dong updated HDFS-17092:
--------------------------------
    Description: 
When restarting the namenode, we found that some datanodes did not report all of their blocks, which can lead to missing and under-replicated blocks.
The datanode uses multiple RPCs to report blocks. In the logs of a datanode with an incomplete block report, I found that the first FBR attempt failed due to a namenode-side error (socket timeout):

 
{code:java}
2023-07-14 17:29:24,776 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Unsuccessfully sent block report 0x7b738b02996cd2,  containing 12 storage 
report(s), of which we sent 1. The reports had 633013 total blocks and used 1 
RPC(s). This took 234 msec to generate and 98739 msecs for RPC and NN 
processing. Got back no commands.
2023-07-14 17:29:24,776 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in offerService
java.net.SocketTimeoutException: Call From x.x.x.x/x.x.x.x to x.x.x.x:9002 
failed on socket timeout exception: java.net.SocketTimeoutException: 60000 
millis timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/x.x.x.x:13868 
remote=x.x.x.x/x.x.x.x:9002]; For more details see:  
http://wiki.apache.org/hadoop/SocketTimeout 
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:863)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:822)
        at org.apache.hadoop.ipc.Client.call(Client.java:1480)
        at org.apache.hadoop.ipc.Client.call(Client.java:1413)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy14.blockReport(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:205)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:333)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:572)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:706)
        at java.lang.Thread.run(Thread.java:745){code}
The datanode's second FBR attempt reuses the same lease, which causes the namenode to remove the datanode's lease (just as in HDFS-8930), so the remaining FBR RPCs fail because no lease is left.

We should request a new lease and retry when a datanode FBR fails.
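
For illustration, here is a minimal, self-contained sketch of that retry idea. It is not the actual BPServiceActor code: NamenodeClient, requestNewLease() and the retry bound are hypothetical stand-ins (as far as I know, the real lease id is handed out via the heartbeat response), but it shows the intended behaviour of dropping the possibly revoked lease and asking for a fresh one instead of reusing it.
{code:java}
import java.io.IOException;

public class FbrRetrySketch {

    /** Hypothetical view of the two namenode calls the datanode needs here. */
    interface NamenodeClient {
        long requestNewLease() throws IOException;          // hypothetical helper
        void blockReport(long leaseId) throws IOException;  // one FBR RPC
    }

    static final int MAX_ATTEMPTS = 3; // illustrative bound, not a real config

    /**
     * Send the full block report; if an attempt fails, discard the old
     * (possibly already removed) lease, request a fresh one and try again,
     * instead of reusing the lease the namenode may have dropped (HDFS-8930).
     */
    static void fullBlockReportWithRetry(NamenodeClient nn) throws IOException {
        long leaseId = nn.requestNewLease();
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                nn.blockReport(leaseId);
                return; // success
            } catch (IOException e) {
                if (attempt == MAX_ATTEMPTS) {
                    throw e; // give up; the next scheduled FBR starts fresh
                }
                // Do NOT reuse the old lease: the namenode may have removed it.
                leaseId = nn.requestNewLease();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Toy client that fails once and then succeeds, to exercise the retry path.
        fullBlockReportWithRetry(new NamenodeClient() {
            private int calls = 0;
            public long requestNewLease() { return System.nanoTime(); }
            public void blockReport(long leaseId) throws IOException {
                if (calls++ == 0) {
                    throw new IOException("simulated socket timeout");
                }
                System.out.println("FBR succeeded with lease " + leaseId);
            }
        });
    }
}
{code}
Retrying with a fresh lease avoids the HDFS-8930 situation where the namenode has already removed the lease that the remaining FBR RPCs would otherwise keep reusing.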

 I am willing to submit a PR to fix this.

  was:
When restarting the namenode, we found that some datanodes did not report all of their blocks, which can lead to missing and under-replicated blocks.
In the logs of a datanode with an incomplete block report, I found that the first FBR attempt failed due to a namenode-side error (socket timeout):

 
{code:java}
2023-07-14 11:29:24,776 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Unsuccessfully sent block report 0x7b738b02996cd2,  containing 12 storage 
report(s), of which we sent 1. The reports had 633033 total blocks and used 1 
RPC(s). This took 169 msec to generate and 97730 msecs for RPC and NN 
processing. Got back no commands.
2023-07-14 11:29:24,776 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in offerService
java.net.SocketTimeoutException: Call From x.x.x.x/x.x.x.x to x.x.x.x:9002 
failed on socket timeout exception: java.net.SocketTimeoutException: 60000 
millis timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/x.x.x.x:13868 
remote=x.x.x.x/x.x.x.x:9002]; For more details see:  
http://wiki.apache.org/hadoop/SocketTimeout 
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:863)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:822)
        at org.apache.hadoop.ipc.Client.call(Client.java:1480)
        at org.apache.hadoop.ipc.Client.call(Client.java:1413)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy14.blockReport(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:205)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:333)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:572)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:706)
        at java.lang.Thread.run(Thread.java:745){code}
The datanode's second FBR attempt reuses the same lease, which causes the namenode to remove the datanode's lease (just as in HDFS-8930), so the FBR fails because no lease is left.

We should request a new lease and retry when a datanode FBR fails.

 I am willing to submit a PR to fix this.


> Datanode Full Block Report failure can lead to missing and under-replicated 
> blocks
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-17092
>                 URL: https://issues.apache.org/jira/browse/HDFS-17092
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: microle.dong
>            Priority: Major
>
> When restarting the namenode, we found that some datanodes did not report all of their blocks, which can lead to missing and under-replicated blocks.
> The datanode uses multiple RPCs to report blocks. In the logs of a datanode with an incomplete block report, I found that the first FBR attempt failed due to a namenode-side error (socket timeout):
>  
> {code:java}
> 2023-07-14 17:29:24,776 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x7b738b02996cd2,  containing 12 storage 
> report(s), of which we sent 1. The reports had 633013 total blocks and used 1 
> RPC(s). This took 234 msec to generate and 98739 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-14 17:29:24,776 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> IOException in offerService
> java.net.SocketTimeoutException: Call From x.x.x.x/x.x.x.x to x.x.x.x:9002 
> failed on socket timeout exception: java.net.SocketTimeoutException: 60000 
> millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/x.x.x.x:13868 
> remote=x.x.x.x/x.x.x.x:9002]; For more details see:  
> http://wiki.apache.org/hadoop/SocketTimeout 
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:863)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:822)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1480)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1413)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy14.blockReport(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:205)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:333)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:572)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:706)
>         at java.lang.Thread.run(Thread.java:745){code}
> The datanode's second FBR attempt reuses the same lease, which causes the namenode to remove the datanode's lease (just as in HDFS-8930), so the remaining FBR RPCs fail because no lease is left.
> We should request a new lease and retry when a datanode FBR fails.
>  I am willing to submit a PR to fix this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
