[ https://issues.apache.org/jira/browse/HDFS-16987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719695#comment-17719695 ]

ASF GitHub Bot commented on HDFS-16987:
---------------------------------------

ZanderXu commented on PR #5583:
URL: https://github.com/apache/hadoop/pull/5583#issuecomment-1535815925

   > IIUC, the replica's GS is monotonically increasing at DataNode side, right?
   
   Should we also consider the case of manual tampering? For example, manually 
renaming the replica's meta file to one with a smaller GS.
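
As context for that question: on the DataNode, a replica's GS is encoded in the 
name of its meta file (blk_<blockId>_<generationStamp>.meta), so renaming that 
file is enough to change the GS the DataNode would report. The tiny parser below 
is only a sketch of that naming convention; it is not DataNode code, and the 
class/method names are made up for illustration.

{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: shows why renaming blk_1073741825_1002.meta to
// blk_1073741825_1001.meta would make the replica appear to carry a
// smaller (stale) generation stamp.
public class MetaFileGsSketch {

  // HDFS replica meta files are named blk_<blockId>_<generationStamp>.meta
  private static final Pattern META_NAME =
      Pattern.compile("blk_(\\d+)_(\\d+)\\.meta");

  static long genStampFromMetaName(String fileName) {
    Matcher m = META_NAME.matcher(fileName);
    if (!m.matches()) {
      throw new IllegalArgumentException("not a meta file name: " + fileName);
    }
    return Long.parseLong(m.group(2));
  }

  public static void main(String[] args) {
    System.out.println(genStampFromMetaName("blk_1073741825_1002.meta")); // 1002
    System.out.println(genStampFromMetaName("blk_1073741825_1001.meta")); // 1001 (stale)
  }
}
{code}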




> NameNode should remove all invalid corrupted blocks when starting active 
> service
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-16987
>                 URL: https://issues.apache.org/jira/browse/HDFS-16987
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Critical
>              Labels: pull-request-available
>
> In our prod environment, we encountered an incident where an HA failover 
> produced new corrupted blocks, causing some jobs to fail.
>  
> We traced it down and found a bug in the processing of pending DataNode 
> messages when starting the active service (illustrated in the sketch after 
> the stack trace below).
> The steps to reproduce are as follows:
>  # Suppose NN1 is Active and NN2 is Standby; the Active works well but the 
> Standby is unstable.
>  # Timing 1: the client creates a file, writes some data, and closes it.
>  # Timing 2: the client appends to this file, writes some data, and closes it.
>  # Timing 3: the Standby replays the close edit of the second (append) 
> operation.
>  # Timing 4: the Standby processes the blockReceivedAndDeleted of the first 
> (create) operation.
>  # Timing 5: the Standby processes the blockReceivedAndDeleted of the second 
> (append) operation.
>  # Timing 6: the admin switches the active NameNode from NN1 to NN2.
>  # Timing 7: the client fails to append data to this file.
> {code:java}
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): append: 
> lastBlock=blk_1073741825_1002 of src=/testCorruptedBlockAfterHAFailover is 
> not sufficiently replicated yet.
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:138)
>     at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2992)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:858)
>     at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:527)
>     at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1221)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1144)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3170) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
