[ https://issues.apache.org/jira/browse/HDFS-16987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719695#comment-17719695 ]
ASF GitHub Bot commented on HDFS-16987:
---------------------------------------

ZanderXu commented on PR #5583:
URL: https://github.com/apache/hadoop/pull/5583#issuecomment-1535815925

   > IIUC, the replica's GS is monotonically increasing at DataNode side, right?

   Should we consider the case of manual destruction, such as manually moving the meta file to a file with a smaller GS?


> NameNode should remove all invalid corrupted blocks when starting active service
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-16987
>                 URL: https://issues.apache.org/jira/browse/HDFS-16987
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Critical
>              Labels: pull-request-available
>
> In our prod environment, we encountered an incident where an HA failover
> produced new corrupted blocks, causing some jobs to fail.
>
> We traced it down and found a bug in the processing of all pending DN
> messages when starting active services.
> The steps to reproduce are as follows:
> # Suppose NN1 is Active and NN2 is Standby; Active works well and Standby is unstable.
> # Timing 1: the client creates a file, writes some data, and closes it.
> # Timing 2: the client appends to this file, writes some data, and closes it.
> # Timing 3: Standby replays the second close edit of this file.
> # Timing 4: Standby processes the blockReceivedAndDeleted of the first create operation.
> # Timing 5: Standby processes the blockReceivedAndDeleted of the second append operation.
> # Timing 6: the admin switches the active NameNode from NN1 to NN2.
> # Timing 7: the client fails to append data to this file.
> {code:java}
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): append: lastBlock=blk_1073741825_1002 of src=/testCorruptedBlockAfterHAFailover is not sufficiently replicated yet.
>       at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:138)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2992)
>       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:858)
>       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:527)
>       at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1221)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1144)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3170)
> {code}
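
The reproduction steps in the issue can be read as a generation-stamp (GS) ordering problem: Standby has already replayed the edits for the append, and only then sees the block report from the earlier create. The following is a minimal Java sketch of that reading, not Hadoop code. The class CorruptReplicaSketch, its methods, the flag clearCorruptOnMatch, the DataNode name "dn1", and the GS value 1001 for the first create are all assumptions made for illustration; only GS 1002 comes from the exception message above. The actual NameNode logic and the fix in PR #5583 may differ.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of one block's replica state on a NameNode. Hypothetical; not Hadoop code. */
class CorruptReplicaSketch {
    /** GS the NameNode has recorded for the block after replaying edits. */
    private final long storedGenStamp;
    /** Assumed switch: drop a stale "corrupt" mark when a matching report arrives. */
    private final boolean clearCorruptOnMatch;
    /** DataNodes that reported a replica whose GS matches the stored GS. */
    private final Set<String> matching = new HashSet<>();
    /** DataNodes whose replica is currently marked corrupt. */
    private final Set<String> corrupt = new HashSet<>();

    CorruptReplicaSketch(long storedGenStamp, boolean clearCorruptOnMatch) {
        this.storedGenStamp = storedGenStamp;
        this.clearCorruptOnMatch = clearCorruptOnMatch;
    }

    /** Apply one replica report from a DataNode. */
    void processReport(String dataNode, long reportedGenStamp) {
        if (reportedGenStamp < storedGenStamp) {
            // Older GS than the namespace expects: treat the replica as corrupt.
            matching.remove(dataNode);
            corrupt.add(dataNode);
        } else if (reportedGenStamp == storedGenStamp) {
            matching.add(dataNode);
            if (clearCorruptOnMatch) {
                // The earlier mark came from an out-of-date report, so remove it.
                corrupt.remove(dataNode);
            }
        }
        // Reports with a newer GS are ignored in this toy model.
    }

    /** True if enough matching replicas are not marked corrupt. */
    boolean isSufficientlyReplicated(int minReplication) {
        int usable = 0;
        for (String dn : matching) {
            if (!corrupt.contains(dn)) {
                usable++;
            }
        }
        return usable >= minReplication;
    }

    public static void main(String[] args) {
        // Timing 3: the append's close edit has been replayed, so the stored GS is 1002.
        CorruptReplicaSketch stale = new CorruptReplicaSketch(1002L, false);
        stale.processReport("dn1", 1001L); // Timing 4: report from the first create (assumed GS)
        stale.processReport("dn1", 1002L); // Timing 5: report from the append
        System.out.println("stale mark kept:    " + stale.isSufficientlyReplicated(1));

        CorruptReplicaSketch cleaned = new CorruptReplicaSketch(1002L, true);
        cleaned.processReport("dn1", 1001L);
        cleaned.processReport("dn1", 1002L);
        System.out.println("stale mark removed: " + cleaned.isSufficientlyReplicated(1));
    }
}
```

In this sketch, keeping the stale corrupt mark leaves the block with no usable replica, which would match the "not sufficiently replicated yet" error at Timing 7. Removing the mark once a report with the matching GS arrives is one way to read the issue title.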