ZanderXu created HDFS-16987:
-------------------------------
Summary: HA Failover may cause some corrupted blocks
Key: HDFS-16987
URL: https://issues.apache.org/jira/browse/HDFS-16987
Project: Hadoop HDFS
Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu
In our prod environment, we encountered an incident where an HA failover
produced new corrupted blocks, which caused some jobs to fail.
We traced the problem to a bug in the processing of all pending DN messages when
starting active services.
The steps to reproduce are as follows:
# Suppose NN1 is Active and NN2 is Standby. The Active works well and the
Standby is unstable.
# Timing 1: the client creates a file, writes some data, and closes it.
# Timing 2: the client appends to this file, writes some data, and closes it.
# Timing 3: the Standby replays the edits for the second close of this file.
# Timing 4: the Standby processes the blockReceivedAndDeleted of the first
(create) operation.
# Timing 5: the Standby processes the blockReceivedAndDeleted of the second
(append) operation.
# Timing 6: the admin switches the active NameNode from NN1 to NN2.
# Timing 7: the client fails to append data to this file, with the following error:
{code:java}
org.apache.hadoop.ipc.RemoteException(java.io.IOException): append: lastBlock=blk_1073741825_1002 of src=/testCorruptedBlockAfterHAFailover is not sufficiently replicated yet.
    at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:138)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2992)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:858)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:527)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1221)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1144)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3170)
{code}
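To make the timeline above easier to follow, here is a toy model of timings 3 to 6. It is not HDFS code and does not use any HDFS API. The class FailoverTimelineSketch, the StandbyView type, the DataNode name "dn1", the generation stamp 1001 for the create, and the clearStaleCorruptMarkers step are all hypothetical names and values introduced for illustration. The sketch also assumes that a report carrying an older generation stamp than the replayed edits leaves a corrupt marker that a later valid report does not clear; the report above does not spell out this mechanism.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Toy model of the timeline in the report. This is NOT HDFS code: the class,
 * its fields, and the "stale corrupt marker" behaviour are assumptions made
 * for illustration only.
 */
public class FailoverTimelineSketch {

    /** Minimal stand-in for a standby NameNode's view of one block. */
    static final class StandbyView {
        long storedGenStamp;                           // genstamp known from replayed edits
        final Set<String> reported = new HashSet<>();  // DNs that reported a current replica
        final Set<String> corrupt = new HashSet<>();   // DNs whose replica is marked corrupt

        /** Timing 3: replaying the close edit advances the stored genstamp. */
        void replayCloseEdit(long genStamp) {
            storedGenStamp = Math.max(storedGenStamp, genStamp);
        }

        /** Timings 4 and 5: handle one incremental block report from a DN. */
        void blockReceived(String dn, long reportedGenStamp) {
            if (reportedGenStamp < storedGenStamp) {
                corrupt.add(dn);   // report looks stale relative to the replayed edits
            } else {
                reported.add(dn);  // assumed: the earlier corrupt marker is kept
            }
        }

        /** Hypothetical cleanup when becoming active: drop outdated markers. */
        void clearStaleCorruptMarkers() {
            corrupt.removeIf(reported::contains);
        }

        /** Replicas that were reported and are not marked corrupt. */
        int liveReplicas() {
            Set<String> live = new HashSet<>(reported);
            live.removeAll(corrupt);
            return live.size();
        }
    }

    /** Runs timings 3 to 6 and returns the live replica count after failover. */
    static int liveReplicasAfterFailover(boolean cleanupOnFailover) {
        StandbyView nn2 = new StandbyView();
        nn2.replayCloseEdit(1002);       // Timing 3: second close edit replayed
        nn2.blockReceived("dn1", 1001);  // Timing 4: report from the create (assumed genstamp)
        nn2.blockReceived("dn1", 1002);  // Timing 5: report from the append
        if (cleanupOnFailover) {         // Timing 6: NN2 becomes active
            nn2.clearStaleCorruptMarkers();
        }
        return nn2.liveReplicas();
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + liveReplicasAfterFailover(false));
        System.out.println("with cleanup:    " + liveReplicasAfterFailover(true));
    }
}
```

In this model, the run without cleanup ends with no live replica for the block, which is consistent with the "not sufficiently replicated yet" error at timing 7. The cleanup branch only shows one possible direction for handling the stale marker and should not be read as the actual fix.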