[ https://issues.apache.org/jira/browse/HDFS-13758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
chencan updated HDFS-13758:
---------------------------
    Assignee: chencan
      Status: Patch Available  (was: Open)

> DatanodeManager should throw exception if it has BlockRecoveryCommand but the
> block is not under construction
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13758
>                 URL: https://issues.apache.org/jira/browse/HDFS-13758
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Wei-Chiu Chuang
>            Assignee: chencan
>            Priority: Major
>         Attachments: HDFS-10240 scenarios.jpg, HDFS-13758.001.patch
>
>
> In Hadoop 3, HDFS-8909 added an assertion assuming that if a
> BlockRecoveryCommand exists for a block, the block is under construction.
> {code:title=DatanodeManager#getBlockRecoveryCommand()}
> BlockRecoveryCommand brCommand = new BlockRecoveryCommand(blocks.length);
> for (BlockInfo b : blocks) {
>   BlockUnderConstructionFeature uc = b.getUnderConstructionFeature();
>   assert uc != null;
>   ...
> {code}
> This assertion accidentally fixed one of the possible scenarios of the
> HDFS-10240 data corruption: a recoverLease() immediately followed by a
> close(), before DataNodes have a chance to heartbeat.
> In a unit test you'll get:
> {noformat}
> 2018-07-19 09:43:41,331 [IPC Server handler 9 on 57890] WARN ipc.Server
> (Server.java:logException(2724)) - IPC Server handler 9 on 57890, call
> Call#41 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 127.0.0.1:57903
> java.lang.AssertionError
>     at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.getBlockRecoveryCommand(DatanodeManager.java:1551)
>     at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleHeartbeat(DatanodeManager.java:1661)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleHeartbeat(FSNamesystem.java:3865)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendHeartbeat(NameNodeRpcServer.java:1504)
>     at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat(DatanodeProtocolServerSideTranslatorPB.java:119)
>     at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:31660)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1689)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
> {noformat}
> I propose to change this assertion even though it addresses the data
> corruption, because:
> # We should throw a more meaningful exception than an NPE.
> # On a production cluster, the assert is ignored, and you'll get a more
> noticeable NPE. Future HDFS developers might fix this NPE, causing a
> regression.
> An NPE is typically not caught and handled, so it may leave internal state
> inconsistent.
> # It doesn't address all possible scenarios of HDFS-10240. A proper fix
> should reject close() if the block is being recovered.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
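> Point 1 of the proposal above — replace the assert with an explicit check
> that throws a descriptive exception — can be sketched as follows. This is an
> illustrative rewrite, not the actual HDFS-13758.001.patch; the types are
> simplified stand-ins for the Hadoop classes.

```java
import java.io.IOException;

// Sketch: an explicit null check that throws a meaningful IOException, so
// the behavior is identical with or without -ea and the error names the
// offending block. Block/UcFeature are hypothetical stand-ins.
public class RecoverySketch {
    static class UcFeature { }

    static class Block {
        final long id;
        final UcFeature uc;
        Block(long id, UcFeature uc) { this.id = id; this.uc = uc; }
        UcFeature getUnderConstructionFeature() { return uc; }
    }

    // Replaces "assert uc != null" with a descriptive exception.
    static UcFeature checkUnderConstruction(Block b) throws IOException {
        UcFeature uc = b.getUnderConstructionFeature();
        if (uc == null) {
            throw new IOException("Block " + b.id
                + " has a BlockRecoveryCommand but is not under construction");
        }
        return uc;
    }

    public static void main(String[] args) {
        try {
            checkUnderConstruction(new Block(1001L, null));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

> Unlike the assert, this check is always on, and the caller can catch the
> IOException and skip building the recovery command for that block instead
> of failing the whole heartbeat.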