[ https://issues.apache.org/jira/browse/HDFS-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Haoze Wu updated HDFS-17157:
----------------------------
    Description: 
This case is related to HDFS-12070.

In HDFS-12070, we saw how a faulty drive at one datanode could cause block recovery to fail permanently and leave the file open indefinitely. With that patch, lease recovery no longer fails as soon as the second stage of block recovery fails at a single datanode; it fails only if the second stage fails at all of the datanodes.

Below is the code snippet for the second stage of block recovery, in BlockRecoveryWorker#syncBlock:
{code:java}
...
final List<BlockRecord> successList = new ArrayList<>();
for (BlockRecord r : participatingList) {
  try {
    r.updateReplicaUnderRecovery(bpid, recoveryId, blockId,
        newBlock.getNumBytes());
    successList.add(r);
  } catch (IOException e) {
...{code}
However, because of a transient network failure, the updateReplicaUnderRecovery RPC initiated from the primary datanode to another datanode could fail with an EOFException, while the other side either does not process the RPC at all or throws an IOException when reading from the socket.
{code:java}
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:788)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1495)
    at org.apache.hadoop.ipc.Client.call(Client.java:1437)
    at org.apache.hadoop.ipc.Client.call(Client.java:1347)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy29.updateReplicaUnderRecovery(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.updateReplicaUnderRecovery(InterDatanodeProtocolTranslatorPB.java:112)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.updateReplicaUnderRecovery(BlockRecoveryWorker.java:88)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.access$700(BlockRecoveryWorker.java:71)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:300)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:188)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:606)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1796)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1165)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1061)
{code}
If the second stage of block recovery then succeeds at any other datanode, lease recovery succeeds and the file is closed. However, the last block was not synced to the datanode where the RPC failed, and this inconsistency could last for a very long time.

To fix the issue, I propose adding a configurable retry of the updateReplicaUnderRecovery RPC so that transient network failures can be mitigated.

  was:
This case is related to HDFS-12070.

In HDFS-12070, we saw how a faulty drive at one datanode could cause block recovery to fail permanently and leave the file open indefinitely. With that patch, lease recovery no longer fails as soon as the second stage of block recovery fails at a single datanode; it fails only if the second stage fails at all of the datanodes.

Below is the code snippet for the second stage of block recovery, in BlockRecoveryWorker#syncBlock:
{code:java}
...
final List<BlockRecord> successList = new ArrayList<>();
for (BlockRecord r : participatingList) {
  try {
    r.updateReplicaUnderRecovery(bpid, recoveryId, blockId,
        newBlock.getNumBytes());
    successList.add(r);
  } catch (IOException e) {
...{code}
However, because of a transient network failure, the updateReplicaUnderRecovery RPC initiated from the primary datanode to another datanode could fail with an EOFException, while the other side either does not process the RPC at all or throws an IOException when reading from the socket.
{code:java}
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:788)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1495)
    at org.apache.hadoop.ipc.Client.call(Client.java:1437)
    at org.apache.hadoop.ipc.Client.call(Client.java:1347)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy29.updateReplicaUnderRecovery(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.updateReplicaUnderRecovery(InterDatanodeProtocolTranslatorPB.java:112)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.updateReplicaUnderRecovery(BlockRecoveryWorker.java:88)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.access$700(BlockRecoveryWorker.java:71)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:300)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:188)
    at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:606)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1796)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1165)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1061)
{code}
If the second stage of block recovery then succeeds at any other datanode, lease recovery succeeds and the file is closed. However, the last block was not synced to the datanode where the RPC failed, and this inconsistency could last for a very long time.

To fix the issue, I propose adding a configurable retry of the updateReplicaUnderRecovery RPC so that transient network failures can be tolerated.


> Transient network failure in lease recovery could lead to the block in a
> datanode in an inconsistent state for a long time
> --------------------------------------------------------------------------
>
>                 Key: HDFS-17157
>                 URL: https://issues.apache.org/jira/browse/HDFS-17157
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 2.0.0-alpha
>            Reporter: Haoze Wu
>            Priority: Major
>
> This case is related to HDFS-12070.
> In HDFS-12070, we saw how a faulty drive at one datanode could cause block
> recovery to fail permanently and leave the file open indefinitely. With that
> patch, lease recovery no longer fails as soon as the second stage of block
> recovery fails at a single datanode; it fails only if the second stage fails
> at all of the datanodes.
> Below is the code snippet for the second stage of block recovery, in
> BlockRecoveryWorker#syncBlock:
> {code:java}
> ...
> final List<BlockRecord> successList = new ArrayList<>();
> for (BlockRecord r : participatingList) {
>   try {
>     r.updateReplicaUnderRecovery(bpid, recoveryId, blockId,
>         newBlock.getNumBytes());
>     successList.add(r);
>   } catch (IOException e) {
> ...{code}
> However, because of a transient network failure, the
> updateReplicaUnderRecovery RPC initiated from the primary datanode to another
> datanode could fail with an EOFException, while the other side either does
> not process the RPC at all or throws an IOException when reading from the
> socket.
> {code:java}
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:788)
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1495)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1437)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1347)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>     at com.sun.proxy.$Proxy29.updateReplicaUnderRecovery(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.updateReplicaUnderRecovery(InterDatanodeProtocolTranslatorPB.java:112)
>     at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.updateReplicaUnderRecovery(BlockRecoveryWorker.java:88)
>     at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.access$700(BlockRecoveryWorker.java:71)
>     at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:300)
>     at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:188)
>     at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:606)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1796)
>     at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1165)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1061)
> {code}
> If the second stage of block recovery then succeeds at any other datanode,
> lease recovery succeeds and the file is closed. However, the last block was
> not synced to the datanode where the RPC failed, and this inconsistency
> could last for a very long time.
> To fix the issue, I propose adding a configurable retry of the
> updateReplicaUnderRecovery RPC so that transient network failures can be
> mitigated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
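
The configurable retry proposed in the description could look roughly like the sketch below. It is only an illustration, not the actual patch: the class RpcRetry, the interface IoCall, and the parameters maxRetries and retryIntervalMs are hypothetical names that do not exist in Hadoop, and in a real change the two values would presumably be read from new datanode configuration keys.

```java
import java.io.IOException;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class RpcRetry {

  /** An RPC-like action that may fail with an IOException. */
  @FunctionalInterface
  interface IoCall<T> {
    T call() throws IOException;
  }

  private RpcRetry() {}

  /**
   * Runs the call once and, on IOException, retries it up to maxRetries
   * more times, sleeping retryIntervalMs between attempts. If every
   * attempt fails, the last IOException is rethrown to the caller.
   */
  static <T> T callWithRetry(IoCall<T> call, int maxRetries, long retryIntervalMs)
      throws IOException, InterruptedException {
    if (maxRetries < 0) {
      throw new IllegalArgumentException("maxRetries must be >= 0");
    }
    IOException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return call.call();
      } catch (IOException e) {
        last = e;
        // Sleep only if another attempt will follow.
        if (attempt < maxRetries) {
          Thread.sleep(retryIntervalMs);
        }
      }
    }
    throw last;
  }
}
```

In syncBlock, the call to r.updateReplicaUnderRecovery(...) inside the try block would be the action wrapped by such a helper, so the existing catch (IOException e) branch is reached only after the retries are exhausted. Whether it is safe to resend the RPC when the remote datanode may already have processed the first attempt is a question the real patch would have to answer; this sketch does not address it.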