[jira] [Commented] (HDFS-6804) race condition between transferring block and appending block causes "Unexpected checksum mismatch exception"

Max Schmidt (JIRA) Thu, 05 Nov 2015 03:58:58 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-6804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991579#comment-14991579
 ]


Max Schmidt commented on HDFS-6804:
-----------------------------------

We can confirm this in a 3-node cluster with replication-factor=2.

The Exception happend, following this stacktrace, when accessing the corrupt 
file:

Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): append: 
lastBlock=blk_1073742163_2852 of src=testfile is not sufficiently replica
ted yet.
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2692)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2985)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2952)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:653)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:421)
        at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

        at org.apache.hadoop.ipc.Client.call(Client.java:1476)
        at org.apache.hadoop.ipc.Client.call(Client.java:1407)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy9.append(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.append(ClientNamenodeProtocolTranslatorPB.java:328)
        at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy10.append(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1822)
        at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1885)
        at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1855)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:340)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:336)
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:348)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:318)
        at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1164)
        at 
io.datapath.ps.FileAccessWriteImpl.<init>(FileAccessWriteImpl.java:29)
        ... 8 more


Is there a way to manually replicate the file?

> race condition between transferring block and appending block causes 
> "Unexpected checksum mismatch exception" 
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6804
>                 URL: https://issues.apache.org/jira/browse/HDFS-6804
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.2.0
>            Reporter: Gordon Wang
>
> We found some error log in the datanode. like this
> {noformat}
> 2014-07-22 01:49:51,338 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Ex
> ception for BP-2072804351-192.168.2.104-1406008383435:blk_1073741997_9248
> java.io.IOException: Terminating due to a checksum error.java.io.IOException: 
> Unexpected checksum mismatch while writing 
> BP-2072804351-192.168.2.104-1406008383435:blk_1073741997_9248 from 
> /192.168.2.101:39495
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:536)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:703)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:575)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
> While on the source datanode, the log says the block is transmitted.
> {noformat}
> 2014-07-22 01:49:50,805 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Da
> taTransfer: Transmitted 
> BP-2072804351-192.168.2.104-1406008383435:blk_1073741997
> _9248 (numBytes=16188152) to /192.168.2.103:50010
> {noformat}
> When the destination datanode gets the checksum mismatch, it reports bad 
> block to NameNode and NameNode marks the replica on the source datanode as 
> corrupt. But actually, the replica on the source datanode is valid. Because 
> the replica can pass the checksum verification.
> In all, the replica on the source data is wrongly marked as corrupted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HDFS-6804) race condition between transferring block and appending block causes "Unexpected checksum mismatch exception"

Reply via email to