[ https://issues.apache.org/jira/browse/HDFS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274245#comment-13274245 ]
Ashish Singhi commented on HDFS-3157:
-------------------------------------

Currently I am working on the following solution for the patch - rebuilding the blockInfo with just the reported block's genstamp and all other states the same as the storedBlock. Even with this solution, however, the test case may fail randomly. Reason: although the reported block is added to corruptReplicasMap, it is not invalidated on the DN that reported it, because a corrupt replica is only invalidated once the number of live replicas of the block equals the configured replication factor.

Problem - If chooseTarget() picks the same DN that reported the corrupt block, the replication fails with ReplicaAlreadyExistsException.

Now the question is: why does the NN pick the same DN that reported the corrupt block, rather than the 3rd DN?

Answer - The excludedNodes map contains only the one DN that holds the live replica of the block (i.e. the DN that has the block in its finalized folder).

The following partial logs depict the above scenario.
{code}
excludedNodes contains the following datanode/s. {127.0.0.1:54681=127.0.0.1:54681}
2012-05-12 23:57:33,773 INFO hdfs.StateChange (BlockManager.java:computeReplicationWorkForBlocks(1226)) - BLOCK* ask 127.0.0.1:54681 to replicate blk_3471690017167574595_1003 to datanode(s) 127.0.0.1:54041
2012-05-12 23:57:33,791 INFO datanode.DataNode (DataNode.java:transferBlock(1221)) - DatanodeRegistration(127.0.0.1, storageID=DS-1047816814-192.168.44.128-54681-1336847251649, infoPort=62840, ipcPort=26036, storageInfo=lv=-40;cid=testClusterID;nsid=1646783488;c=0) Starting thread to transfer block BP-1770179175-192.168.44.128-1336847247907:blk_3471690017167574595_1003 to 127.0.0.1:54041
2012-05-12 23:57:33,795 INFO hdfs.StateChange (BlockManager.java:processReport(1450)) - BLOCK* processReport: from DatanodeRegistration(127.0.0.1, storageID=DS-1047816814-192.168.44.128-54681-1336847251649, infoPort=62840, ipcPort=26036, storageInfo=lv=-40;cid=testClusterID;nsid=1646783488;c=0), blocks: 1, processing time: 0 msecs
2012-05-12 23:57:33,796 INFO datanode.DataNode (BPServiceActor.java:blockReport(404)) - BlockReport of 1 blocks took 0 msec to generate and 2 msecs for RPC and NN processing
2012-05-12 23:57:33,796 INFO datanode.DataNode (BPServiceActor.java:blockReport(423)) - sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@12eb0b3
2012-05-12 23:57:33,811 INFO datanode.DataNode (DataXceiver.java:writeBlock(342)) - Receiving block BP-1770179175-192.168.44.128-1336847247907:blk_3471690017167574595_1003 src: /127.0.0.1:33583 dest: /127.0.0.1:54041
2012-05-12 23:57:33,812 INFO datanode.DataNode (DataXceiver.java:writeBlock(495)) - opWriteBlock BP-1770179175-192.168.44.128-1336847247907:blk_3471690017167574595_1003 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1770179175-192.168.44.128-1336847247907:blk_3471690017167574595_1003 already exists in state RBW and thus cannot be created.
2012-05-12 23:57:33,814 ERROR datanode.DataNode (DataXceiver.java:run(193)) - 127.0.0.1:54041:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:33583 dest: /127.0.0.1:54041
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1770179175-192.168.44.128-1336847247907:blk_3471690017167574595_1003 already exists in state RBW and thus cannot be created.
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:795)
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:151)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:365)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:189)
	at java.lang.Thread.run(Thread.java:619)
2012-05-12 23:57:33,815 INFO datanode.DataNode (DataNode.java:run(1406)) - DataTransfer: Transmitted BP-1770179175-192.168.44.128-1336847247907:blk_3471690017167574595_1003 (numBytes=100) to /127.0.0.1:54041
2012-05-12 23:57:34,066 INFO hdfs.StateChange (BlockManager.java:processReport(1450)) - BLOCK* processReport: from DatanodeRegistration(127.0.0.1, storageID=DS-610636930-192.168.44.128-20029-1336847250644, infoPort=52843, ipcPort=46734, storageInfo=lv=-40;cid=testClusterID;nsid=1646783488;c=0), blocks: 0, processing time: 0 msecs
2012-05-12 23:57:34,067 INFO datanode.DataNode (BPServiceActor.java:blockReport(404)) - BlockReport of 0 blocks took 0 msec to generate and 3 msecs for RPC and NN processing
2012-05-12 23:57:34,068 INFO datanode.DataNode (BPServiceActor.java:blockReport(423)) - sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@a1364a
2012-05-12 23:57:34,099 INFO hdfs.StateChange (CorruptReplicasMap.java:addToCorruptReplicasMap(66)) - BLOCK NameSystem.addToCorruptReplicasMap: blk_3471690017167574595 added as corrupt on 127.0.0.1:54041 by /127.0.0.1 because reported RBW replica with genstamp 1002 does not match COMPLETE block's genstamp in block map 1003
2012-05-12 23:57:34,100 INFO hdfs.StateChange (BlockManager.java:processReport(1450)) - BLOCK* processReport: from DatanodeRegistration(127.0.0.1, storageID=DS-1452741455-192.168.44.128-54041-1336847250645, infoPort=10314, ipcPort=16230, storageInfo=lv=-40;cid=testClusterID;nsid=1646783488;c=0), blocks: 1, processing time: 2 msecs
2012-05-12 23:57:34,101 INFO datanode.DataNode (BPServiceActor.java:blockReport(404)) - BlockReport of 1 blocks took 0 msec to generate and 4 msecs for RPC and NN processing
2012-05-12 23:57:34,101 INFO datanode.DataNode (BPServiceActor.java:blockReport(423)) - sent block report, processed command:org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@17194a4
2012-05-12 23:57:34,775 INFO hdfs.StateChange (BlockManager.java:computeReplicationWorkForBlocks(1096)) - BLOCK* Removing block blk_3471690017167574595_1003 from neededReplications as it has enough replicas.
{code}

Here you can observe that the NN picks the same DN, 127.0.0.1:54041, that reported the corrupt block as the replication target, while the excludedNodes map contains only the one DN, 127.0.0.1:54681, that holds the live replica (printed on the first line of the logs). Is there any way to add the DN that reports the corrupt block to the excludedNodes map? A rough sketch of that idea follows.
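To make the idea concrete, here is a minimal, self-contained sketch in plain Java. It is *not* the real BlockManager/chooseTarget() code - the Datanode class and chooseTarget() method below are toy stand-ins of my own - it only illustrates the intent: put the reporting DN into the same excludedNodes map that already holds the DN with the live replica, so the third DN becomes the target.

{code}
// Toy illustration only; class and method names are hypothetical stand-ins,
// not the HDFS BlockManager / BlockPlacementPolicy API.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExcludeReportingNodeSketch {

  /** Stand-in for a DatanodeDescriptor; only the host:port is kept. */
  static class Datanode {
    final String hostPort;
    Datanode(String hostPort) { this.hostPort = hostPort; }
    @Override public String toString() { return hostPort; }
  }

  /** Pick the first cluster node not present in excludedNodes (loose model of chooseTarget()). */
  static Datanode chooseTarget(List<Datanode> cluster, Map<Datanode, Datanode> excludedNodes) {
    for (Datanode dn : cluster) {
      if (!excludedNodes.containsKey(dn)) {
        return dn;
      }
    }
    return null; // no eligible target
  }

  public static void main(String[] args) {
    Datanode dnWithLiveReplica  = new Datanode("127.0.0.1:54681");
    Datanode dnReportingCorrupt = new Datanode("127.0.0.1:54041");
    Datanode thirdDn            = new Datanode("127.0.0.1:20029");
    List<Datanode> cluster = new ArrayList<Datanode>();
    cluster.add(dnWithLiveReplica);
    cluster.add(dnReportingCorrupt);
    cluster.add(thirdDn);

    // Today only the DN holding the live replica ends up in excludedNodes,
    // so the DN that reported the corrupt replica is still a legal target.
    Map<Datanode, Datanode> excludedNodes = new HashMap<Datanode, Datanode>();
    excludedNodes.put(dnWithLiveReplica, dnWithLiveReplica);

    // Proposed idea: also exclude the DN that reported the corrupt replica,
    // so the remaining (third) DN is chosen instead.
    excludedNodes.put(dnReportingCorrupt, dnReportingCorrupt);

    System.out.println("chosen target: " + chooseTarget(cluster, excludedNodes));
    // With the extra exclusion the output is: chosen target: 127.0.0.1:20029
  }
}
{code}

If that direction sounds acceptable, the open question is where in the replication-work computation the reporting node could be recorded so that it reaches the exclusion map.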
> Error in deleting block is keep on coming from DN even after the block report and directory scanning has happened
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3157
>                 URL: https://issues.apache.org/jira/browse/HDFS-3157
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: J.Andreina
>            Assignee: Ashish Singhi
>             Fix For: 2.0.0, 3.0.0
>
>         Attachments: HDFS-3157.patch, HDFS-3157.patch, HDFS-3157.patch
>
>
> Cluster setup:
> 1NN, Three DN(DN1,DN2,DN3), replication factor-2, "dfs.blockreport.intervalMsec" 300, "dfs.datanode.directoryscan.interval" 1
> step 1: write one file "a.txt" with sync(not closed)
> step 2: Delete the blocks in one of the datanode say DN1(from rbw) to which replication happened.
> step 3: close the file.
> Since the replication factor is 2 the blocks are replicated to the other datanode.
> Then at the NN side the following cmd is issued to DN from which the block is deleted
> -------------------------------------------------------------------------------------
> {noformat}
> 2012-03-19 13:41:36,905 INFO org.apache.hadoop.hdfs.StateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_2903555284838653156 to add as corrupt on XX.XX.XX.XX by /XX.XX.XX.XX because reported RBW replica with genstamp 1002 does not match COMPLETE block's genstamp in block map 1003
> 2012-03-19 13:41:39,588 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* Removing block blk_2903555284838653156_1003 from neededReplications as it has enough replicas.
> {noformat}
> From the datanode side in which the block is deleted the following exception occured
> {noformat}
> 2012-02-29 13:54:13,126 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected error trying to delete block blk_2903555284838653156_1003. BlockInfo not found in volumeMap.
> 2012-02-29 13:54:13,126 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
> java.io.IOException: Error in deleting blocks.
> 	at org.apache.hadoop.hdfs.server.datanode.FSDataset.invalidate(FSDataset.java:2061)
> 	at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:581)
> 	at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:545)
> 	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:690)
> 	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:522)
> 	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:662)
> 	at java.lang.Thread.run(Thread.java:619)
> {noformat}
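For anyone trying to reproduce step 1 of the quoted description ("write one file with sync, not closed"), a small client along the following lines can be used. This is my own illustration, not part of the original report; it assumes a running cluster reachable via the default fs.defaultFS and uses hflush() (the replacement for the old sync() call) so the data reaches the DN pipeline while the replica stays in RBW.

{code}
// Illustrative repro for step 1 (sketch under the assumptions above):
// write "a.txt", flush it to the DN pipeline, but deliberately do not close it yet.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithoutClose {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up fs.defaultFS of the test cluster
    FileSystem fs = FileSystem.get(conf);

    FSDataOutputStream out = fs.create(new Path("/a.txt"), (short) 2 /* replication */);
    out.write(new byte[100]);                   // some data, so an RBW replica exists on the DNs
    out.hflush();                               // flush to the pipeline without closing the file

    // Step 2 of the repro happens here: manually delete the replica from DN1's rbw directory.
    Thread.sleep(60000);                        // window for the manual deletion
    out.close();                                // step 3: close the file, completing the block
  }
}
{code}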