[jira] [Commented] (HDFS-5225) datanode keeps logging the same 'is no longer in the dataset' message over and over again

Junping Du (JIRA) Wed, 18 Sep 2013 19:15:12 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771520#comment-13771520
 ]


Junping Du commented on HDFS-5225:
----------------------------------

Hi Colin and Roman, I think the following code won't delete block from 
blockInfoSet if block is already removed from blockMap:
{code}
  /** Deletes the block from internal structures */
  synchronized void deleteBlock(Block block) {
    BlockScanInfo info = blockMap.get(block);
    if ( info != null ) {
      delBlockInfo(info);
    }
  }
{code}
Then, if the block is happened to be the first block on blockInfoSet, then log 
will loop forever.
The reason of inconsistent between blockMap and blockInfoSet may because 
blockInfoSet is an unsynchronized TreeSet so is not thread-safe. May be we 
should replace it with ConcurrentSkipListMap (which keep order and 
concurrency). Thoughts?
                
> datanode keeps logging the same 'is no longer in the dataset' message over 
> and over again
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-5225
>                 URL: https://issues.apache.org/jira/browse/HDFS-5225
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.1.1-beta
>            Reporter: Roman Shaposhnik
>
> I was running the usual Bigtop testing on 2.1.1-beta RC1 with the following 
> configuration: 4 nodes fully distributed cluster with security on.
> All of a sudden my DN ate up all of the space in /var/log logging the 
> following message repeatedly:
> {noformat}
> 2013-09-18 20:51:12,046 INFO 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: 
> BP-1884637155-10.224.158.152-1379524544853:blk_1073742189_1369 is no longer 
> in the dataset
> {noformat}
> It wouldn't answer to a jstack and jstack -F ended up being useless.
> Here's what I was able to find in the NameNode logs regarding this block ID:
> {noformat}
> fgrep -rI 'blk_1073742189' hadoop-hdfs-namenode-ip-10-224-158-152.log
> 2013-09-18 18:03:16,972 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocateBlock: 
> /user/jenkins/testAppendInputWedSep18180222UTC2013/test4.fileWedSep18180222UTC2013._COPYING_.
>  BP-1884637155-10.224.158.152-1379524544853 
> blk_1073742189_1369{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, 
> replicas=[ReplicaUnderConstruction[10.83.107.80:1004|RBW], 
> ReplicaUnderConstruction[10.34.74.206:1004|RBW], 
> ReplicaUnderConstruction[10.224.158.152:1004|RBW]]}
> 2013-09-18 18:03:17,222 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.224.158.152:1004 is added to 
> blk_1073742189_1369{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, 
> replicas=[ReplicaUnderConstruction[10.83.107.80:1004|RBW], 
> ReplicaUnderConstruction[10.34.74.206:1004|RBW], 
> ReplicaUnderConstruction[10.224.158.152:1004|RBW]]} size 0
> 2013-09-18 18:03:17,222 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.34.74.206:1004 is added to 
> blk_1073742189_1369{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, 
> replicas=[ReplicaUnderConstruction[10.83.107.80:1004|RBW], 
> ReplicaUnderConstruction[10.34.74.206:1004|RBW], 
> ReplicaUnderConstruction[10.224.158.152:1004|RBW]]} size 0
> 2013-09-18 18:03:17,224 INFO BlockStateChange: BLOCK* addStoredBlock: 
> blockMap updated: 10.83.107.80:1004 is added to 
> blk_1073742189_1369{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, 
> replicas=[ReplicaUnderConstruction[10.83.107.80:1004|RBW], 
> ReplicaUnderConstruction[10.34.74.206:1004|RBW], 
> ReplicaUnderConstruction[10.224.158.152:1004|RBW]]} size 0
> 2013-09-18 18:03:17,899 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
> updatePipeline(block=BP-1884637155-10.224.158.152-1379524544853:blk_1073742189_1369,
>  newGenerationStamp=1370, newLength=1048576, newNodes=[10.83.107.80:1004, 
> 10.34.74.206:1004, 10.224.158.152:1004], 
> clientName=DFSClient_NONMAPREDUCE_-450304083_1)
> 2013-09-18 18:03:17,904 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
> updatePipeline(BP-1884637155-10.224.158.152-1379524544853:blk_1073742189_1369)
>  successfully to 
> BP-1884637155-10.224.158.152-1379524544853:blk_1073742189_1370
> 2013-09-18 18:03:18,540 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
> updatePipeline(block=BP-1884637155-10.224.158.152-1379524544853:blk_1073742189_1370,
>  newGenerationStamp=1371, newLength=2097152, newNodes=[10.83.107.80:1004, 
> 10.34.74.206:1004, 10.224.158.152:1004], 
> clientName=DFSClient_NONMAPREDUCE_-450304083_1)
> 2013-09-18 18:03:18,548 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
> updatePipeline(BP-1884637155-10.224.158.152-1379524544853:blk_1073742189_1370)
>  successfully to 
> BP-1884637155-10.224.158.152-1379524544853:blk_1073742189_1371
> 2013-09-18 18:03:26,150 INFO BlockStateChange: BLOCK* addToInvalidates: 
> blk_1073742189_1371 10.83.107.80:1004 10.34.74.206:1004 10.224.158.152:1004 
> 2013-09-18 18:03:26,847 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> InvalidateBlocks: ask 10.34.74.206:1004 to delete [blk_1073742178_1359, 
> blk_1073742183_1362, blk_1073742184_1363, blk_1073742186_1366, 
> blk_1073742188_1368, blk_1073742189_1371]
> 2013-09-18 18:03:29,848 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> InvalidateBlocks: ask 10.224.158.152:1004 to delete [blk_1073742177_1353, 
> blk_1073742178_1359, blk_1073742179_1355, blk_1073742180_1356, 
> blk_1073742181_1358, blk_1073742182_1361, blk_1073742185_1364, 
> blk_1073742187_1367, blk_1073742188_1368, blk_1073742189_1371]
> 2013-09-18 18:03:29,848 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> InvalidateBlocks: ask 10.83.107.80:1004 to delete [blk_1073742177_1353, 
> blk_1073742178_1359, blk_1073742179_1355, blk_1073742180_1356, 
> blk_1073742181_1358, blk_1073742182_1361, blk_1073742183_1362, 
> blk_1073742184_1363, blk_1073742185_1364, blk_1073742186_1366, 
> blk_1073742187_1367, blk_1073742188_1368, blk_1073742189_1371]
> {noformat}
> This seems to suggest that the block was successfully deleted, but then DN 
> got into a death spiral inside of a scanner.
> I can keep the cluster running for a few days if anybody is willing to take a 
> look. Ask me for creds via a personal email.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HDFS-5225) datanode keeps logging the same 'is no longer in the dataset' message over and over again

Reply via email to