Here's another hypothesis. You'll have to check your namenode logs to see if it is the correct one.
In a healthy cluster, when one datanode goes down, the namenode will "notice" it about 10 minutes later and mark all of its blocks as under-replicated (in namenode memory). It will then generate replication requests to other holders of those blocks, to get each block back to its required replication count. In a large cluster this process can take as little as a few minutes, because there are many candidate senders and receivers for the copy operations.

When the datanode comes back a whole day later and sends its Initial Block Report, which we'll assume still lists all the blocks it had before (since the disks were not corrupted), all of those blocks will be marked as OVER-replicated. The namenode will then generate delete requests for one replica of each of those blocks. Now, I would have expected those delete requests to be distributed randomly across all nodes holding replicas of those blocks, but there is some indication in the code that all the deletes may go to the node that just reported the excess replicas, especially if the number of blocks is small (say, under one thousand). I'm not sure whether this is a sufficient explanation or not.

If the block deletes were ordered by the namenode, they may be in the namenode logs with the prefix "hdfs.StateChange: BLOCK*", like this:

11/04/12 10:00:28 INFO hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (127.0.0.1:49946, blk_-44847565016350316_1001) is added to recentInvalidateSets
11/04/12 10:00:28 INFO hdfs.StateChange: BLOCK* ask 127.0.0.1:49946 to delete blk_-44847565016350316_1001

Obviously you have to be logging at INFO level for these block-level manipulations to appear in your namenode logs. The datanode then echoes them with log lines like:

11/04/12 10:00:29 INFO datanode.DataNode: Scheduling block blk_-44847565016350316_1001 file .../current/finalized/blk_-44847565016350316 for deletion
11/04/12 10:00:29 INFO datanode.DataNode: Deleted block blk_-44847565016350316_1001 at file .../current/finalized/blk_-44847565016350316

Your log extract below shows that the datanode was logging at INFO level. See if your namenode was too, and whether you can show that it generated a bunch of delete requests shortly after your repaired datanode came up.
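Something along these lines would count those delete requests. This is an untested sketch: the namenode log path, restart time, and search window are placeholders to adjust for your cluster, and the regex assumes your namenode log uses the same yyyy-MM-dd timestamp layout as your datanode log rather than the yy/MM/dd layout in my example lines above.

#!/usr/bin/env python
# Untested sketch: count "BLOCK* ask <datanode> to delete ..." lines that the
# namenode logged shortly after the repaired datanode re-registered.
# The log path, datanode address, restart time, and window are placeholders.
import re
from datetime import datetime, timedelta

NAMENODE_LOG = "/var/log/hadoop/hadoop-namenode.log"   # placeholder path
DATANODE = "69.170.130.173:50010"                      # the repaired datanode
RESTART = datetime(2011, 4, 11, 14, 34, 10)            # when it came back up
WINDOW = timedelta(hours=1)                            # how far past restart to look

# Matches lines like:
# 2011-04-11 14:35:02,123 INFO hdfs.StateChange: BLOCK* ask 69.170.130.173:50010 to delete blk_...
pattern = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*BLOCK\* ask "
                     + re.escape(DATANODE) + r" to delete")

count = 0
with open(NAMENODE_LOG) as log:
    for line in log:
        m = pattern.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if RESTART <= ts <= RESTART + WINDOW:
            count += 1

print("delete requests sent to %s within %s of restart: %d" % (DATANODE, WINDOW, count))

A large count concentrated in that window would support the over-replication explanation; a handful spread across the day would not.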
Cheers,
--Matt

On Apr 12, 2011, at 9:30 AM, felix gao wrote:

mapred.local.dir is under /hadoop/mapred, dfs.data.dir is /hadoop/dfs.

The logs show:

2011-04-11 14:34:10,987 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
2011-04-11 14:34:10,987 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50020: starting
2011-04-11 14:34:10,988 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020: starting
2011-04-11 14:34:10,988 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = DatanodeRegistration(had41.xxx:50010, storageID=DS-922075132-69.170.130.173-50010-1297386088418, infoPort=50075, ipcPort=50020)
2011-04-11 14:34:10,988 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020: starting
2011-04-11 14:34:11,021 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(69.170.130.173:50010, storageID=DS-922075132-69.170.130.173-50010-1297386088418, infoPort=50075, ipcPort=50020)In DataNode.run, data = FSDataset{dirpath='/hadoop/dfs/current'}
2011-04-11 14:34:11,021 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: using BLOCKREPORT_INTERVAL of 3600000msec Initial delay: 0msec
2011-04-11 14:34:15,545 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 110651 blocks got processed in 4493 msecs
2011-04-11 14:34:15,545 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting Periodic block scanner.
2011-04-11 14:34:15,692 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block blk_-9213431071914219029_3395500 file /hadoop/dfs/current/subdir7/subdir26/blk_-9213431071914219029

then followed by similar lines for all the blocks in the dfs/current directory. It seems to me Hadoop just wanted to invalidate all the blocks on that box by deleting all of them.

On Tue, Apr 12, 2011 at 9:17 AM, Harsh J <ha...@cloudera.com> wrote:

Apart from ensuring all that others here have said, are your mapred.local.dir and dfs.data.dir pointing to the same directory by any chance? If that happens to be the case, the tasktracker could potentially wipe out the datanode directories when restarted.

On Tue, Apr 12, 2011 at 8:16 PM, felix gao <gre1...@gmail.com> wrote:
> What reason/condition would cause a datanode's blocks to be removed? Our
> cluster had one of its datanodes crash because of bad RAM. After the
> system was upgraded and the datanode/tasktracker brought online the next day,
> we noticed the amount of space utilized was minimal and the cluster was
> rebalancing blocks to the datanode. It would seem the prior blocks were
> removed. Was this because the datanode was declared dead? What is the
> criteria for the namenode to decide (assuming it's the namenode) when a
> datanode should remove prior blocks?

--
Harsh J
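As a quick way to check Harsh's question above about mapred.local.dir and dfs.data.dir overlapping, something like the untested sketch below could be run on a datanode. The config file locations are placeholders for wherever your site files actually live.

#!/usr/bin/env python
# Untested sketch: warn if dfs.data.dir and mapred.local.dir share a directory.
# The config file locations below are placeholders, not values from this thread.
import os
import xml.etree.ElementTree as ET

CONF_FILES = ["/etc/hadoop/conf/hdfs-site.xml",    # placeholder locations
              "/etc/hadoop/conf/mapred-site.xml",
              "/etc/hadoop/conf/hadoop-site.xml"]

def dirs_for(name):
    # Return the comma-separated directory list configured for a property.
    for path in CONF_FILES:
        if not os.path.exists(path):
            continue
        for prop in ET.parse(path).getroot().findall("property"):
            if prop.findtext("name") == name:
                value = prop.findtext("value") or ""
                return set(os.path.normpath(d.strip()) for d in value.split(",") if d.strip())
    return set()

data_dirs = dirs_for("dfs.data.dir")
local_dirs = dirs_for("mapred.local.dir")
overlap = data_dirs & local_dirs

if overlap:
    print("WARNING: dfs.data.dir and mapred.local.dir overlap: %s" % sorted(overlap))
else:
    print("dfs.data.dir=%s mapred.local.dir=%s (no exact overlap)" % (sorted(data_dirs), sorted(local_dirs)))

Note that this only flags exact matches, so a mapred.local.dir nested inside a dfs.data.dir (or vice versa) would still need a manual look.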