Here's another hypothesis.  You'll have to check your namenode logs to see if 
it is the correct one.

In a healthy cluster, when one datanode goes down, 10 minutes later the 
namenode will "notice" it, and mark all its blocks as under-replicated (in 
namenode memory).  It will then generate replication requests to other holders 
of those blocks, to get back to the required replication count for each.  In a 
large cluster, this process can take as little as a few minutes, because there 
are many candidate senders and receivers for the copy operations.
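For reference, that 10-minute figure is roughly the namenode's dead-node timeout, which is derived from the heartbeat settings. A minimal back-of-the-envelope sketch follows; the property names and defaults are from my memory of 0.20-era Hadoop, so verify them against your own hdfs-site.xml:

    # Back-of-the-envelope calculation of the namenode's dead-node timeout.
    # Property names and defaults below are from memory for 0.20-era Hadoop;
    # check your own hdfs-site.xml before trusting the exact numbers.
    heartbeat_interval_ms = 3 * 1000      # dfs.heartbeat.interval, ~3 s default
    recheck_interval_ms = 5 * 60 * 1000   # heartbeat.recheck.interval, ~5 min default

    # The namenode treats a datanode as dead after roughly
    #   2 * recheck interval + 10 * heartbeat interval
    dead_node_timeout_ms = 2 * recheck_interval_ms + 10 * heartbeat_interval_ms
    print(dead_node_timeout_ms / 60000.0)  # about 10.5 minutes with these defaults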

When the datanode comes back a whole day later and sends its Initial Block 
Report (which we'll assume still lists all the blocks it had before, since the 
disks were not corrupted), all of those blocks will be marked as 
OVER-replicated.  The namenode will then generate a delete request for one 
replica of each of those blocks.
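Just to make the bookkeeping concrete, here is a tiny conceptual sketch (not the actual namenode code) of what happens to each block's replica count, assuming a replication factor of 3:

    # Conceptual sketch only, not the namenode implementation.
    TARGET_REPLICATION = 3

    def excess_replicas(live_replicas):
        """Number of replicas of a block that must be scheduled for deletion."""
        return max(0, live_replicas - TARGET_REPLICATION)

    # While the node was down, re-replication restored each block to 3 copies.
    # Its block report then adds a 4th copy, so exactly one must be invalidated.
    print(excess_replicas(3))  # 0 -> nothing to delete
    print(excess_replicas(4))  # 1 -> delete one replica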

Now, I would have expected those delete requests to be randomly distributed 
across all the nodes holding replicas of those blocks.  But there is some 
indication in the code that all the deletes may go to the newly returned node, 
i.e. the source of the excess replicas, especially if the number of blocks is 
small (say, under one thousand).  I'm not sure whether this is a sufficient 
explanation or not.

If the block deletes were ordered by the namenode, they may be in the namenode 
logs, with a prefix "hdfs.StateChange: BLOCK*", like this:
11/04/12 10:00:28 INFO hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (127.0.0.1:49946, blk_-44847565016350316_1001) is added to recentInvalidateSets
11/04/12 10:00:28 INFO hdfs.StateChange: BLOCK* ask 127.0.0.1:49946 to delete blk_-44847565016350316_1001

Obviously you have to be logging at INFO level for these block-level 
manipulations to be in your namenode logs.
The datanode then echoes each delete with log lines like:
11/04/12 10:00:29 INFO datanode.DataNode: Scheduling block blk_-44847565016350316_1001 file .../current/finalized/blk_-44847565016350316 for deletion
11/04/12 10:00:29 INFO datanode.DataNode: Deleted block blk_-44847565016350316_1001 at file .../current/finalized/blk_-44847565016350316

Your log extract below shows that the datanode was logging at INFO level.  See 
if your namenode was too, and if you can show that it generated a bunch of 
delete requests shortly after your repaired datanode came up.
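If it helps, here is a rough sketch of a script that tallies those delete requests per target datanode from a namenode log.  It keys off the "ask ... to delete" lines quoted above; the log file path is whatever your namenode writes to, and the message format may differ between versions, so treat the regex as a starting point:

    import re
    import sys
    from collections import Counter

    # Count "BLOCK* ask <datanode> to delete <block>" lines per target datanode.
    ASK_DELETE = re.compile(r"BLOCK\* ask (\S+) to delete")

    counts = Counter()
    with open(sys.argv[1]) as log:        # pass the namenode log file path
        for line in log:
            match = ASK_DELETE.search(line)
            if match:
                counts[match.group(1)] += 1

    for node, num_deletes in counts.most_common():
        print(node, num_deletes)

If nearly all the deletes in the window shortly after 14:34 on Apr 11 went to the repaired node's address, that would support the hypothesis above; if they were spread across the cluster, we need another explanation.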

Cheers,
--Matt


On Apr 12, 2011, at 9:30 AM, felix gao wrote:

mapred.local.dir is under /hadoop/mapred and dfs.data.dir is /hadoop/dfs.  The 
logs show:

2011-04-11 14:34:10,987 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
2011-04-11 14:34:10,987 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50020: starting
2011-04-11 14:34:10,988 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020: starting
2011-04-11 14:34:10,988 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = DatanodeRegistration(had41.xxx:50010, storageID=DS-922075132-69.170.130.173-50010-1297386088418, infoPort=50075, ipcPort=50020)
2011-04-11 14:34:10,988 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020: starting
2011-04-11 14:34:11,021 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(69.170.130.173:50010, storageID=DS-922075132-69.170.130.173-50010-1297386088418, infoPort=50075, ipcPort=50020)In DataNode.run, data = FSDataset{dirpath='/hadoop/dfs/current'}
2011-04-11 14:34:11,021 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: using BLOCKREPORT_INTERVAL of 3600000msec Initial delay: 0msec
2011-04-11 14:34:15,545 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 110651 blocks got processed in 4493 msecs
2011-04-11 14:34:15,545 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting Periodic block scanner.
2011-04-11 14:34:15,692 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block blk_-9213431071914219029_3395500 file /hadoop/dfs/current/subdir7/subdir26/blk_-9213431071914219029

This is then followed by the same message for all the blocks in the dfs/current 
directory.  It seems to me Hadoop just wants to invalidate all the blocks on 
that box by deleting all of them.


On Tue, Apr 12, 2011 at 9:17 AM, Harsh J <ha...@cloudera.com> wrote:
Apart from checking everything else that others here have said, are your
mapred.local.dir and dfs.data.dir pointing to the same directory by
any chance? If so, the tasktracker could potentially wipe out the
datanode directories when it is restarted.

On Tue, Apr 12, 2011 at 8:16 PM, felix gao <gre1...@gmail.com> wrote:
> What reason/condition would cause a datanode's blocks to be removed?  Our
> cluster had one of its datanodes crash because of bad RAM.  After the
> system was upgraded and the datanode/tasktracker brought back online the next
> day, we noticed the amount of space utilized was minimal and the cluster was
> rebalancing blocks to the datanode.  It would seem the prior blocks were
> removed.  Was this because the datanode was declared dead?  What are the
> criteria for the namenode to decide (assuming it's the namenode) when a
> datanode should remove its prior blocks?

--
Harsh J

