Hi Hadoopers,

I opened a discussion on the core-users list about the replication level. Whenever a data node dies, can all the blocks (files) stored on that node be considered lost?
If that node never comes back, or takes a very long time to be ready again, some files will have their replication level compromised. Shouldn't there be a daemon, or shouldn't it be part of the name node's responsibilities, to recover from that failure? My point is that whenever a data node is gone, a replication process should start in order to restore the replication level of all the files that have lost a replica. The replication level would then be guaranteed.

If the faulty node comes back during the recovery process, it should not be considered part of the data node group until the process is over. The file system would then add the data node and free all the blocks it contains. Alternatively, the node could be allowed to join, deleting from it only the files that were modified while it was down and the files that have already been re-replicated. That would save time and bandwidth, but the process would be more complex.

Cheers,
Alfonso
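
P.S. To make the proposal a bit more concrete, here is a rough sketch of what I have in mind. It is a toy model only, not Hadoop code: the class ToyNameNode and the methods handle_dead_node and handle_rejoin are names I made up for illustration, and I am assuming the name node keeps a simple in-memory map from block IDs to the data nodes holding them. It covers the simpler variant, where a returning node drops every block that is either unknown to the name node or already back at full replication. It does not track which files were modified while the node was down.

```python
from collections import defaultdict


class ToyNameNode:
    """Hypothetical, in-memory stand-in for a name node (not the Hadoop API)."""

    def __init__(self, replication=3):
        self.replication = replication
        self.live_nodes = set()
        # block id -> set of data nodes currently holding a replica
        self.block_locations = defaultdict(set)

    def add_node(self, node):
        self.live_nodes.add(node)

    def add_block(self, block, nodes):
        self.block_locations[block] = set(nodes)

    def handle_dead_node(self, dead):
        """Drop the dead node and plan copies to restore the replication level.

        Returns a list of (block, source_node, target_node) copy tasks.
        """
        self.live_nodes.discard(dead)
        copies = []
        for block, holders in sorted(self.block_locations.items()):
            holders.discard(dead)
            if not holders:
                continue  # no surviving replica, nothing to copy from
            source = min(holders)  # any surviving holder would do
            missing = max(self.replication - len(holders), 0)
            candidates = sorted(self.live_nodes - holders)
            for target in candidates[:missing]:
                holders.add(target)
                copies.append((block, source, target))
        return copies

    def handle_rejoin(self, node, reported_blocks):
        """Re-admit a node; return the blocks it should delete locally."""
        self.live_nodes.add(node)
        to_delete = []
        for block in sorted(reported_blocks):
            holders = self.block_locations.get(block)
            if holders is None or len(holders) >= self.replication:
                # Unknown block, or already fully replicated elsewhere.
                to_delete.append(block)
            else:
                holders.add(node)  # the old replica is still useful
        return to_delete


if __name__ == "__main__":
    nn = ToyNameNode(replication=2)
    for n in ("dn1", "dn2", "dn3"):
        nn.add_node(n)
    nn.add_block("blk_1", ["dn1", "dn2"])
    nn.add_block("blk_2", ["dn2", "dn3"])
    print(nn.handle_dead_node("dn2"))
    print(nn.handle_rejoin("dn2", ["blk_1", "blk_2"]))
```

In this sketch the returning node is only re-admitted through handle_rejoin, so it never counts as a replica holder while the recovery copies are being planned. A real implementation would also have to deal with copies that fail midway, rack placement, and throttling the recovery traffic.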
