kuper created HDFS-17798:
----------------------------
             Summary: Corrupt replicas in a mini cluster cannot be
automatically re-replicated
Key: HDFS-17798
URL: https://issues.apache.org/jira/browse/HDFS-17798
Project: Hadoop HDFS
Issue Type: Bug
Components: block placement
Affects Versions: 3.3.6
Reporter: kuper
Assignee: kuper
* In a 3-datanode cluster holding a 3-replica block, if one replica on a node
becomes corrupted on disk (and the corruption did not occur during the write
process), the result is:
** The corrupted replica cannot be removed from the damaged node.
** Because a replica is missing, replication reconstruction tasks will
continuously attempt to re-replicate the block.
** However, during reconstruction, nodes that already host a replica of this
block are excluded, which means all 3 datanodes are excluded.
** No suitable target node can therefore be selected for replication,
eventually creating a {*}vicious cycle{*}.
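The exclusion behavior described above can be sketched with a toy simulation. This is not HDFS code; the class and method names are illustrative, and the target-selection rule is reduced to the single exclusion described in this report:

{code:java}
import java.util.*;

public class ExclusionCycleSketch {

  // Hypothetical stand-in for reconstruction target selection: any node
  // already hosting a replica of the block (even a corrupted one) is excluded.
  public static List<String> chooseTargets(Set<String> allNodes,
                                           Set<String> nodesWithReplica) {
    List<String> targets = new ArrayList<>();
    for (String node : allNodes) {
      if (!nodesWithReplica.contains(node)) {
        targets.add(node);
      }
    }
    return targets;
  }

  public static void main(String[] args) {
    Set<String> cluster = new HashSet<>(Arrays.asList("dn0", "dn1", "dn2"));
    // dn0 holds the corrupted replica; dn1 and dn2 hold healthy replicas.
    // All three nodes still count as replica holders, so all are excluded.
    Set<String> holders = new HashSet<>(cluster);
    List<String> targets = chooseTargets(cluster, holders);
    System.out.println("candidate targets: " + targets.size()); // prints "candidate targets: 0"
  }
}{code}
With zero candidate targets the reconstruction task can never place a new replica, and the corrupted replica on dn0 is never replaced, which is the cycle described above.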
*Reproduction*
* Execute the following steps, in order, in a 3-datanode cluster:
** Find a healthy block with three replicas and corrupt the replica file on one
of its datanodes.
** After the datanode's directory scan cycle runs, the damaged replica is
detected but the block cannot be reconstructed from the other replicas.
* Alternatively, add the following test,
testMiniClusterCannotReconstructionWhileReplicaAnomaly, to TestBlockManager:
{code:java}
@Test(timeout = 60000)
public void testMiniClusterCannotReconstructionWhileReplicaAnomaly()
    throws IOException, InterruptedException, TimeoutException {
  Configuration conf = new HdfsConfiguration();
  conf.setInt("dfs.datanode.directoryscan.interval",
      DN_DIRECTORYSCAN_INTERVAL);
  conf.setInt("dfs.namenode.replication.interval", 1);
  conf.setInt("dfs.heartbeat.interval", 1);
  String src = "/test-reconstruction";
  Path file = new Path(src);
  MiniDFSCluster cluster =
      new MiniDFSCluster.Builder(conf).numDataNodes(3).build();
  try {
    cluster.waitActive();
    FSNamesystem fsn = cluster.getNamesystem();
    BlockManager bm = fsn.getBlockManager();
    FSDataOutputStream out = null;
    FileSystem fs = cluster.getFileSystem();
    try {
      out = fs.create(file);
      for (int i = 0; i < 1024; i++) {
        out.write(i);
      }
      out.hflush();
    } finally {
      IOUtils.closeStream(out);
    }
    FSDataInputStream in = null;
    ExtendedBlock oldBlock = null;
    try {
      in = fs.open(file);
      oldBlock = DFSTestUtil.getAllBlocks(in).get(0).getBlock();
    } finally {
      IOUtils.closeStream(in);
    }
    // Truncate both the block file and its meta file on the first datanode
    // to simulate on-disk corruption that happened after the write.
    DataNode dn = cluster.getDataNodes().get(0);
    String blockPath =
        dn.getFSDataset().getBlockLocalPathInfo(oldBlock).getBlockPath();
    String metaBlockPath =
        dn.getFSDataset().getBlockLocalPathInfo(oldBlock).getMetaPath();
    Files.write(Paths.get(blockPath), Collections.emptyList());
    Files.write(Paths.get(metaBlockPath), Collections.emptyList());
    cluster.restartDataNode(0, true);
    cluster.waitDatanodeConnectedToActive(dn, 60000);
    while (!dn.isDatanodeFullyStarted()) {
      Thread.sleep(1000);
    }
    // Wait one directory scan cycle so the scanner notices the corruption.
    Thread.sleep(DN_DIRECTORYSCAN_INTERVAL * 1000);
    cluster.triggerBlockReports();
    BlockInfo bi = bm.getStoredBlock(oldBlock.getLocalBlock());
    assertTrue(bm.isNeededReconstruction(bi,
        bm.countNodes(bi, cluster.getNamesystem().isInStartupSafeMode())));
    BlockReconstructionWork reconstructionWork = null;
    fsn.readLock();
    try {
      reconstructionWork = bm.scheduleReconstruction(bi, 3);
    } finally {
      fsn.readUnlock();
    }
    assertNotNull(reconstructionWork);
    // All 3 datanodes already hold a replica of the block, so all of them
    // end up in containingNodes and are excluded as targets.
    assertEquals(3, reconstructionWork.getContainingNodes().size());
  } finally {
    if (cluster != null) {
      cluster.shutdown();
    }
  }
} {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)