[
https://issues.apache.org/jira/browse/HDFS-17798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023351#comment-18023351
]
ASF GitHub Bot commented on HDFS-17798:
---------------------------------------
github-actions[bot] commented on PR #7749:
URL: https://github.com/apache/hadoop/pull/7749#issuecomment-3342139821
We're closing this stale PR because it has been open for 100 days with no
activity. This isn't a judgement on the merit of the PR in any way. It's just a
way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working
on it, please feel free to re-open it and ask for a committer to remove the
stale tag and review again.
Thanks all for your contribution.
> The problem that bad replicas in the mini cluster cannot be automatically
> replicated
> ------------------------------------------------------------------------------------
>
> Key: HDFS-17798
> URL: https://issues.apache.org/jira/browse/HDFS-17798
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: block placement
> Affects Versions: 3.3.6
> Reporter: kuper
> Assignee: kuper
> Priority: Major
> Labels: pull-request-available
>
> * In a 3-datanode cluster with a 3-replica block, if one replica on a node
> becomes corrupted (and this corruption did not occur during the write
> process), it will result in:
> ** The corrupted replica cannot be removed from the damaged node.
> ** Because a replica is now missing, replication reconstruction tasks
> continuously try to re-replicate the block.
> ** However, during reconstruction, nodes that already host a replica of this
> block are excluded as targets, which means all 3 datanodes are excluded.
> ** No suitable target node can therefore be selected for replication,
> eventually creating a {*}vicious cycle{*}.
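> The cycle above can be sketched with a toy target-selection loop. This is an
> illustrative sketch only; {{ExclusionCycleSketch}}, {{chooseTarget}}, and the
> node names are made up for the example and are not Hadoop APIs, but the
> exclusion behavior mirrors the one described: every node that already
> contains a replica (the corrupt one included) is excluded, so with 3 nodes
> and 3 replicas no target remains.
> {code:java}
> import java.util.*;
>
> public class ExclusionCycleSketch {
>   // Hypothetical stand-in for target selection: return any node that is
>   // not excluded, or null when no node qualifies.
>   static String chooseTarget(List<String> allNodes, Set<String> excluded) {
>     for (String node : allNodes) {
>       if (!excluded.contains(node)) {
>         return node;
>       }
>     }
>     return null; // no suitable target, so the work is rescheduled forever
>   }
>
>   public static void main(String[] args) {
>     List<String> cluster = Arrays.asList("dn0", "dn1", "dn2");
>     // dn0 holds the corrupt replica; dn1 and dn2 hold live replicas.
>     // Reconstruction excludes every node already containing the block,
>     // corrupt replicas included, so all 3 nodes end up excluded.
>     Set<String> excluded = new HashSet<>(cluster);
>     String target = chooseTarget(cluster, excluded);
>     System.out.println("chosen target: " + target); // prints "chosen target: null"
>   }
> } {code}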
> *Reproduction*
> * Execute the following steps, in order, on a 3-datanode cluster:
> ** Find a healthy block with three replicas and corrupt the replica file on
> one of its datanodes.
> ** After the next datanode directory scan cycle, the corrupt replica is
> detected but is never reconstructed from the remaining healthy replicas.
> * Alternatively, add the test
> testMiniClusterCannotReconstructionWhileReplicaAnomaly to TestBlockManager:
>
> {code:java}
> @Test(timeout = 60000)
> public void testMiniClusterCannotReconstructionWhileReplicaAnomaly()
>     throws IOException, InterruptedException, TimeoutException {
>   Configuration conf = new HdfsConfiguration();
>   conf.setInt("dfs.datanode.directoryscan.interval",
>       DN_DIRECTORYSCAN_INTERVAL);
>   conf.setInt("dfs.namenode.replication.interval", 1);
>   conf.setInt("dfs.heartbeat.interval", 1);
>   String src = "/test-reconstruction";
>   Path file = new Path(src);
>   MiniDFSCluster cluster =
>       new MiniDFSCluster.Builder(conf).numDataNodes(3).build();
>   try {
>     cluster.waitActive();
>     FSNamesystem fsn = cluster.getNamesystem();
>     BlockManager bm = fsn.getBlockManager();
>
>     // Write a small file so it has a single block with 3 replicas.
>     FSDataOutputStream out = null;
>     FileSystem fs = cluster.getFileSystem();
>     try {
>       out = fs.create(file);
>       for (int i = 0; i < 1024; i++) {
>         out.write(i);
>       }
>       out.hflush();
>     } finally {
>       IOUtils.closeStream(out);
>     }
>
>     FSDataInputStream in = null;
>     ExtendedBlock oldBlock = null;
>     try {
>       in = fs.open(file);
>       oldBlock = DFSTestUtil.getAllBlocks(in).get(0).getBlock();
>     } finally {
>       IOUtils.closeStream(in);
>     }
>
>     // Corrupt the replica on the first datanode by truncating both the
>     // block file and its meta file.
>     DataNode dn = cluster.getDataNodes().get(0);
>     String blockPath =
>         dn.getFSDataset().getBlockLocalPathInfo(oldBlock).getBlockPath();
>     String metaBlockPath =
>         dn.getFSDataset().getBlockLocalPathInfo(oldBlock).getMetaPath();
>     Files.write(Paths.get(blockPath), Collections.emptyList());
>     Files.write(Paths.get(metaBlockPath), Collections.emptyList());
>
>     cluster.restartDataNode(0, true);
>     cluster.waitDatanodeConnectedToActive(dn, 60000);
>     while (!dn.isDatanodeFullyStarted()) {
>       Thread.sleep(1000);
>     }
>     // Wait for the directory scanner to notice the corrupt replica.
>     Thread.sleep(DN_DIRECTORYSCAN_INTERVAL * 1000);
>     cluster.triggerBlockReports();
>
>     BlockInfo bi = bm.getStoredBlock(oldBlock.getLocalBlock());
>     assertTrue(bm.isNeededReconstruction(bi,
>         bm.countNodes(bi, cluster.getNamesystem().isInStartupSafeMode())));
>
>     BlockReconstructionWork reconstructionWork = null;
>     fsn.readLock();
>     try {
>       reconstructionWork = bm.scheduleReconstruction(bi, 3);
>     } finally {
>       fsn.readUnlock();
>     }
>     assertNotNull(reconstructionWork);
>     // All 3 datanodes already contain a replica of the block (corrupt or
>     // live), so all of them are excluded as reconstruction targets.
>     assertEquals(3, reconstructionWork.getContainingNodes().size());
>   } finally {
>     if (cluster != null) {
>       cluster.shutdown();
>     }
>   }
> } {code}
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]