[jira] [Resolved] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop
[ https://issues.apache.org/jira/browse/HDFS-16583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang resolved HDFS-16583.
------------------------------------
    Resolution: Fixed

> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -------------------------------------------------------------
>
>                 Key: HDFS-16583
>                 URL: https://issues.apache.org/jira/browse/HDFS-16583
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.2.4, 3.3.4
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the namenode got
> stuck for about 6 hours. The logs give:
> {code}
> 2022-05-15 01:09:25,490 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested
>
> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> Number of suppressed write-lock reports: 0
> Longest write-lock held interval: 17492614
> {code}
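The stack in that lock report shows why a stuck monitor is so damaging: Monitor.run() does all of its scanning under the FSNamesystem write lock (17492614 ms is roughly 4.9 hours of hold time), so a check() that never returns blocks every writer on the namenode. Below is a minimal sketch of that hold pattern with hypothetical names; the real code goes through FSNamesystemLock, for which a plain ReentrantReadWriteLock stands in here.

{code}
// Sketch only: hypothetical class and field names, modelled on the
// lock/unlock pattern implied by the stack trace above
// (Monitor.run -> FSNamesystem.writeUnlock).
import java.util.concurrent.locks.ReentrantReadWriteLock;

class MonitorSketch implements Runnable {
  private final ReentrantReadWriteLock namesystemLock =
      new ReentrantReadWriteLock();

  @Override
  public void run() {
    namesystemLock.writeLock().lock(); // the whole scan runs under the write lock
    try {
      check(); // if this loops forever, the unlock below is never reached
    } finally {
      namesystemLock.writeLock().unlock(); // the writeUnlock frame in the stack trace
    }
  }

  private void check() {
    // scan decommissioning / maintenance nodes; see the real excerpt below
  }
}
{code}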
>
> We only have the one thread dump triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
>     org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
>     org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
>     java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
> private void check() {
>   final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
>       it = new CyclicIteration<>(outOfServiceNodeBlocks,
>           iterkey).iterator();
>   final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
>
>   while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
>       .isRunning()) {
>     numNodesChecked++;
>     final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
>         entry = it.next();
>     final DatanodeDescriptor dn = entry.getKey();
>     AbstractList<BlockInfo> blocks = entry.getValue();
>     boolean fullScan = false;
>     if (dn.isMaintenance() &&
> {code}
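The iterator driving this loop is org.apache.hadoop.util.CyclicIteration, which walks a NavigableMap starting from the entry after iterkey, wraps past the end of the map, and stops only when it gets back to where it started. The sketch below is a simplified stand-alone mimic of that wrap-around scan, not the real hadoop-common class: the point is that termination rests entirely on the cycle-end check at the bottom of the loop.

{code}
// Simplified stand-alone mimic of org.apache.hadoop.util.CyclicIteration;
// illustration only, the real class lives in hadoop-common.
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class CyclicScanSketch {

  /** Visits every entry once, starting after startKey and wrapping around. */
  static <K, V> void cyclicScan(NavigableMap<K, V> map, K startKey) {
    if (map.isEmpty()) {
      return;
    }
    Map.Entry<K, V> first = map.higherEntry(startKey); // first entry after startKey
    if (first == null) {
      first = map.firstEntry(); // startKey was past the last key: wrap to the front
    }
    Map.Entry<K, V> e = first;
    do {
      System.out.println(e.getKey() + " -> " + e.getValue());
      e = map.higherEntry(e.getKey());
      if (e == null) {
        e = map.firstEntry(); // wrap around at the end of the map
      }
    } while (!e.getKey().equals(first.getKey())); // stop once back at the start
  }

  public static void main(String[] args) {
    NavigableMap<String, Integer> nodes = new TreeMap<>();
    nodes.put("dn-a", 1);
    nodes.put("dn-b", 2);
    nodes.put("dn-c", 3);
    cyclicScan(nodes, "dn-b"); // prints dn-c, dn-a, dn-b
  }
}
{code}

In the real check(), iterkey appears to be the key where the previous pass stopped (for example when exceededNumBlocksPerCheck() throttles the scan), so each scheduled run resumes the cyclic walk where the last one left off rather than always starting from the first node.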