[ https://issues.apache.org/jira/browse/HDFS-16583?focusedWorklogId=772991&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-772991 ]

ASF GitHub Bot logged work on HDFS-16583:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/May/22 18:58
            Start Date: 20/May/22 18:58
    Worklog Time Spent: 10m 
      Work Description: hadoop-yetus commented on PR #4332:
URL: https://github.com/apache/hadoop/pull/4332#issuecomment-1133217136

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   1m  3s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  37m 50s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 44s |  |  trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 33s |  |  trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   1m 24s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 44s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 25s |  |  trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 45s |  |  trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 45s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  23m 14s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 21s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 25s |  |  the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   0m 59s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4332/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 6 unchanged - 0 fixed = 8 total (was 6)  |
   | +1 :green_heart: |  mvnsite  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 57s |  |  the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 21s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 39s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 432m 39s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4332/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 13s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 542m 33s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4332/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4332 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 56a95fee91d3 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 18f25f515a0f64ec80ba590de915bf615c4f27cc |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4332/2/testReport/ |
   | Max. process+thread count | 2466 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4332/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
-------------------

    Worklog Id:     (was: 772991)
    Time Spent: 50m  (was: 40m)

> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -------------------------------------------------------------
>
>                 Key: HDFS-16583
>                 URL: https://issues.apache.org/jira/browse/HDFS-16583
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the NameNode got
> stuck for about 6 hours. The logs show:
> {code}
> 2022-05-15 01:09:25,490 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process Thread Dump: jsp requested
> <Thread dump>
> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>       Number of suppressed write-lock reports: 0
>       Longest write-lock held interval: 17492614
> {code}
> We only have one thread dump, triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
>     org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
>     org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
>     java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
>     private void check() {
>       final Iterator<Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>>
>           it = new CyclicIteration<>(outOfServiceNodeBlocks,
>               iterkey).iterator();
>       final LinkedList<DatanodeDescriptor> toRemove = new LinkedList<>();
>       while (it.hasNext() && !exceededNumBlocksPerCheck() && namesystem
>           .isRunning()) {
>         numNodesChecked++;
>         final Map.Entry<DatanodeDescriptor, AbstractList<BlockInfo>>
>             entry = it.next();
>         final DatanodeDescriptor dn = entry.getKey();
>         AbstractList<BlockInfo> blocks = entry.getValue();
>         boolean fullScan = false;
>         if (dn.isMaintenance() && dn.maintenanceExpired()) {
>           // If maintenance expires, stop tracking it.
>           stopMaintenance(dn);
>           toRemove.add(dn);
>           continue;
>         }
>         if (dn.isInMaintenance()) {
>           // The dn is IN_MAINTENANCE and the maintenance hasn't expired yet.
>           continue;  //// >>> This line
>         }
> {code}
> With only one sample we cannot be sure it is stuck in an infinite loop,
> but I suspect it is.
> The problem is twofold:
> 1) When we call stopMaintenance(dn), which we must have done since it
> logged "Stopping maintenance of dead node", the code looks like:
> {code}
>        if (dn.isMaintenance() && dn.maintenanceExpired()) {
>           // If maintenance expires, stop tracking it.
>           stopMaintenance(dn);
>           toRemove.add(dn);
>           continue;
>         }
> {code}
> Notice we add the DN to the toRemove list - this is to avoid concurrent
> modification of the outOfServiceNodeBlocks map. However, inside
> stopMaintenance we see:
> {code}
>       pendingNodes.remove(node);
>       outOfServiceNodeBlocks.remove(node);
> {code}      
> This should cause a ConcurrentModificationException, which I think will
> leave the decommission monitor broken until the NN is restarted
> (again, I am not sure).
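As an aside, both the fail-fast behaviour and the deferred-removal pattern described above can be reproduced with a plain `TreeMap`. This is a minimal, self-contained sketch; the class and method names are illustrative, not Hadoop code:

```java
import java.util.*;

public class DeferredRemovalDemo {

    /** Removing from a TreeMap while iterating over it trips the
     *  fail-fast iterator: the next call to next() throws CME. */
    static boolean directRemovalThrows() {
        NavigableMap<Integer, Integer> map = new TreeMap<>();
        for (int i = 0; i < 4; i++) map.put(i, i);
        try {
            for (Map.Entry<Integer, Integer> e : map.entrySet()) {
                map.remove(e.getKey()); // structural change mid-iteration
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }

    /** The toRemove pattern: collect keys during iteration and
     *  only remove them after the loop, keeping the iterator valid. */
    static NavigableMap<Integer, Integer> deferredRemoval() {
        NavigableMap<Integer, Integer> map = new TreeMap<>();
        for (int i = 0; i < 4; i++) map.put(i, i);
        List<Integer> toRemove = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : map.entrySet()) {
            if (e.getKey() % 2 == 0) toRemove.add(e.getKey());
        }
        toRemove.forEach(map::remove); // safe: iteration has finished
        return map;
    }

    public static void main(String[] args) {
        System.out.println("direct removal threw: " + directRemovalThrows());
        System.out.println("after deferred removal: " + deferredRemoval());
    }
}
```

This is why the toRemove list exists in the monitor; the bug is that stopMaintenance bypasses it by removing from the map directly.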
> 2) In some cases a ConcurrentModificationException is not thrown, because
> the iterator is a CyclicIteration. If the DN removed from maintenance is
> the one after the iterator start key, and it is the second-last element in
> the navigable map, I believe the CyclicIteration will go into an infinite
> loop. No ConcurrentModificationException is thrown because the underlying
> TreeMap iterator has no next element, so CyclicIteration creates a new
> TreeMap iterator; the "stop key" for the cyclic iteration is never found
> in the map, as it has been removed, so the iteration cycles forever. I
> think that is what happened here.
> This test in TestCyclicIteration proves the infinite loop is possible:
> {code}
>   @Test
>   public void testInfiniteLoop() {
>     final NavigableMap<Integer, Integer> map = new TreeMap<Integer, Integer>();
>     map.put(0, 0);
>     map.put(2, 2);
>     map.put(4, 4);
>     map.put(6, 6);
>     System.out.println("map=" + map);
>     // Create iterator to start at 2, which means the first element returned
>     // should be 4.
>     // Inside the treemap iterator, next will already be set to 6
>     // after 4 is returned.
>     final Iterator<Map.Entry<Integer, Integer>>
>         it = new CyclicIteration<>(map, 2).iterator();
>     int i = 0;
>     while (it.hasNext()) {
>       Map.Entry<Integer, Integer> e = it.next();
>       if (i == 0) {
>         map.remove(4);
>       }
>       System.out.println("Iteration #" + i++);
>     }
>   }
> {code}
> The root cause of the decommission monitor problem is the concurrent
> modification of outOfServiceNodeBlocks, so we should avoid that.
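For completeness, the other generic way to mutate a map safely during iteration is to remove entries through the iterator itself with Iterator.remove(), which keeps the iterator valid. Note this is only an illustration with a plain `TreeMap` (the class name is hypothetical): the CyclicIteration wrapper used by the monitor may not expose remove(), which is why the deferred toRemove list is the practical route in the monitor code.

```java
import java.util.*;

public class IteratorRemoveDemo {

    /** Safe in-loop removal: Iterator.remove() updates the map's
     *  modification count through the iterator, so no CME is thrown. */
    static NavigableMap<Integer, Integer> removeWhileIterating() {
        NavigableMap<Integer, Integer> map = new TreeMap<>();
        for (int i = 0; i < 4; i++) map.put(i, i);
        Iterator<Map.Entry<Integer, Integer>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getKey() % 2 == 0) {
                it.remove(); // removes the entry last returned by next()
            }
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(removeWhileIterating());
    }
}
```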



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
