[ https://issues.apache.org/jira/browse/YARN-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335408#comment-15335408 ]
Wangda Tan commented on YARN-5214:
----------------------------------

Thanks [~djp]. I think the R/W lock added by the patch can generally reduce the time spent on locking. However, it may not solve the entire problem: per my understanding, even after the R/W lock change, when anything bad happens on the disks, DirectoryCollection will still be stuck under the write lock, so NodeStatusUpdater will be blocked as well.

I think there are two fixes we can make to tackle the problem:
1) In the short term, since errorDirs/fullDirs/localDirs are copy-on-write lists, we don't need to acquire the lock in getGoodDirs/getFailedDirs/getFullDirs. This could return inconsistent data in rare cases, but in general it is safe, and the inconsistency will be corrected on the next heartbeat. (A sketch of this follows below.)
2) In the longer term, we may need to treat a DirectoryCollection stuck under busy IO as an unhealthy state; NodeStatusUpdater should be able to report such a status to the RM, so the RM will avoid allocating any new containers to such nodes. [~nroberts] suggested the same thing.
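To make (1) concrete, here is a minimal sketch of lock-free getters over copy-on-write lists. The class and field names mirror DirectoryCollection but are simplified; this is an illustration of the idea under stated assumptions, not the actual YARN code or patch.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch only: a simplified stand-in for DirectoryCollection.
class DirCollectionSketch {
  // The disk checker mutates these; CopyOnWriteArrayList makes every
  // read/iteration safe without holding the collection's monitor.
  private final List<String> localDirs = new CopyOnWriteArrayList<>();
  private final List<String> errorDirs = new CopyOnWriteArrayList<>();
  private final List<String> fullDirs  = new CopyOnWriteArrayList<>();

  // No synchronization: a caller may observe a momentarily stale or
  // inconsistent view (e.g. a dir missing from both lists mid-update),
  // but the next NM heartbeat reports the corrected state.
  List<String> getGoodDirs() {
    return Collections.unmodifiableList(localDirs);
  }

  List<String> getFailedDirs() {
    // Snapshot-and-concatenate; never blocks on the disk checker.
    List<String> failed = new ArrayList<>(errorDirs);
    failed.addAll(fullDirs);
    return Collections.unmodifiableList(failed);
  }
}
{code}

The trade-off is exactly the one noted above: the heartbeat path never blocks on the disk checker, at the cost of occasionally seeing a snapshot that is one update behind.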
Thoughts?

> Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
> ---------------------------------------------------------------------------------------------
>
>                 Key: YARN-5214
>                 URL: https://issues.apache.org/jira/browse/YARN-5214
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-5214.patch
>
> In one cluster, we noticed that the NM's heartbeats to the RM suddenly stopped; after a while the node was marked LOST by the RM. From the logs, the NM daemon was still running, but jstack shows the NM's NodeStatusUpdater thread was blocked:
> 1. The Node Status Updater thread is blocked on 0x000000008065eae8:
> {noformat}
> "Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa waiting for monitor entry [0x00007f035945a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
> 	- waiting to lock <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
> 	at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> 2. The actual holder of this lock is DiskHealthMonitor:
> {noformat}
> "DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000 nid=0x26bd runnable [0x00007f035e511000]
>    java.lang.Thread.State: RUNNABLE
> 	at java.io.UnixFileSystem.createDirectory(Native Method)
> 	at java.io.File.mkdir(File.java:1316)
> 	at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
> 	at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
> 	at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
> 	at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
> 	at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
> 	- locked <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
> 	at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
> 	at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
> 	at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
> 	at java.util.TimerThread.mainLoop(Timer.java:555)
> 	at java.util.TimerThread.run(Timer.java:505)
> {noformat}
> These disk operations can take much longer than expected, especially under high IO throughput, so we should use fine-grained locking for the related operations here.
> The same issue was raised and fixed on HDFS in HDFS-7489, and we probably should apply a similar fix here.
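As background for the fine-grained locking the description calls for (and for the caveat in the comment above that a write lock held across the disk probes still stalls readers), here is a minimal sketch of one possible shape: do the slow IO outside any lock, then take the write lock only briefly to publish the result. probeDisksSlowly is a hypothetical placeholder, not a real DirectoryCollection method, and this is not the attached patch.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: a hypothetical RW-locked variant of DirectoryCollection.
class RWLockedDirCollection {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private List<String> errorDirs = new ArrayList<>();

  // Heartbeat-path reader: blocks only while a writer is publishing
  // results, not for the duration of the disk probes themselves.
  List<String> getFailedDirs() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(errorDirs);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Disk checker: run the slow mkdir/IO probes OUTSIDE any lock, then
  // take the write lock only to swap in the computed result.
  void checkDirs() {
    List<String> newErrorDirs = probeDisksSlowly(); // unlocked IO
    lock.writeLock().lock();
    try {
      errorDirs = newErrorDirs;
    } finally {
      lock.writeLock().unlock();
    }
  }

  private List<String> probeDisksSlowly() {
    // Placeholder for DiskChecker.checkDir()-style probing of each dir.
    return new ArrayList<>();
  }
}
{code}

If the probes are instead performed while holding the write lock, readers such as NodeStatusUpdater still stall whenever a disk misbehaves, which is precisely the failure mode described in this issue.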