[ https://issues.apache.org/jira/browse/YARN-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363283#comment-15363283 ]
Vinod Kumar Vavilapalli commented on YARN-5214:
-----------------------------------------------

The latest patch looks good to me. +1.

Manually rekicking Jenkins, as the patch has been around for a while and trunk may have moved on.

> Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-5214
>                 URL: https://issues.apache.org/jira/browse/YARN-5214
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-5214-v2.patch, YARN-5214-v3.patch, YARN-5214.patch
>
>
> In one cluster, we noticed that the NM's heartbeat to the RM suddenly stopped and,
> after a while, the RM marked the node LOST. From the log, the NM daemon is still
> running, but jstack shows that the NM's NodeStatusUpdater thread is blocked:
>
> 1. The Node Status Updater thread is blocked waiting for 0x000000008065eae8
> {noformat}
> "Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa waiting for monitor entry [0x00007f035945a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
>         - waiting to lock <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
>
> 2. The actual holder of this lock is DiskHealthMonitor:
> {noformat}
> "DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000 nid=0x26bd runnable [0x00007f035e511000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.UnixFileSystem.createDirectory(Native Method)
>         at java.io.File.mkdir(File.java:1316)
>         at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
>         - locked <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
> {noformat}
>
> This disk operation can take longer than expected, especially under high IO
> throughput, and we should use finer-grained locking for the related operations here.
> The same issue was raised and fixed on the HDFS side in HDFS-7489, and we should
> probably apply a similar fix here.
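
For readers following the thread, the finer-grained locking the description asks for could look roughly like the sketch below. This is not the YARN-5214 patch and not the real DirectoryCollection: the class name DirCheckSketch, the diskCheck predicate (a stand-in for the slow mkdir probe seen in the second stack trace), and the method bodies are all hypothetical. The idea it illustrates is only this: run the slow disk check with no lock held, and take a lock just long enough to snapshot or publish the directory lists, so a reader such as getFailedDirs never waits on disk I/O.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Predicate;

/** Hypothetical sketch only; not the actual Hadoop DirectoryCollection. */
final class DirCheckSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private List<String> goodDirs;
    private List<String> failedDirs = new ArrayList<>();
    // Assumption: stands in for the slow on-disk probe (mkdir etc.).
    private final Predicate<String> diskCheck;

    DirCheckSketch(List<String> dirs, Predicate<String> diskCheck) {
        this.goodDirs = new ArrayList<>(dirs);
        this.diskCheck = diskCheck;
    }

    /** Readers hold the read lock only while copying a list. */
    List<String> getFailedDirs() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(failedDirs);
        } finally {
            lock.readLock().unlock();
        }
    }

    List<String> getGoodDirs() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(goodDirs);
        } finally {
            lock.readLock().unlock();
        }
    }

    void checkDirs() {
        // Step 1: snapshot the directories to test under the read lock.
        List<String> all;
        lock.readLock().lock();
        try {
            all = new ArrayList<>(goodDirs);
            all.addAll(failedDirs);
        } finally {
            lock.readLock().unlock();
        }

        // Step 2: do the potentially slow disk I/O with no lock held.
        List<String> good = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        for (String dir : all) {
            if (diskCheck.test(dir)) {
                good.add(dir);
            } else {
                failed.add(dir);
            }
        }

        // Step 3: publish the results under a short write lock.
        lock.writeLock().lock();
        try {
            goodDirs = good;
            failedDirs = failed;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With this shape, a heartbeat thread calling getFailedDirs() blocks at most for the duration of a list swap, even if a directory probe in step 2 stalls. One trade-off of the sketch is that readers may see results from the previous check while a new one is in progress.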