[ https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zian Chen updated YARN-8193:
----------------------------
    Description:
When running massive queries successively, at some point the RM just hangs and stops allocating resources. There is sufficient space given to yarn.nodemanager.local-dirs (this is not a node health issue; the RM did not report any node as unhealthy). There is no fixed trigger for this (query or operation). The problem goes away on restarting the ResourceManager. No NM restart is required.

At the point where the RM hangs, YARN throws a NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.

  was:
We were running TPCDS queries successively and at some point the RM just hangs and stops allocating resources. There is sufficient space given to yarn.nodemanager.local-dirs (this is not a node health issue; the RM did not report any node as unhealthy). There is no fixed trigger for this (query or operation). The problem goes away on restarting the ResourceManager. No NM restart is required.

I have attached RM logs. The application that finished just before the current one is application_1522225930059_0379.

The current application (the one that hangs) is assigned application number application_1522225930059_0380.


> YARN RM hangs abruptly (stops allocating resources) when running successive
> applications.
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-8193
>                 URL: https://issues.apache.org/jira/browse/YARN-8193
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Zian Chen
>            Assignee: Zian Chen
>            Priority: Critical
>
> When running massive queries successively, at some point the RM just hangs
> and stops allocating resources.
> There is sufficient space given to yarn.nodemanager.local-dirs (this is not
> a node health issue; the RM did not report any node as unhealthy). There is
> no fixed trigger for this (query or operation). The problem goes away on
> restarting the ResourceManager.
> No NM restart is required.
> At the point where the RM hangs, YARN throws a NullPointerException at
> RegularContainerAllocator.getLocalityWaitFactor.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)