[ https://issues.apache.org/jira/browse/YARN-8672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582661#comment-16582661 ]
Jason Lowe commented on YARN-8672:
----------------------------------

Here's some sample output showing the localization failure which I believe leads to the test timeout:
{noformat}
2018-08-14 23:57:37,636 INFO [NM ContainerManager dispatcher] containermanager.AuxServices (AuxServices.java:handle(350)) - Got event CONTAINER_INIT for appId application_0_0000
2018-08-14 23:57:37,636 INFO [NM ContainerManager dispatcher] localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(789)) - Created localizer for container_0_0000_01_000000
2018-08-14 23:57:37,642 INFO [LocalizerRunner for container_0_0000_01_000000] localizer.ResourceLocalizationService (ResourceLocalizationService.java:writeCredentials(1315)) - Writing credentials to the nmPrivate file /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens
2018-08-14 23:57:37,645 INFO [LocalizerRunner for container_0_0000_01_000000] nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:createUserCacheDirs(836)) - Initializing user nobody
2018-08-14 23:57:37,662 INFO [LocalizerRunner for container_0_0000_01_000000] nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:startLocalizer(166)) - Copying from /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens to /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/usercache/nobody/appcache/application_0_0000/container_0_0000_01_000000.tokens
2018-08-14 23:57:37,663 INFO [LocalizerRunner for container_0_0000_01_000000] nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:startLocalizer(174)) - Localizer CWD set to /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/usercache/nobody/appcache/application_0_0000 = file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/usercache/nobody/appcache/application_0_0000
2018-08-14 23:57:37,704 INFO [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2109)) - Container container_0_0000_01_000000 transitioned from LOCALIZING to SCHEDULED
2018-08-14 23:57:37,705 INFO [NM ContainerManager dispatcher] scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(541)) - Starting container [container_0_0000_01_000000]
2018-08-14 23:57:37,733 INFO [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2109)) - Container container_0_0000_01_000000 transitioned from SCHEDULED to RUNNING
2018-08-14 23:57:37,734 INFO [NM ContainerManager dispatcher] monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:onStartMonitoringContainer(1013)) - Starting resource-monitoring for container_0_0000_01_000000
2018-08-14 23:57:37,771 INFO [ContainersLauncher #0] nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExecutor(370)) - launchContainer: [bash, /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/usercache/nobody/appcache/application_0_0000/container_0_0000_01_000000/default_container_executor.sh]
2018-08-14 23:57:38,635 INFO [main] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:getContainerStatusInternal(1455)) - Getting container-status for container_0_0000_01_000000
2018-08-14 23:57:38,636 INFO [main] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:getContainerStatusInternal(1469)) - Returning ContainerStatus: [ContainerId: container_0_0000_01_000000, ExecutionType: GUARANTEED, State: RUNNING, Capability: <memory:1024, vCores:1>, Diagnostics: , ExitStatus: -1000, IP: null, Host: null, ContainerSubState: RUNNING]
2018-08-14 23:57:38,636 INFO [main] containermanager.TestContainerManager (BaseContainerManagerTest.java:waitForContainerState(338)) - Waiting for container to get into one of states [RUNNING]. Current state is RUNNING
2018-08-14 23:57:38,636 INFO [main] containermanager.TestContainerManager (BaseContainerManagerTest.java:waitForContainerState(343)) - Container state is RUNNING
2018-08-14 23:57:38,651 INFO [NM ContainerManager dispatcher] localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(783)) - New REQUEST_RESOURCE_LOCALIZATION localize request for container_0_0000_01_000000, remove old private localizer.
2018-08-14 23:57:38,651 INFO [NM ContainerManager dispatcher] localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(789)) - Created localizer for container_0_0000_01_000000
2018-08-14 23:57:38,656 INFO [LocalizerRunner for container_0_0000_01_000000] localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(1252)) - Localizer failed for container_0_0000_01_000000
ExitCodeException exitCode=1: chmod: cannot access '/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens': No such file or directory
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)
	at org.apache.hadoop.util.Shell.run(Shell.java:901)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289)
	at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:865)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:252)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:232)
	at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:331)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:320)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:351)
	at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1279)
	at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:100)
	at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:353)
	at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:400)
	at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:605)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:696)
	at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:692)
	at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
	at org.apache.hadoop.fs.FileContext.create(FileContext.java:698)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1314)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1229)
{noformat}
Note the failure in the above stack trace is for the chmod that occurs immediately after creating a local file for an output stream, implying something asynchronously came along and removed the file. When it fails it doesn't always fail with the exact same stacktrace, but the common theme is trying to access the container tokens file at some point when it's missing.
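
To make the suspected pattern concrete, here is a minimal standalone sketch of a create-then-chmod sequence losing to a concurrent delete. It is not NodeManager code: the class `TokenFileRace`, its methods, and the file names are hypothetical. It uses plain `java.nio.file` rather than Hadoop's `RawLocalFileSystem`, and it assumes a POSIX filesystem. The delete is forced into the window between the two steps so the failure is deterministic, whereas in the real test the timing is presumably what makes it intermittent.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

public class TokenFileRace {

  /**
   * Creates the file, runs the given hook, then sets permissions.
   * Returns true if the permission step failed because the file was gone.
   * The hook stands in for whatever else runs between the two steps.
   */
  public static boolean chmodFailsAfterCreate(Path tokens, Runnable betweenSteps)
      throws IOException {
    Files.write(tokens, new byte[] {1, 2, 3});   // step 1: create the file
    betweenSteps.run();                          // window for another thread
    try {
      // step 2: restrict permissions, analogous to the chmod in the trace
      Files.setPosixFilePermissions(tokens,
          PosixFilePermissions.fromString("rw-------"));
      return false;
    } catch (NoSuchFileException e) {
      return true;                               // file vanished in between
    }
  }

  /** Hypothetical cleaner: deletes the path on a separate thread and waits for it. */
  public static Runnable asyncDelete(Path p) {
    return () -> {
      Thread cleaner = new Thread(() -> {
        try {
          Files.deleteIfExists(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
      cleaner.start();
      try {
        cleaner.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("nmPrivate");
    Path undisturbed = dir.resolve("a.tokens");
    Path raced = dir.resolve("b.tokens");
    System.out.println("no cleaner, chmod failed:   "
        + chmodFailsAfterCreate(undisturbed, () -> { }));
    System.out.println("with cleaner, chmod failed: "
        + chmodFailsAfterCreate(raced, asyncDelete(raced)));
  }
}
```

On a POSIX filesystem the second call should report the failure, mirroring the "No such file or directory" from chmod in the log above. If something like this is what happens in the NodeManager, the cleanup of the old localizer's tokens would need to be ordered before, or otherwise kept from overlapping with, the new localizer's write of the same path.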

> TestContainerManager#testLocalingResourceWhileContainerRunning occasionally times out
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-8672
>                 URL: https://issues.apache.org/jira/browse/YARN-8672
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.2.0
>            Reporter: Jason Lowe
>            Priority: Major
>
> Precommit builds have been failing in
> TestContainerManager#testLocalingResourceWhileContainerRunning. I have been
> able to reproduce the problem without any patch applied if I run the test
> enough times. It looks like something is removing container tokens from the
> nmPrivate area just as a new localizer starts.