[ 
https://issues.apache.org/jira/browse/YARN-8672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582661#comment-16582661
 ] 

Jason Lowe commented on YARN-8672:
----------------------------------

Here's some sample output showing the localization failure which I believe 
leads to the test timeout:
{noformat}
2018-08-14 23:57:37,636 INFO  [NM ContainerManager dispatcher] 
containermanager.AuxServices (AuxServices.java:handle(350)) - Got event 
CONTAINER_INIT for appId application_0_0000
2018-08-14 23:57:37,636 INFO  [NM ContainerManager dispatcher] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:handle(789)) - Created localizer for 
container_0_0000_01_000000
2018-08-14 23:57:37,642 INFO  [LocalizerRunner for container_0_0000_01_000000] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:writeCredentials(1315)) - Writing credentials 
to the nmPrivate file 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens
2018-08-14 23:57:37,645 INFO  [LocalizerRunner for container_0_0000_01_000000] 
nodemanager.DefaultContainerExecutor 
(DefaultContainerExecutor.java:createUserCacheDirs(836)) - Initializing user 
nobody
2018-08-14 23:57:37,662 INFO  [LocalizerRunner for container_0_0000_01_000000] 
nodemanager.DefaultContainerExecutor 
(DefaultContainerExecutor.java:startLocalizer(166)) - Copying from 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens
 to 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/usercache/nobody/appcache/application_0_0000/container_0_0000_01_000000.tokens
2018-08-14 23:57:37,663 INFO  [LocalizerRunner for container_0_0000_01_000000] 
nodemanager.DefaultContainerExecutor 
(DefaultContainerExecutor.java:startLocalizer(174)) - Localizer CWD set to 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/usercache/nobody/appcache/application_0_0000
 = 
file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/usercache/nobody/appcache/application_0_0000
2018-08-14 23:57:37,704 INFO  [NM ContainerManager dispatcher] 
container.ContainerImpl (ContainerImpl.java:handle(2109)) - Container 
container_0_0000_01_000000 transitioned from LOCALIZING to SCHEDULED
2018-08-14 23:57:37,705 INFO  [NM ContainerManager dispatcher] 
scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(541)) - 
Starting container [container_0_0000_01_000000]
2018-08-14 23:57:37,733 INFO  [NM ContainerManager dispatcher] 
container.ContainerImpl (ContainerImpl.java:handle(2109)) - Container 
container_0_0000_01_000000 transitioned from SCHEDULED to RUNNING
2018-08-14 23:57:37,734 INFO  [NM ContainerManager dispatcher] 
monitor.ContainersMonitorImpl 
(ContainersMonitorImpl.java:onStartMonitoringContainer(1013)) - Starting 
resource-monitoring for container_0_0000_01_000000
2018-08-14 23:57:37,771 INFO  [ContainersLauncher #0] 
nodemanager.DefaultContainerExecutor 
(DefaultContainerExecutor.java:buildCommandExecutor(370)) - launchContainer: 
[bash, 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/usercache/nobody/appcache/application_0_0000/container_0_0000_01_000000/default_container_executor.sh]
2018-08-14 23:57:38,635 INFO  [main] containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:getContainerStatusInternal(1455)) - Getting 
container-status for container_0_0000_01_000000
2018-08-14 23:57:38,636 INFO  [main] containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:getContainerStatusInternal(1469)) - Returning 
ContainerStatus: [ContainerId: container_0_0000_01_000000, ExecutionType: 
GUARANTEED, State: RUNNING, Capability: <memory:1024, vCores:1>, Diagnostics: , 
ExitStatus: -1000, IP: null, Host: null, ContainerSubState: RUNNING]
2018-08-14 23:57:38,636 INFO  [main] containermanager.TestContainerManager 
(BaseContainerManagerTest.java:waitForContainerState(338)) - Waiting for 
container to get into one of states [RUNNING]. Current state is RUNNING
2018-08-14 23:57:38,636 INFO  [main] containermanager.TestContainerManager 
(BaseContainerManagerTest.java:waitForContainerState(343)) - Container state is 
RUNNING
2018-08-14 23:57:38,651 INFO  [NM ContainerManager dispatcher] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:handle(783)) - New 
REQUEST_RESOURCE_LOCALIZATION localize request for container_0_0000_01_000000, 
remove old private localizer.
2018-08-14 23:57:38,651 INFO  [NM ContainerManager dispatcher] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:handle(789)) - Created localizer for 
container_0_0000_01_000000
2018-08-14 23:57:38,656 INFO  [LocalizerRunner for container_0_0000_01_000000] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(1252)) - Localizer failed for 
container_0_0000_01_000000
ExitCodeException exitCode=1: chmod: cannot access '/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens': No such file or directory

        at org.apache.hadoop.util.Shell.runCommand(Shell.java:1008)
        at org.apache.hadoop.util.Shell.run(Shell.java:901)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:865)
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:252)
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:232)
        at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:331)
        at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:320)
        at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:351)
        at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1279)
        at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:100)
        at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:353)
        at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:400)
        at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:605)
        at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:696)
        at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:692)
        at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
        at org.apache.hadoop.fs.FileContext.create(FileContext.java:698)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.writeCredentials(ResourceLocalizationService.java:1314)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1229)
{noformat}

Note that the failure in the stack trace above is in the chmod that runs 
immediately after a local file is created for an output stream, which implies 
that something asynchronously removed the file in between.  The test doesn't 
always fail with exactly this stack trace, but the common theme is an attempt 
to access the container tokens file at a point when it is missing.
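
To illustrate the create-then-chmod window described above, here is a minimal 
sketch in plain Java.  It is not the NodeManager code: it uses java.nio.file 
rather than Hadoop's RawLocalFileSystem, the class name, method name and file 
name are hypothetical, and the "asynchronous" removal is done inline so the 
outcome is deterministic.  It also assumes a POSIX filesystem.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermissions;

public class TokenFileRace {
    /**
     * Creates a file, deletes it (standing in for an asynchronous cleanup),
     * then tries to change its permissions.
     * Returns true if the permission change failed because the file was gone.
     */
    public static boolean chmodFailsAfterDelete() throws IOException {
        // Hypothetical stand-ins for the nmPrivate dir and the tokens file.
        Path dir = Files.createTempDirectory("nmPrivate");
        Path tokens = dir.resolve("container.tokens");
        try {
            // Step 1: create the file, as the output stream constructor does.
            Files.createFile(tokens);

            // Simulated interference: something else removes the file
            // between the create and the permission change.
            Files.delete(tokens);

            // Step 2: the permission change now targets a missing file.
            Files.setPosixFilePermissions(
                tokens, PosixFilePermissions.fromString("rw-------"));
            return false;
        } catch (NoSuchFileException e) {
            return true;
        } finally {
            Files.deleteIfExists(tokens);
            Files.deleteIfExists(dir);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(
            "chmod failed after delete: " + chmodFailsAfterDelete());
    }
}
```

In the real failure the removal would come from another thread or event 
handler, so the window is only hit occasionally, which is consistent with the 
test failing intermittently.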

> TestContainerManager#testLocalingResourceWhileContainerRunning occasionally 
> times out
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-8672
>                 URL: https://issues.apache.org/jira/browse/YARN-8672
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.2.0
>            Reporter: Jason Lowe
>            Priority: Major
>
> Precommit builds have been failing in 
> TestContainerManager#testLocalingResourceWhileContainerRunning.  I have been 
> able to reproduce the problem without any patch applied if I run the test 
> enough times.  It looks like something is removing container tokens from the 
> nmPrivate area just as a new localizer starts.


