[jira] [Commented] (TEZ-2338) Tez job failed due to AM Container-Launch failure at windows

Hitesh Shah (JIRA) Mon, 20 Apr 2015 08:10:57 -0700

    [ https://issues.apache.org/jira/browse/TEZ-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502992#comment-14502992 ]


Hitesh Shah commented on TEZ-2338:
----------------------------------

No changes should be needed to run in yarn-tez mode. The RM and NM talk to 
each other over RPC, so there should be no file permission issues of the kind 
where the NM tries to read files from the RM host. The problem could instead 
be that you are running the NM and RM on the same host with some dirs 
configured wrongly. Check that the NM has full write permissions to both the 
yarn local dirs and the yarn log dirs.
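
One way to check the write permissions is to try creating a temporary file in 
each directory. The following is only a rough sketch in Python, not part of 
Hadoop or Tez. The function names are made up for illustration, and the 
example paths are placeholders: use the values of yarn.nodemanager.local-dirs 
and yarn.nodemanager.log-dirs from your own yarn-site.xml, and run it as the 
same user that the NM runs as.

```python
import os
import tempfile


def is_writable(directory):
    """Return True if a temporary file can be created (and removed) in `directory`."""
    try:
        # TemporaryFile creates the file inside `directory` and deletes it on close.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory as well as a permission failure.
        return False


def check_dirs(directories):
    """Map each directory to whether the current user can write to it."""
    return {d: is_writable(d) for d in directories}


if __name__ == "__main__":
    # Placeholder paths (assumptions); replace with your configured NM dirs.
    candidates = [
        os.path.join(tempfile.gettempdir(), "nm-local-dir"),
        os.path.join(tempfile.gettempdir(), "nm-log-dir"),
    ]
    for path, ok in check_dirs(candidates).items():
        print(path, "writable" if ok else "NOT writable or missing")
```

A directory reported as not writable or missing would be the first thing to 
fix before looking further.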

Try adding this to your yarn-site.xml:

{code}
    <property>
      <name>yarn.nodemanager.delete.debug-delay-sec</name>
      <value>1200</value>
    </property>
{code} 

This ensures that the launch_container.cmd behind the NM error is not deleted 
right away (it will remain for 20 minutes; increase 1200 if you need longer). 
You can then run that launch_container.cmd script manually from the container 
dir and see where it bails out.
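
When running the script by hand, it can help to look at the exit code in hex. 
The value -1073741515 in the quoted logs is a signed 32-bit number; read as 
unsigned it is 0xC0000135, which Windows documents as STATUS_DLL_NOT_FOUND. 
That hints, but does not prove, that something started by the script could 
not load a DLL it depends on. The sketch below, in Python, runs a command and 
reports its exit code in both forms. The function names are hypothetical, and 
the container dir in the comment is a placeholder to fill in from your NM log.

```python
import subprocess
import sys


def to_ntstatus_hex(exit_code):
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


def run_and_report(command, workdir=None):
    """Run `command` in `workdir` and return (exit_code, hex_form)."""
    result = subprocess.run(command, cwd=workdir)
    return result.returncode, to_ntstatus_hex(result.returncode)


if __name__ == "__main__":
    # The exit code reported in the quoted logs.
    print(to_ntstatus_hex(-1073741515))  # prints 0xC0000135

    # On the NM host one might run the retained script like this, where
    # container_dir is the container directory from the NM log (assumption):
    #   run_and_report(["cmd", "/c", "launch_container.cmd"], workdir=container_dir)
    code, hexed = run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(code, hexed)
```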


> Tez job failed due to AM Container-Launch failure at windows
> ------------------------------------------------------------
>
>                 Key: TEZ-2338
>                 URL: https://issues.apache.org/jira/browse/TEZ-2338
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>         Environment: Windows server 2012 and Windows-8
> Hadoop-2.5.2
> Java-1.7
>            Reporter: Kaveen Raajan
>
> I successfully built Tez-0.6.0 against Hadoop-2.5.2.
> Then I configured Tez-0.6.0 as described at http://tez.apache.org/install.html
> I moved the Tez lib package to an HDFS location and updated my tez-site.xml:
> {code:xml}
>  <property>
>     <name>tez.lib.uris</name>
> <value>${fs.default.name}/apps/Tez/,${fs.default.name}/apps/Tez/lib/</value>
>   </property>
> {code}
> After that I tried the sample test for Tez:
> _hadoop jar tez-examples-0.6.0.jar orderedwordcount <input> <output>_
> But I get the following error when running this command.
> *Note:* I'm using a Hadoop High Availability setup.
> {code}
> Running OrderedWordCount
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/C:/Hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/C:/Tez/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 15/04/15 10:47:57 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.6.0, revision=${buildNumber}, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2015-04-15T01:13:02Z ]
> 15/04/15 10:48:00 INFO client.TezClient: Submitting DAG application with id: application_1429073725727_0005
> 15/04/15 10:48:00 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
> 15/04/15 10:48:00 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs://HACluster/apps/Tez/,hdfs://HACluster/apps/Tez/lib/
> 15/04/15 10:48:01 INFO client.TezClient: Stage directory /tmp/app/tez/staging doesn't exist and is created
> 15/04/15 10:48:01 INFO client.TezClient: Tez system stage directory hdfs://HACluster/tmp/app/tez/staging/.tez/application_1429073725727_0005 doesn't exist and is created
> 15/04/15 10:48:02 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1429073725727_0005, dagName=OrderedWordCount
> 15/04/15 10:48:03 INFO impl.YarnClientImpl: Submitted application application_1429073725727_0005
> 15/04/15 10:48:03 INFO client.TezClient: The url to track the Tez AM: http://MASTER_NN1:8088/proxy/application_1429073725727_0005/
> 15/04/15 10:48:03 INFO client.DAGClientImpl: Waiting for DAG to start running
> 15/04/15 10:48:09 INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
> OrderedWordCount failed with diagnostics: [Application application_1429073725727_0005 failed 2 times due to AM Container for appattempt_1429073725727_0005_000002 exited with  exitCode: -1073741515 due to: Exception from container-launch: ExitCodeException exitCode=-1073741515:
> ExitCodeException exitCode=-1073741515:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>         at org.apache.hadoop.util.Shell.run(Shell.java:455)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
>         1 file(s) moved.
> Container exited with a non-zero exit code -1073741515
> .Failing this attempt.. Failing the application.]
> {code}
> The ResourceManager log shows:
> {code}
> 2015-04-19 21:49:57,533 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> completedContainer container=Container: [ContainerId: 
> container_1429505171727_0001_02_000001, NodeId: SLAVE1:57794, 
> NodeHttpAddress: SLAVE1:8042, Resource: <memory:2048, vCores:1>, Priority: 0, 
> Token: Token { kind: ContainerToken, service: 172.16.100.92:57794 }, ] 
> queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:0, 
> vCores:0>, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=1, 
> numContainers=0 cluster=<memory:8192, vCores:8>
> 2015-04-19 21:49:57,533 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> completedContainer queue=root usedCapacity=0.0 absoluteUsedCapacity=0.0 
> used=<memory:0, vCores:0> cluster=<memory:8192, vCores:8>
> 2015-04-19 21:49:57,533 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Re-sorting completed queue: root.default stats: default: capacity=1.0, 
> absoluteCapacity=1.0, usedResources=<memory:0, vCores:0>, usedCapacity=0.0, 
> absoluteUsedCapacity=0.0, numApps=1, numContainers=0
> 2015-04-19 21:49:57,533 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application attempt appattempt_1429505171727_0001_000002 released container 
> container_1429505171727_0001_02_000001 on node: host: SLAVE1:57794 
> #containers=0 available=8192 used=0 with event: FINISHED
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Watcher event type: NodeDataChanged with state:UserConnected for 
> path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1429505171727_0001/appattempt_1429505171727_0001_000002
>  for Service 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Unregistering app attempt : appattempt_1429505171727_0001_000002
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1429505171727_0001_000002 State change from FINAL_SAVING to FAILED
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating 
> application application_1429505171727_0001 with final state: FAILED
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1429505171727_0001 State change from ACCEPTED to FINAL_SAVING
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Updating 
> info for app: application_1429505171727_0001
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Application Attempt appattempt_1429505171727_0001_000002 is done. 
> finalState=FAILED
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: 
> Application application_1429505171727_0001 requests cleared
> 2015-04-19 21:49:57,580 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
> Application removed - appId: application_1429505171727_0001 user: SYSTEM 
> queue: default #user-pending-applications: 0 #user-active-applications: 0 
> #queue-pending-applications: 0 #queue-active-applications: 0
> 2015-04-19 21:49:57,611 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> Watcher event type: NodeDataChanged with state:UserConnected for 
> path:/rmstore/ZKRMStateRoot/RMAppRoot/application_1429505171727_0001 for 
> Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore 
> in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: 
> STARTED
> 2015-04-19 21:49:57,611 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
> application_1429505171727_0001 failed 2 times due to AM Container for 
> appattempt_1429505171727_0001_000002 exited with  exitCode: -1073741515 due 
> to: Exception from container-launch: ExitCodeException exitCode=-1073741515: 
> ExitCodeException exitCode=-1073741515: 
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>       at org.apache.hadoop.util.Shell.run(Shell.java:455)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
>         1 file(s) moved.
> Container exited with a non-zero exit code -1073741515
> .Failing this attempt.. Failing the application.
> 2015-04-19 21:49:57,627 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1429505171727_0001 State change from FINAL_SAVING to FAILED
> 2015-04-19 21:49:57,627 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> Application removed - appId: application_1429505171727_0001 user: SYSTEM 
> leaf-queue of parent: root #applications: 0
> 2015-04-19 21:49:57,627 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=SYSTEM 
> OPERATION=Application Finished - Failed TARGET=RMAppManager     
> RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED       
> PERMISSIONS=Application application_1429505171727_0001 failed 2 times due to 
> AM Container for appattempt_1429505171727_0001_000002 exited with  exitCode: 
> -1073741515 due to: Exception from container-launch: ExitCodeException 
> exitCode=-1073741515: 
> ExitCodeException exitCode=-1073741515: 
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>       at org.apache.hadoop.util.Shell.run(Shell.java:455)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
>         1 file(s) moved.
> Container exited with a non-zero exit code -1073741515
> .Failing this attempt.. Failing the application.      
> APPID=application_1429505171727_0001
> 2015-04-19 21:49:57,627 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary:
>  
> appId=application_1429505171727_0001,name=OrderedWordCount,user=SYSTEM,queue=default,state=FAILED,trackingUrl=http://MASTER_NN1:8088/cluster/app/application_1429505171727_0001,appMasterHost=N/A,startTime=1429505386589,finishTime=1429505397580,finalStatus=FAILED
> 2015-04-19 21:49:58,580 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 
> for port 8032: readAndProcess from client 172.16.100.XX threw exception 
> [java.io.IOException: An existing connection was forcibly closed by the 
> remote host]
> {code}
> The NodeManager log shows:
> {code}
> 2015-04-20 10:19:59,365 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: 
> launchContainer: [C:\Hadoop\bin\winutils.exe, task, create, 
> container_1429505171727_0001_02_000001, cmd /c 
> /tmp/hadoop-SLAVE1$/nm-local-dir/usercache/SYSTEM/appcache/application_1429505171727_0001/container_1429505171727_0001_02_000001/default_container_executor.cmd]
> 2015-04-20 10:19:59,436 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code 
> from container container_1429505171727_0001_02_000001 is : -1073741515
> 2015-04-20 10:19:59,437 WARN 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception 
> from container-launch with container ID: 
> container_1429505171727_0001_02_000001 and exit code: -1073741515
> ExitCodeException exitCode=-1073741515: 
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>       at org.apache.hadoop.util.Shell.run(Shell.java:455)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:744)
> 2015-04-20 10:19:59,438 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:         1 
> file(s) moved.
> 2015-04-20 10:19:59,439 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Container exited with a non-zero exit code -1073741515
> 2015-04-20 10:19:59,439 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1429505171727_0001_02_000001 transitioned from RUNNING 
> to EXITED_WITH_FAILURE
> 2015-04-20 10:19:59,440 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>  Cleaning up container container_1429505171727_0001_02_000001
> 2015-04-20 10:19:59,480 INFO 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting 
> absolute path : 
> /tmp/hadoop-SLAVE1$/nm-local-dir/usercache/SYSTEM/appcache/application_1429505171727_0001/container_1429505171727_0001_02_000001
> 2015-04-20 10:19:59,480 WARN 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=SYSTEM     
> OPERATION=Container Finished - Failed   TARGET=ContainerImpl    
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1429505171727_0001    
> CONTAINERID=container_1429505171727_0001_02_000001
> 2015-04-20 10:19:59,481 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
>  Container container_1429505171727_0001_02_000001 transitioned from 
> EXITED_WITH_FAILURE to DONE
> 2015-04-20 10:19:59,481 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
>  Removing container_1429505171727_0001_02_000001 from application 
> application_1429505171727_0001
> 2015-04-20 10:19:59,481 INFO 
> org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: ProcfsBasedProcessTree 
> currently is supported only on Linux.
> {code}
> The problem might be that the NodeManager is unable to handshake with the 
> ResourceManager when connecting.
> If I try this on a single-node Hadoop cluster, it works correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
