[ https://issues.apache.org/jira/browse/TEZ-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184052#comment-15184052 ]
Hitesh Shah commented on TEZ-2863: ---------------------------------- +1. > Container, node, and logs not available in UI for tasks that fail to launch > --------------------------------------------------------------------------- > > Key: TEZ-2863 > URL: https://issues.apache.org/jira/browse/TEZ-2863 > Project: Apache Tez > Issue Type: Bug > Reporter: Jonathan Eagles > Assignee: Jonathan Eagles > Attachments: TEZ-2863.1.patch, TEZ-2863.2-branch-0.7.patch, > TEZ-2863.2.patch, TEZ-2863.3-branch-0.7.patch, > TEZ-2863.3-branch-0.7.patch.addendum, TEZ-2863.3.patch, > TEZ-2863.3.patch.addendum, TEZ-2863.4-branch-0.7.patch, TEZ-2863.4.patch, > TEZ-2863.5-branch-0.7.patch, TEZ-2863.5.patch > > > While running a sample tez job > {noformat} > tez-examples-*.jar orderedwordcount -Dtez.task.resource.memory.mb=1 > -Dtez.task.launch.cmd-opts="-Xmx1m" input output > {noformat} > It was noticed that the Tez UI task attempt > http://timelineserverhost:port/ws/v1/timeline/TEZ_TASK_ATTEMPT_ID/attempt_id > was missing the TEZ_ATTEMPT_STARTED event > {noformat} > 2015-10-01 10:03:55,344 [INFO] [Dispatcher thread {Central}] > |history.HistoryEventHandler|: > [HISTORY][DAG:dag_1443711816411_0001_1][Event:TASK_STARTED]: > vertexName=Tokenizer, taskId=task_1443711816411_0001_1_00_000000, > scheduledTime=1443711835342, launchTime=1443711835342 > 2015-10-01 10:03:55,346 [INFO] [Dispatcher thread {Central}] > |util.RackResolver|: Resolved localhost to /default-rack > 2015-10-01 10:03:55,356 [INFO] [TaskSchedulerEventHandlerThread] > |util.RackResolver|: Resolved localhost to /default-rack > 2015-10-01 10:03:55,364 [INFO] [TaskSchedulerEventHandlerThread] > |rm.YarnTaskSchedulerService|: Allocation request for task: > attempt_1443711816411_0001_1_00_000000_0 with request: Capability[<memory:1, > vCores:1>]Priority[2] host: localhost rack: null > 2015-10-01 10:03:56,639 [INFO] [AMRM Heartbeater thread] > |impl.AMRMClientImpl|: Received new token for : localhost:57381 > 2015-10-01 10:03:56,646 [INFO] [AMRM Callback Handler Thread] > |util.RackResolver|: Resolved localhost to /default-rack > 2015-10-01 10:03:56,648 [INFO] [DelayedContainerManager] > |rm.YarnTaskSchedulerService|: Assigning container to task: > containerId=container_1443711816411_0001_01_000002, > task=attempt_1443711816411_0001_1_00_000000_0, containerHost=localhost:57381, > containerPriority= 2, containerResources=<memory:1024, vCores:1>, > localityMatchType=NodeLocal, matchedLocation=localhost, > honorLocalityFlags=true, reusedContainer=false, delayedContainers=0 > 2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: > Resolved localhost to /default-rack > 2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: > Resolved localhost to /default-rack > 2015-10-01 10:03:56,686 [INFO] [TaskSchedulerAppCaller #0] > |node.AMNodeTracker|: Adding new node: localhost:57381 > 2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] > |launcher.ContainerLauncherImpl|: Launching > container_1443711816411_0001_01_000002 > 2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] > |impl.ContainerManagementProtocolProxy|: Opening proxy : localhost:57381 > 2015-10-01 10:03:56,741 [INFO] [ContainerLauncher #0] > |history.HistoryEventHandler|: [HISTORY][DAG:N/A][Event:CONTAINER_LAUNCHED]: > containerId=container_1443711816411_0001_01_000002, launchTime=1443711836741 > 2015-10-01 10:03:57,647 [INFO] [AMRM Callback Handler Thread] > |rm.YarnTaskSchedulerService|: Allocated container > completed:container_1443711816411_0001_01_000002 last allocated to task: > attempt_1443711816411_0001_1_00_000000_0 > 2015-10-01 10:03:57,648 [INFO] [Dispatcher thread {Central}] > |container.AMContainerImpl|: Container container_1443711816411_0001_01_000002 > exited with diagnostics set to Container failed, exitCode=1. Exception from > container-launch. > Container id: container_1443711816411_0001_01_000002 > Exit code: 1 > Stack trace: ExitCodeException exitCode=1: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) > at org.apache.hadoop.util.Shell.run(Shell.java:455) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Container exited with a non-zero exit code 1 > 2015-10-01 10:03:57,649 [INFO] [Dispatcher thread {Central}] > |history.HistoryEventHandler|: > [HISTORY][DAG:dag_1443711816411_0001_1][Event:CONTAINER_STOPPED]: > containerId=container_1443711816411_0001_01_000002, > stoppedTime=1443711837649, exitStatus=1 > 2015-10-01 10:03:57,652 [INFO] [Dispatcher thread {Central}] > |history.HistoryEventHandler|: > [HISTORY][DAG:dag_1443711816411_0001_1][Event:TASK_ATTEMPT_FINISHED]: > vertexName=Tokenizer, taskAttemptId=attempt_1443711816411_0001_1_00_000000_0, > creationTime=1443711835341, allocationTime=0, startTime=0, > finishTime=1443711837650, timeTaken=1443711837650, status=FAILED, > errorEnum=CONTAINER_EXITED, diagnostics=Container > container_1443711816411_0001_01_000002 finished with diagnostics set to > [Container failed, exitCode=1. Exception from container-launch. > Container id: container_1443711816411_0001_01_000002 > Exit code: 1 > Stack trace: ExitCodeException exitCode=1: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:538) > at org.apache.hadoop.util.Shell.run(Shell.java:455) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Container exited with a non-zero exit code 1 > ], counters=Counters: 0 > 2015-10-01 10:03:57,653 [INFO] [Dispatcher thread {Central}] > |counters.Limits|: Counter limits initialized with parameters: > GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64, MAX_COUNTERS=1200 > 2015-10-01 10:03:57,657 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: > Scheduling new attempt for task: task_1443711816411_0001_1_00_000000, > currentFailedAttempts: 1, maxFailedAttempts: 4 > 2015-10-01 10:03:57,658 [INFO] [TaskSchedulerEventHandlerThread] > |rm.YarnTaskSchedulerService|: Ignoring removal of unknown task: > attempt_1443711816411_0001_1_00_000000_0 > 2015-10-01 10:03:57,658 [INFO] [TaskSchedulerEventHandlerThread] > |rm.TaskSchedulerEventHandler|: Task: > attempt_1443711816411_0001_1_00_000000_0 has no container assignment in the > scheduler > 2015-10-01 10:03:57,658 [INFO] [Dispatcher thread {Central}] > |impl.VertexImpl|: Source task attempt completed for vertex: > vertex_1443711816411_0001_1_01 [Summation] attempt: > attempt_1443711816411_0001_1_00_000000_0 with state: FAILED vertexState: > RUNNING > {noformat} > The TASK_ATTEMPTED_STARTED event contains the inProgressURL, however. It > makes it very difficult for the user to debug their bad jvm args. > The _stdout_ for the failed container has the launch failure reason in this > sceario > {noformat} > Error occurred during initialization of VM > Too small initial heap for new size specified > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)