[ 
https://issues.apache.org/jira/browse/TEZ-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176541#comment-15176541
 ] 

Hitesh Shah edited comment on TEZ-2863 at 3/2/16 9:46 PM:
----------------------------------------------------------

[~jeagles] The same event object is used for both recovery and history but the 
data written out is different. The general approach usually has been that if 
something is needed for recovery but can be constructed from other data, it 
does not need to be written to recovery. For example, at the task attempt 
level, vertexname and the log urls would be re-constructible if needed. 

bq. Also, I don't have an exact understanding for what fails by not having 
these correctly filled out, since this patch works in practice for filling out 
the in progress and completed urls. Must be related to recovery. Can you help 
to explain?

The main thing is that after recovering state, the object should not end up in 
a situation where it expects a particular field to be non-null but ends up 
being null. 


was (Author: hitesh):
[~jeagles] The same event object is used for both recovery and history but the 
data written out is different. The general approach usually has been that if 
something is needed for recovery but can be constructed from other data, it 
does not need to be written to recovery. For example, at the task attempt 
level, vertexname and the log urls would be re-constructible if needed. 

> Container, node, and logs not available in UI for tasks that fail to launch
> ---------------------------------------------------------------------------
>
>                 Key: TEZ-2863
>                 URL: https://issues.apache.org/jira/browse/TEZ-2863
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jonathan Eagles
>            Assignee: Jonathan Eagles
>         Attachments: TEZ-2863.1.patch, TEZ-2863.2-branch-0.7.patch, 
> TEZ-2863.2.patch, TEZ-2863.3-branch-0.7.patch, 
> TEZ-2863.3-branch-0.7.patch.addendum, TEZ-2863.3.patch, 
> TEZ-2863.3.patch.addendum, TEZ-2863.4-branch-0.7.patch, TEZ-2863.4.patch
>
>
> While running a sample tez job
> {noformat}
> tez-examples-*.jar orderedwordcount -Dtez.task.resource.memory.mb=1 
> -Dtez.task.launch.cmd-opts="-Xmx1m" input output
> {noformat}
> It was noticed that the Tez UI task attempt 
> http://timelineserverhost:port/ws/v1/timeline/TEZ_TASK_ATTEMPT_ID/attempt_id 
> was missing the TEZ_ATTEMPT_STARTED event
> {noformat}
> 2015-10-01 10:03:55,344 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1443711816411_0001_1][Event:TASK_STARTED]: 
> vertexName=Tokenizer, taskId=task_1443711816411_0001_1_00_000000, 
> scheduledTime=1443711835342, launchTime=1443711835342
> 2015-10-01 10:03:55,346 [INFO] [Dispatcher thread {Central}] 
> |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:55,356 [INFO] [TaskSchedulerEventHandlerThread] 
> |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:55,364 [INFO] [TaskSchedulerEventHandlerThread] 
> |rm.YarnTaskSchedulerService|: Allocation request for task: 
> attempt_1443711816411_0001_1_00_000000_0 with request: Capability[<memory:1, 
> vCores:1>]Priority[2] host: localhost rack: null
> 2015-10-01 10:03:56,639 [INFO] [AMRM Heartbeater thread] 
> |impl.AMRMClientImpl|: Received new token for : localhost:57381
> 2015-10-01 10:03:56,646 [INFO] [AMRM Callback Handler Thread] 
> |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:56,648 [INFO] [DelayedContainerManager] 
> |rm.YarnTaskSchedulerService|: Assigning container to task: 
> containerId=container_1443711816411_0001_01_000002, 
> task=attempt_1443711816411_0001_1_00_000000_0, containerHost=localhost:57381, 
> containerPriority= 2, containerResources=<memory:1024, vCores:1>, 
> localityMatchType=NodeLocal, matchedLocation=localhost, 
> honorLocalityFlags=true, reusedContainer=false, delayedContainers=0
> 2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: 
> Resolved localhost to /default-rack
> 2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: 
> Resolved localhost to /default-rack
> 2015-10-01 10:03:56,686 [INFO] [TaskSchedulerAppCaller #0] 
> |node.AMNodeTracker|: Adding new node: localhost:57381
> 2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] 
> |launcher.ContainerLauncherImpl|: Launching 
> container_1443711816411_0001_01_000002
> 2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] 
> |impl.ContainerManagementProtocolProxy|: Opening proxy : localhost:57381
> 2015-10-01 10:03:56,741 [INFO] [ContainerLauncher #0] 
> |history.HistoryEventHandler|: [HISTORY][DAG:N/A][Event:CONTAINER_LAUNCHED]: 
> containerId=container_1443711816411_0001_01_000002, launchTime=1443711836741
> 2015-10-01 10:03:57,647 [INFO] [AMRM Callback Handler Thread] 
> |rm.YarnTaskSchedulerService|: Allocated container 
> completed:container_1443711816411_0001_01_000002 last allocated to task: 
> attempt_1443711816411_0001_1_00_000000_0
> 2015-10-01 10:03:57,648 [INFO] [Dispatcher thread {Central}] 
> |container.AMContainerImpl|: Container container_1443711816411_0001_01_000002 
> exited with diagnostics set to Container failed, exitCode=1. Exception from 
> container-launch.
> Container id: container_1443711816411_0001_01_000002
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1: 
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>       at org.apache.hadoop.util.Shell.run(Shell.java:455)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> 2015-10-01 10:03:57,649 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1443711816411_0001_1][Event:CONTAINER_STOPPED]: 
> containerId=container_1443711816411_0001_01_000002, 
> stoppedTime=1443711837649, exitStatus=1
> 2015-10-01 10:03:57,652 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1443711816411_0001_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Tokenizer, taskAttemptId=attempt_1443711816411_0001_1_00_000000_0, 
> creationTime=1443711835341, allocationTime=0, startTime=0, 
> finishTime=1443711837650, timeTaken=1443711837650, status=FAILED, 
> errorEnum=CONTAINER_EXITED, diagnostics=Container 
> container_1443711816411_0001_01_000002 finished with diagnostics set to 
> [Container failed, exitCode=1. Exception from container-launch.
> Container id: container_1443711816411_0001_01_000002
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1: 
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>       at org.apache.hadoop.util.Shell.run(Shell.java:455)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> ], counters=Counters: 0
> 2015-10-01 10:03:57,653 [INFO] [Dispatcher thread {Central}] 
> |counters.Limits|: Counter limits initialized with parameters:  
> GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64, MAX_COUNTERS=1200
> 2015-10-01 10:03:57,657 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: 
> Scheduling new attempt for task: task_1443711816411_0001_1_00_000000, 
> currentFailedAttempts: 1, maxFailedAttempts: 4
> 2015-10-01 10:03:57,658 [INFO] [TaskSchedulerEventHandlerThread] 
> |rm.YarnTaskSchedulerService|: Ignoring removal of unknown task: 
> attempt_1443711816411_0001_1_00_000000_0
> 2015-10-01 10:03:57,658 [INFO] [TaskSchedulerEventHandlerThread] 
> |rm.TaskSchedulerEventHandler|: Task: 
> attempt_1443711816411_0001_1_00_000000_0 has no container assignment in the 
> scheduler
> 2015-10-01 10:03:57,658 [INFO] [Dispatcher thread {Central}] 
> |impl.VertexImpl|: Source task attempt completed for vertex: 
> vertex_1443711816411_0001_1_01 [Summation] attempt: 
> attempt_1443711816411_0001_1_00_000000_0 with state: FAILED vertexState: 
> RUNNING
> {noformat}
> The TASK_ATTEMPTED_STARTED event contains the inProgressURL, however. It 
> makes it very difficult for the user to debug their bad jvm args.
> The _stdout_ for the failed container has the launch failure reason in this 
> sceario
> {noformat} 
> Error occurred during initialization of VM
> Too small initial heap for new size specified
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to