[ 
https://issues.apache.org/jira/browse/TEZ-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059331#comment-14059331
 ] 

Siddharth Seth commented on TEZ-1122:
-------------------------------------

canCommit can be a little strange to use. It returns true if a task can commit. 
false, however, is ambiguous: it implies that the task attempt is either not 
running, or that some other task attempt has started up, etc. This doesn't 
matter much since we don't have speculation. It likely only kicks in if one of 
the tasks gets lost, in which case the RUNNING state check kills the lost task 
(which would already be killed/failed if a second attempt is running).
canCommit could be tri-state (COMMIT, NOT_READY_YET, RACE_LOST) so that the 
tasks polling it know how to proceed.
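
To make the tri-state idea concrete, here is a minimal sketch. It is not the 
Tez implementation: CommitResponse, SketchTaskState and CanCommitSketch are 
hypothetical names, the state set is simplified, and mapping every 
post-RUNNING state to RACE_LOST is an assumption made for illustration.

```java
import java.util.Objects;

// Hypothetical tri-state answer for canCommit (names taken from the comment above).
enum CommitResponse { COMMIT, NOT_READY_YET, RACE_LOST }

// Simplified stand-in for the task states; NOT the real Tez state machine.
enum SketchTaskState { SCHEDULED, RUNNING, SUCCEEDED, KILLED, FAILED }

final class CanCommitSketch {
    private SketchTaskState state = SketchTaskState.SCHEDULED;
    private String commitAttempt; // first attempt that asked while RUNNING, if any

    synchronized void setState(SketchTaskState newState) {
        state = Objects.requireNonNull(newState);
    }

    synchronized CommitResponse canCommit(String attemptId) {
        Objects.requireNonNull(attemptId);
        switch (state) {
            case SCHEDULED:
                // The task has not caught up with the attempt yet: poll again later.
                return CommitResponse.NOT_READY_YET;
            case RUNNING:
                // The first attempt to ask wins; any other attempt lost the race.
                if (commitAttempt == null) {
                    commitAttempt = attemptId;
                }
                return commitAttempt.equals(attemptId)
                        ? CommitResponse.COMMIT
                        : CommitResponse.RACE_LOST;
            default:
                // Assumption: once past RUNNING, no attempt may commit any more.
                return CommitResponse.RACE_LOST;
        }
    }
}
```

With something like this, a polling task could keep retrying on NOT_READY_YET 
and give up on RACE_LOST, instead of being killed on the first false.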

For this jira itself, I don't particularly like the idea of making a sync call 
into the state machine from an RPC thread, even though it works. Should we 
just remove the condition altogether, or change the condition to ensure the 
task has not moved past the RUNNING state (KILL, KILL_IN_PROGRESS, etc.)?
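
As a rough illustration of the second option, the check could reject only 
states that are past RUNNING rather than requiring RUNNING itself. This is a 
sketch under assumptions: GuardState and CommitGuardSketch are hypothetical 
names, and the exact set of "past RUNNING" states is a guess, not taken from 
the Tez source.

```java
import java.util.EnumSet;

// Simplified stand-in for the states involved; NOT the real Tez enum.
enum GuardState { NEW, SCHEDULED, RUNNING, KILL_IN_PROGRESS, KILLED, FAILED, SUCCEEDED }

final class CommitGuardSketch {
    // Assumption: these are the states that count as "past RUNNING".
    private static final EnumSet<GuardState> PAST_RUNNING = EnumSet.of(
            GuardState.KILL_IN_PROGRESS,
            GuardState.KILLED,
            GuardState.FAILED,
            GuardState.SUCCEEDED);

    private CommitGuardSketch() {}

    // True unless the task has already moved past RUNNING. A task that is still
    // SCHEDULED is let through, so an early canCommit does not trigger a kill.
    static boolean mayConsiderCommit(GuardState state) {
        return !PAST_RUNNING.contains(state);
    }
}
```

The point of the design is that a canCommit arriving while the task is still 
SCHEDULED is no longer treated as a bad commit attempt.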

> Race between canCommit and Task moving into RUNNING state
> ---------------------------------------------------------
>
>                 Key: TEZ-1122
>                 URL: https://issues.apache.org/jira/browse/TEZ-1122
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.4.0
>            Reporter: Siddharth Seth
>            Assignee: Jeff Zhang
>            Priority: Critical
>         Attachments: Tez-1122.patch
>
>
> A task moves into RUNNING state via async events generated after a 
> TaskAttempt moves into RUNNING state, which is triggered by getTask().
> canCommit() is a synchronous call on the umbilical, so for short-running 
> tasks a canCommit can come in before the async events are handled.
> {code}
> 2014-05-15 13:21:15,531 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.app.dag.impl.TaskAttemptImpl: TaskAttempt: 
> [attempt_1400183444139_0007_1_00_000000_0] started. Is using containerId: 
> [container_1400183444139_0007_01_000002] on NM: []
> 2014-05-15 13:21:15,533 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1400183444139_0007_1][Event:TASK_ATTEMPT_STARTED]: 
> vertexName=datagen, taskAttemptId=attempt_1400183444139_0007_1_00_000000_0, 
> startTime=1400185273335, containerId=container_1400183444139_0007_01_000002, 
> nodeId=, 
> inProgressLogs=/node/containerlogs/container_1400183444139_0007_01_000002/, 
> completedLogs=localhost:19888/jobhistory/logs///container_1400183444139_0007_01_000002/v_datagen_attempt_1400183444139_0007_1_00_000000_0/
> 2014-05-15 13:21:15,534 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.app.dag.impl.TaskAttemptImpl: 
> attempt_1400183444139_0007_1_00_000000_0 TaskAttempt Transitioned from 
> START_WAIT to RUNNING due to event TA_STARTED_REMOTELY
> 2014-05-15 13:21:15,534 INFO [IPC Server handler 6 on 61779] 
> org.apache.tez.dag.app.dag.impl.TaskImpl: Task not running. Issuing kill to 
> bad commit attempt attempt_1400183444139_0007_1_00_000000_0
> 2014-05-15 13:21:15,534 INFO [AMRM Callback Handler Thread] 
> org.apache.tez.dag.app.rm.TaskScheduler: App total resource memory: 0 cpu: -1 
> taskAllocations: 1
> 2014-05-15 13:21:15,537 INFO [AsyncDispatcher event handler] 
> org.apache.tez.common.counters.Limits: Counter limits initialized with 
> parameters:  GROUP_NAME_MAX=128, MAX_GROUPS=500, COUNTER_NAME_MAX=64, 
> MAX_COUNTERS=1200
> 2014-05-15 13:21:15,541 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.app.dag.impl.TaskImpl: task_1400183444139_0007_1_00_000000 
> Task Transitioned from SCHEDULED to RUNNING
> 2014-05-15 13:21:15,544 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.history.HistoryEventHandler: 
> [HISTORY][DAG:dag_1400183444139_0007_1][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=datagen, taskAttemptId=attempt_1400183444139_0007_1_00_000000_0, 
> startTime=1400185273335, finishTime=1400185275542, timeTaken=2207, 
> status=KILLED, diagnostics=, counters=Counters: 0
> 2014-05-15 13:21:15,544 INFO [AsyncDispatcher event handler] 
> org.apache.tez.dag.app.dag.impl.TaskAttemptImpl: 
> attempt_1400183444139_0007_1_00_000000_0 TaskAttempt Transitioned from 
> RUNNING to KILL_IN_PROGRESS due to event TA_KILL_REQUEST
> 2014-05-15 13:21:15,546 INFO [TaskSchedulerEventHandlerThread] 
> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: Processing the event 
> EventType: S_TA_ENDED
> {code}



