[ 
https://issues.apache.org/jira/browse/TEZ-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533404#comment-14533404
 ] 

Siddharth Seth commented on TEZ-2426:
-------------------------------------

Alright. Have a theory on what's happening. Lots of threads involved. This 
ignores the LOG lines showing up in the wrong log files (assuming the logger 
doesn't guarantee ordering when logging from different threads).

- TaskEventRouter for 456 sees an error. (This can happen because of clean up / 
some fields not being volatile in inputContext).
- TaskEventRouter is swapped out.
- Task completes, sends out its success message (heartbeat).
- TaskEventRouter thread regains control - tries sending out the TaskFailed 
message. (This is all before the next task has started. It may or may not have 
got an interrupt by this point).
- Main thread falls off. Starts running another task. This thread can heartbeat 
since it doesn't synchronize with the previous task's heartbeats.
- The TaskEventRouter for 465 regains control. Goes into the IPC layer and 
tries sending the FAILED message (via a future). There's a context switch 
before the future.get(). The future runs. future.get() is interrupted, because 
the thread has seen its interrupt status by this point. Leads to the various 
errors in the logs.
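
The last step relies on how a blocking get() reacts to a pending interrupt. A 
minimal Java sketch of that JDK behaviour, not of the Tez or Hadoop IPC code: 
the class and method names are hypothetical, and the future is deliberately 
left pending so that get() has to wait.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; not part of Tez.
final class InterruptedGetDemo {
    // Returns true if get() on a still-pending future throws
    // InterruptedException when the calling thread is already interrupted.
    static boolean getThrowsWhenInterrupted() {
        // Never executed, so get() would have to block.
        FutureTask<String> pending = new FutureTask<>(() -> "FAILED sent");
        // Simulate the interrupt arriving before the router reaches get().
        Thread.currentThread().interrupt();
        try {
            pending.get(1, TimeUnit.SECONDS);
            return false;
        } catch (InterruptedException e) {
            return true;
        } catch (ExecutionException | TimeoutException e) {
            return false;
        } finally {
            // Clear the flag so the caller's thread is left in a clean state.
            Thread.interrupted();
        }
    }
}
```

The sketch only shows that the interrupt surfaces inside get(); whether the 
message had already left the process by then is a separate question.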

This doesn't however explain a status_update after the failed message is sent. 
Don't really see what can cause that.

Couple of things which need fixing here 
1) Join on the TaskEventRouter
2) Join on the last task's heartbeat thread
3) Fixes to *Context to revert fields back to final, or make them volatile
4) Avoid sending any more messages once any one final message has been sent.
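
A rough Java sketch of what fixes 1, 2 and 4 could look like. All class and 
method names here (TerminalMessageGuard, TaskShutdown, joinAll) are 
hypothetical and are not existing Tez APIs; messages are plain strings only to 
keep the example self-contained.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical guard for fix 4: once one final message (success or failure)
// has been sent, every later message for that task is dropped.
final class TerminalMessageGuard {
    private final Consumer<String> sender; // stand-in for the real transport
    private boolean terminalSent;          // guarded by "this"

    TerminalMessageGuard(Consumer<String> sender) {
        this.sender = sender;
    }

    // Sends a final message; returns false if one was already sent.
    synchronized boolean sendTerminal(String message) {
        if (terminalSent) {
            return false;
        }
        terminalSent = true;
        sender.accept(message);
        return true;
    }

    // Sends a non-final update; returns false after a final message.
    synchronized boolean sendUpdate(String message) {
        if (terminalSent) {
            return false;
        }
        sender.accept(message);
        return true;
    }
}

// Hypothetical helper for fixes 1 and 2: wait for the previous task's helper
// threads (event router, heartbeat) before the next task starts.
final class TaskShutdown {
    // Returns true only if every thread finished within the timeout.
    static boolean joinAll(long timeoutMillis, Thread... helpers) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            for (Thread t : helpers) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs > 0) {
                    t.join(remainingMs); // join(0) would wait forever, so skip it
                }
                if (t.isAlive()) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

Both guard methods are synchronized so that a status update cannot slip in 
between the check and the send of a final message.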

> Task input not complete before sending Task completed event
> -----------------------------------------------------------
>
>                 Key: TEZ-2426
>                 URL: https://issues.apache.org/jira/browse/TEZ-2426
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Bikas Saha
>            Priority: Critical
>         Attachments: am.log, container.log
>
>
> Sequence of events
> 1) Task A starts in a container
> 2) Task A complete event comes to AM
> 3) Task B starts in the same container
> 4) Task A's input calls some method on its context. Crashes with NPE
> 5) The crash sends an input failed event for Task A to the AM
> 6) Task A state machine crashes saying cannot handle failed after success
> In some cases, it could be that a status update event is also sent after 
> completion, though not sure if it's related to the failed event being sent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
